desmodus1984 closed 3 months ago
If you compare your file before and after conversion, you will see that AGAT creates the missing gene features.
It assigns generated IDs such as agat-gene-4
because the features present (e.g. tRNA) do not state the name of their Parent.
The input GFF3 lines are shown first, followed by the AGAT output:
BK063639.1 tpg tRNA 1173 1242 . - . ID=rna-BK063639.1:1173..1242;gbkey=tRNA;product=tRNA-Met
BK063639.1 tpg exon 1173 1242 . - . ID=exon-BK063639.1:1173..1242-1;Parent=rna-BK063639.1:1173..1242;gbkey=tRNA;product=tRNA-Met
BK063639.1 tpg tRNA 1350 1417 . + . ID=rna-BK063639.1:1350..1417;gbkey=tRNA;product=tRNA-Ala
BK063639.1 tpg exon 1350 1417 . + . ID=exon-BK063639.1:1350..1417-1;Parent=rna-BK063639.1:1350..1417;gbkey=tRNA;product=tRNA-Ala
BK063639.1 tpg tRNA 1556 1625 . + . ID=rna-BK063639.1:1556..1625;gbkey=tRNA;product=tRNA-Met
BK063639.1 tpg exon 1556 1625 . + . ID=exon-BK063639.1:1556..1625-1;Parent=rna-BK063639.1:1556..1625;gbkey=tRNA;product=tRNA-Met
BK063639.1 tpg tRNA 1637 1705 . + . ID=rna-BK063639.1:1637..1705;gbkey=tRNA;product=tRNA-Ile
BK063639.1 tpg exon 1637 1705 . + . ID=exon-BK063639.1:1637..1705-1;Parent=rna-BK063639.1:1637..1705;gbkey=tRNA;product=tRNA-Ile
BK063639.1 tpg CDS 1790 2779 . + 0 ID=cds-DBA43806.1;Dbxref=NCBI_GP:DBA43806.1;Name=DBA43806.1;gbkey=CDS;product=ND2;protein_id=DBA43806.1;transl_table=5
BK063639.1 tpg tRNA 2768 2840 . - . ID=rna-BK063639.1:2768..2840;gbkey=tRNA;product=tRNA-Cys
BK063639.1 tpg exon 2768 2840 . - . ID=exon-BK063639.1:2768..2840-1;Parent=rna-BK063639.1:2768..2840;gbkey=tRNA;product=tRNA-Cys
BK063639.1 tpg tRNA 2861 2929 . - . ID=rna-BK063639.1:2861..2929;gbkey=tRNA;product=tRNA-Tyr
BK063639.1 tpg exon 2861 2929 . - . ID=exon-BK063639.1:2861..2929-1;Parent=rna-BK063639.1:2861..2929;gbkey=tRNA;product=tRNA-Tyr
BK063639.1 tpg tRNA 3030 3098 . - . ID=rna-BK063639.1:3030..3098;gbkey=tRNA;product=tRNA-Trp
BK063639.1 tpg exon 3030 3098 . - . ID=exon-BK063639.1:3030..3098-1;Parent=rna-BK063639.1:3030..3098;gbkey=tRNA;product=tRNA-Trp
BK063639.1 tpg CDS 3114 4658 . + 0 ID=cds-DBA43807.1;Dbxref=NCBI_GP:DBA43807.1;Name=DBA43807.1;gbkey=CDS;product=COX1;protein_id=DBA43807.1;transl_table=5
BK063639.1 tpg tRNA 4708 4774 . + . ID=rna-BK063639.1:4708..4774;gbkey=tRNA;product=tRNA-Leu
BK063639.1 tpg exon 4708 4774 . + . ID=exon-BK063639.1:4708..4774-1;Parent=rna-BK063639.1:4708..4774;gbkey=tRNA;product=tRNA-Leu
BK063639.1 tpg CDS 4835 5510 . + 0 ID=cds-DBA43808.1;Dbxref=NCBI_GP:DBA43808.1;Name=DBA43808.1;Note=TAA stop codon is completed by the addition of 3' A residues to the mRNA;gbkey=CDS;product=COX2;protein_id=DBA43808.1;transl_except=(pos:5510..5510%2Caa:TERM);transl_table=5
BK063639.1 AGAT gene 1173 1242 . - . gene_id "agat-gene-1"; ID "agat-gene-1"; gbkey "tRNA"; product "tRNA-Met";
BK063639.1 tpg tRNA 1173 1242 . - . gene_id "agat-gene-1"; transcript_id "rna-BK063639.1:1173..1242"; ID "rna-BK063639.1:1173..1242"; Parent "agat-gene-1"; gbkey "tRNA"; product "tRNA-Met";
BK063639.1 tpg exon 1173 1242 . - . gene_id "agat-gene-1"; transcript_id "rna-BK063639.1:1173..1242"; ID "exon-BK063639.1:1173..1242-1"; Parent "rna-BK063639.1:1173..1242"; gbkey "tRNA"; product "tRNA-Met";
BK063639.1 AGAT gene 1350 1417 . + . gene_id "agat-gene-2"; ID "agat-gene-2"; gbkey "tRNA"; product "tRNA-Ala";
BK063639.1 tpg tRNA 1350 1417 . + . gene_id "agat-gene-2"; transcript_id "rna-BK063639.1:1350..1417"; ID "rna-BK063639.1:1350..1417"; Parent "agat-gene-2"; gbkey "tRNA"; product "tRNA-Ala";
BK063639.1 tpg exon 1350 1417 . + . gene_id "agat-gene-2"; transcript_id "rna-BK063639.1:1350..1417"; ID "exon-BK063639.1:1350..1417-1"; Parent "rna-BK063639.1:1350..1417"; gbkey "tRNA"; product "tRNA-Ala";
BK063639.1 AGAT gene 1556 1625 . + . gene_id "agat-gene-3"; ID "agat-gene-3"; gbkey "tRNA"; product "tRNA-Met";
BK063639.1 tpg tRNA 1556 1625 . + . gene_id "agat-gene-3"; transcript_id "rna-BK063639.1:1556..1625"; ID "rna-BK063639.1:1556..1625"; Parent "agat-gene-3"; gbkey "tRNA"; product "tRNA-Met";
BK063639.1 tpg exon 1556 1625 . + . gene_id "agat-gene-3"; transcript_id "rna-BK063639.1:1556..1625"; ID "exon-BK063639.1:1556..1625-1"; Parent "rna-BK063639.1:1556..1625"; gbkey "tRNA"; product "tRNA-Met";
BK063639.1 AGAT gene 1637 2779 . + . gene_id "agat-gene-4"; ID "agat-gene-4"; gbkey "tRNA"; product "tRNA-Ile";
BK063639.1 tpg tRNA 1637 2779 . + . gene_id "agat-gene-4"; transcript_id "rna-BK063639.1:1637..1705"; ID "rna-BK063639.1:1637..1705"; Parent "agat-gene-4"; gbkey "tRNA"; product "tRNA-Ile";
BK063639.1 tpg exon 1637 1705 . + . gene_id "agat-gene-4"; transcript_id "rna-BK063639.1:1637..1705"; ID "exon-BK063639.1:1637..1705-1"; Parent "rna-BK063639.1:1637..1705"; gbkey "tRNA"; product "tRNA-Ile";
BK063639.1 AGAT exon 1790 2779 . + . gene_id "agat-gene-4"; transcript_id "rna-BK063639.1:1637..1705"; Dbxref "NCBI_GP:DBA43806.1"; ID "agat-exon-3"; Name "DBA43806.1"; Parent "rna-BK063639.1:1637..1705"; gbkey "CDS"; product "ND2"; protein_id "DBA43806.1"; transl_table "5";
BK063639.1 tpg CDS 1790 2779 . + 0 gene_id "agat-gene-4"; transcript_id "rna-BK063639.1:1637..1705"; Dbxref "NCBI_GP:DBA43806.1"; ID "cds-DBA43806.1"; Name "DBA43806.1"; Parent "rna-BK063639.1:1637..1705"; gbkey "CDS"; product "ND2"; protein_id "DBA43806.1"; transl_table "5";
BK063639.1 AGAT five_prime_UTR 1637 1705 . + . gene_id "agat-gene-4"; transcript_id "rna-BK063639.1:1637..1705"; ID "agat-five_prime_utr-3"; Parent "rna-BK063639.1:1637..1705"; gbkey "tRNA"; product "tRNA-Ile";
BK063639.1 AGAT gene 2768 2840 . - . gene_id "agat-gene-5"; ID "agat-gene-5"; gbkey "tRNA"; product "tRNA-Cys";
BK063639.1 tpg tRNA 2768 2840 . - . gene_id "agat-gene-5"; transcript_id "rna-BK063639.1:2768..2840"; ID "rna-BK063639.1:2768..2840"; Parent "agat-gene-5"; gbkey "tRNA"; product "tRNA-Cys";
BK063639.1 tpg exon 2768 2840 . - . gene_id "agat-gene-5"; transcript_id "rna-BK063639.1:2768..2840"; ID "exon-BK063639.1:2768..2840-1"; Parent "rna-BK063639.1:2768..2840"; gbkey "tRNA"; product "tRNA-Cys";
BK063639.1 AGAT gene 2861 2929 . - . gene_id "agat-gene-6"; ID "agat-gene-6"; gbkey "tRNA"; product "tRNA-Tyr";
BK063639.1 tpg tRNA 2861 2929 . - . gene_id "agat-gene-6"; transcript_id "rna-BK063639.1:2861..2929"; ID "rna-BK063639.1:2861..2929"; Parent "agat-gene-6"; gbkey "tRNA"; product "tRNA-Tyr";
BK063639.1 tpg exon 2861 2929 . - . gene_id "agat-gene-6"; transcript_id "rna-BK063639.1:2861..2929"; ID "exon-BK063639.1:2861..2929-1"; Parent "rna-BK063639.1:2861..2929"; gbkey "tRNA"; product "tRNA-Tyr";
BK063639.1 AGAT gene 3030 4658 . - . gene_id "agat-gene-7"; ID "agat-gene-7"; gbkey "tRNA"; product "tRNA-Trp";
BK063639.1 tpg tRNA 3030 4658 . - . gene_id "agat-gene-7"; transcript_id "rna-BK063639.1:3030..3098"; ID "rna-BK063639.1:3030..3098"; Parent "agat-gene-7"; gbkey "tRNA"; product "tRNA-Trp";
BK063639.1 tpg exon 3030 3098 . - . gene_id "agat-gene-7"; transcript_id "rna-BK063639.1:3030..3098"; ID "exon-BK063639.1:3030..3098-1"; Parent "rna-BK063639.1:3030..3098"; gbkey "tRNA"; product "tRNA-Trp";
BK063639.1 AGAT exon 3114 4658 . + . gene_id "agat-gene-7"; transcript_id "rna-BK063639.1:3030..3098"; Dbxref "NCBI_GP:DBA43807.1"; ID "agat-exon-4"; Name "DBA43807.1"; Parent "rna-BK063639.1:3030..3098"; gbkey "CDS"; product "COX1"; protein_id "DBA43807.1"; transl_table "5";
BK063639.1 tpg CDS 3114 4658 . + 0 gene_id "agat-gene-7"; transcript_id "rna-BK063639.1:3030..3098"; Dbxref "NCBI_GP:DBA43807.1"; ID "cds-DBA43807.1"; Name "DBA43807.1"; Parent "rna-BK063639.1:3030..3098"; gbkey "CDS"; product "COX1"; protein_id "DBA43807.1"; transl_table "5";
BK063639.1 AGAT three_prime_UTR 3030 3098 . - . gene_id "agat-gene-7"; transcript_id "rna-BK063639.1:3030..3098"; ID "agat-three_prime_utr-1"; Parent "rna-BK063639.1:3030..3098"; gbkey "tRNA"; product "tRNA-Trp";
BK063639.1 AGAT gene 4708 5510 . + . gene_id "agat-gene-8"; ID "agat-gene-8"; gbkey "tRNA"; product "tRNA-Leu";
BK063639.1 tpg tRNA 4708 5510 . + . gene_id "agat-gene-8"; transcript_id "rna-BK063639.1:4708..4774"; ID "rna-BK063639.1:4708..4774"; Parent "agat-gene-8"; gbkey "tRNA"; product "tRNA-Leu";
BK063639.1 tpg exon 4708 4774 . + . gene_id "agat-gene-8"; transcript_id "rna-BK063639.1:4708..4774"; ID "exon-BK063639.1:4708..4774-1"; Parent "rna-BK063639.1:4708..4774"; gbkey "tRNA"; product "tRNA-Leu";
BK063639.1 AGAT exon 4835 5510 . + . gene_id "agat-gene-8"; transcript_id "rna-BK063639.1:4708..4774"; Dbxref "NCBI_GP:DBA43808.1"; ID "agat-exon-5"; Name "DBA43808.1"; Note "TAA stop codon is completed by the addition of 3' A residues to the mRNA"; Parent "rna-BK063639.1:4708..4774"; gbkey "CDS"; product "COX2"; protein_id "DBA43808.1"; transl_except "(pos:5510..5510,aa:TERM)"; transl_table "5";
BK063639.1 tpg CDS 4835 5510 . + 0 gene_id "agat-gene-8"; transcript_id "rna-BK063639.1:4708..4774"; Dbxref "NCBI_GP:DBA43808.1"; ID "cds-DBA43808.1"; Name "DBA43808.1"; Note "TAA stop codon is completed by the addition of 3' A residues to the mRNA"; Parent "rna-BK063639.1:4708..4774"; gbkey "CDS"; product "COX2"; protein_id "DBA43808.1"; transl_except "(pos:5510..5510,aa:TERM)"; transl_table "5";
BK063639.1 AGAT five_prime_UTR 4708 4774 . + . gene_id "agat-gene-8"; transcript_id "rna-BK063639.1:4708..4774"; ID "agat-five_prime_utr-4"; Parent "rna-BK063639.1:4708..4774"; gbkey "tRNA"; product "tRNA-Leu";
But there is a major problem in your file. AGAT is built around the eukaryotic gene model (which is not a problem for prokaryotes either):
gene -> mRNA -> CDS/exon
or
gene -> tRNA -> exon
but in your case you have:
? -> ? -> CDS (no exon)
? -> tRNA -> exon
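To see which features in a GFF3 file lack a Parent, one option is to scan the attribute column directly. This is only a minimal sketch, assuming a tab-separated GFF3 such as the sequence.gff3 used in this thread; the function name list_orphans is made up for illustration and is not part of AGAT.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an AGAT command): print "type<TAB>ID" for every
# feature whose attribute column (column 9) carries no Parent= tag.
list_orphans() {
  awk -F'\t' '
    /^#/ || NF < 9 { next }            # skip comment lines and incomplete records
    $9 !~ /(^|;)Parent=/ {             # attribute column has no Parent tag
      id = "NA"
      n = split($9, attrs, ";")
      for (i = 1; i <= n; i++)
        if (attrs[i] ~ /^ID=/) id = substr(attrs[i], 4)
      print $3 "\t" id
    }
  ' "$1"
}

# Usage: list_orphans sequence.gff3
```

On a file like the one above, this should list every tRNA and every CDS, while the exon lines (which do name a Parent) are left out.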
AGAT handles the tRNA features well, creating the missing gene Parent for each of them. Each CDS, however, gets attached to the parent of the preceding tRNA, because the CDS is defined after it and carries no Parent information. I would suggest doing the following (needed only because the CDS features are defined on their own):
# Deal with what is not CDS
awk '{if ($3!="CDS") print $0 }' sequence.gff3 > sequence.other.gff3
agat_convert_sp_gff2gtf.pl --gff sequence.other.gff3 -o sequence.other.agat.gff
# Deal with what is CDS
awk '{if ($3=="CDS") print $0 }' sequence.gff3 > sequence.cds.gff3
agat config --expose --locus_tag Name
agat_convert_sp_gff2gtf.pl --gff sequence.cds.gff3 -o sequence.cds.agat.gff
# Compile results
cp sequence.other.agat.gff sequence.all.agat.gff
tail -n +3 sequence.cds.agat.gff >> sequence.all.agat.gff
And you will be good to go. I will see whether I can add extra logic to AGAT to avoid this complexity.
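As a sanity check on the combined file, one could look for any gene_id that still groups a tRNA together with a CDS, which is the symptom visible in the output above (for example agat-gene-4). The sketch below assumes the file is tab-separated with GTF-style attributes (gene_id "..."), as in the AGAT output shown earlier; find_mixed_genes is a hypothetical name, not an AGAT tool.

```shell
#!/usr/bin/env bash
# Hypothetical check (not an AGAT command): print every gene_id that has
# both a tRNA line and a CDS line in a GTF-style file.
find_mixed_genes() {
  awk -F'\t' '
    /^#/ || NF < 9 { next }            # skip comment lines and incomplete records
    ($3 == "tRNA" || $3 == "CDS") && match($9, /gene_id "[^"]+"/) {
      # RSTART points at the g of gene_id; the value starts 9 characters later
      gid = substr($9, RSTART + 9, RLENGTH - 10)
      seen[gid, $3] = 1
      genes[gid] = 1
    }
    END {
      for (g in genes)
        if ((g, "tRNA") in seen && (g, "CDS") in seen) print g
    }
  ' "$1" | sort
}

# Usage: find_mixed_genes sequence.all.agat.gff
```

Empty output would suggest that no CDS shares a gene with a tRNA any more.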
General: AGAT 1.4.0
To Reproduce: Hi, I am trying to assess mitochondrial gene expression. I downloaded the FASTA and the GFF3 from NCBI: https://www.ncbi.nlm.nih.gov/nuccore/BK063639.1
Scripts and parameters to reproduce the behavior: agat_convert_sp_gff2gtf.pl --gff MitoBVos.gff3 -o MitoBVos.gtf
Input file description: GFF3 downloaded from NCBI
Expected behavior: I used this GTF to build the STAR index for read mapping, and I expected the gene names to match those in the original GFF3 file. Instead, the GeneCounts output from STAR has uninformative names:
N_unmapped 19624491 19624491 19624491
N_multimapping 1171 1171 1171
N_noFeature 1050 726463 192370
N_ambiguous 5143 433 4279
agat-gene-1 7 13 2
agat-gene-2 6 2 11
agat-gene-3 2 1 1
agat-gene-4 790 120 684
agat-gene-5 1 12 0
agat-gene-6 9 9 0
agat-gene-7 138981 133349 6023
agat-gene-8 15186 499 15078
agat-gene-9 0 0 0
agat-gene-10 9803 2182 7621
agat-gene-11 29 10 19
agat-gene-12 0 0 8
agat-gene-13 0 1 0
agat-gene-14 0 1 0
agat-gene-15 0 0 0
agat-gene-16 10698 1445 9253
agat-gene-17 38905 1874 37031
agat-gene-18 1 1 0
agat-gene-19 46999 10685 36314
agat-gene-20 4061 3537 549
agat-gene-21 0 4 2
agat-gene-22 585645 33359 552299
agat-gene-23 60897 4222 56675
agat-gene-24 15 6 9
Could you tell me what the problem with the GFF3 file is, and why the GTF doesn't have the proper names?