NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit https://nbisweden.github.io/AGAT/
GNU General Public License v3.0

No gene name when conversion from gff3 to gtf #484

Closed desmodus1984 closed 3 months ago

desmodus1984 commented 3 months ago

Describe the bug A clear and concise description of what the bug is.

General (please complete the following information): AGAT 1.4.0

Scripts and parameters to reproduce the behavior: agat_convert_sp_gff2gtf.pl --gff MitoBVos.gff3 -o MitoBVos.gtf

Input file description (made with this tool, downloaded from this resource): GFF3 downloaded from NCBI

Expected behavior: I used this GTF to build the STAR index for read mapping, and I expected the gene names to match those in the original GFF3 file. Instead, the GeneCounts output from STAR has weird, uninformative names:

N_unmapped      19624491  19624491  19624491
N_multimapping  1171      1171      1171
N_noFeature     1050      726463    192370
N_ambiguous     5143      433       4279
agat-gene-1     7         13        2
agat-gene-2     6         2         11
agat-gene-3     2         1         1
agat-gene-4     790       120       684
agat-gene-5     1         12        0
agat-gene-6     9         9         0
agat-gene-7     138981    133349    6023
agat-gene-8     15186     499       15078
agat-gene-9     0         0         0
agat-gene-10    9803      2182      7621
agat-gene-11    29        10        19
agat-gene-12    0         0         8
agat-gene-13    0         1         0
agat-gene-14    0         1         0
agat-gene-15    0         0         0
agat-gene-16    10698     1445      9253
agat-gene-17    38905     1874      37031
agat-gene-18    1         1         0
agat-gene-19    46999     10685     36314
agat-gene-20    4061      3537      549
agat-gene-21    0         4         2
agat-gene-22    585645    33359     552299
agat-gene-23    60897     4222      56675
agat-gene-24    15        6         9

Could you tell me what in the GFF3 file causes the GTF to lose the proper gene names?

Juke34 commented 3 months ago

If you compare your file before and after conversion, you will see that AGAT creates the missing gene features. It generates IDs such as agat-gene-4 because the features that are present (e.g. tRNA) do not say what their Parent is. Below, your input GFF3 is followed by the GTF that AGAT produces:

BK063639.1  tpg tRNA    1173    1242    .   -   .   ID=rna-BK063639.1:1173..1242;gbkey=tRNA;product=tRNA-Met
BK063639.1  tpg exon    1173    1242    .   -   .   ID=exon-BK063639.1:1173..1242-1;Parent=rna-BK063639.1:1173..1242;gbkey=tRNA;product=tRNA-Met
BK063639.1  tpg tRNA    1350    1417    .   +   .   ID=rna-BK063639.1:1350..1417;gbkey=tRNA;product=tRNA-Ala
BK063639.1  tpg exon    1350    1417    .   +   .   ID=exon-BK063639.1:1350..1417-1;Parent=rna-BK063639.1:1350..1417;gbkey=tRNA;product=tRNA-Ala
BK063639.1  tpg tRNA    1556    1625    .   +   .   ID=rna-BK063639.1:1556..1625;gbkey=tRNA;product=tRNA-Met
BK063639.1  tpg exon    1556    1625    .   +   .   ID=exon-BK063639.1:1556..1625-1;Parent=rna-BK063639.1:1556..1625;gbkey=tRNA;product=tRNA-Met
BK063639.1  tpg tRNA    1637    1705    .   +   .   ID=rna-BK063639.1:1637..1705;gbkey=tRNA;product=tRNA-Ile
BK063639.1  tpg exon    1637    1705    .   +   .   ID=exon-BK063639.1:1637..1705-1;Parent=rna-BK063639.1:1637..1705;gbkey=tRNA;product=tRNA-Ile
BK063639.1  tpg CDS 1790    2779    .   +   0   ID=cds-DBA43806.1;Dbxref=NCBI_GP:DBA43806.1;Name=DBA43806.1;gbkey=CDS;product=ND2;protein_id=DBA43806.1;transl_table=5
BK063639.1  tpg tRNA    2768    2840    .   -   .   ID=rna-BK063639.1:2768..2840;gbkey=tRNA;product=tRNA-Cys
BK063639.1  tpg exon    2768    2840    .   -   .   ID=exon-BK063639.1:2768..2840-1;Parent=rna-BK063639.1:2768..2840;gbkey=tRNA;product=tRNA-Cys
BK063639.1  tpg tRNA    2861    2929    .   -   .   ID=rna-BK063639.1:2861..2929;gbkey=tRNA;product=tRNA-Tyr
BK063639.1  tpg exon    2861    2929    .   -   .   ID=exon-BK063639.1:2861..2929-1;Parent=rna-BK063639.1:2861..2929;gbkey=tRNA;product=tRNA-Tyr
BK063639.1  tpg tRNA    3030    3098    .   -   .   ID=rna-BK063639.1:3030..3098;gbkey=tRNA;product=tRNA-Trp
BK063639.1  tpg exon    3030    3098    .   -   .   ID=exon-BK063639.1:3030..3098-1;Parent=rna-BK063639.1:3030..3098;gbkey=tRNA;product=tRNA-Trp
BK063639.1  tpg CDS 3114    4658    .   +   0   ID=cds-DBA43807.1;Dbxref=NCBI_GP:DBA43807.1;Name=DBA43807.1;gbkey=CDS;product=COX1;protein_id=DBA43807.1;transl_table=5
BK063639.1  tpg tRNA    4708    4774    .   +   .   ID=rna-BK063639.1:4708..4774;gbkey=tRNA;product=tRNA-Leu
BK063639.1  tpg exon    4708    4774    .   +   .   ID=exon-BK063639.1:4708..4774-1;Parent=rna-BK063639.1:4708..4774;gbkey=tRNA;product=tRNA-Leu
BK063639.1  tpg CDS 4835    5510    .   +   0   ID=cds-DBA43808.1;Dbxref=NCBI_GP:DBA43808.1;Name=DBA43808.1;Note=TAA stop codon is completed by the addition of 3' A residues to the mRNA;gbkey=CDS;product=COX2;protein_id=DBA43808.1;transl_except=(pos:5510..5510%2Caa:TERM);transl_table=5

Output GTF produced by AGAT:

BK063639.1  AGAT    gene    1173    1242    .   -   .   gene_id "agat-gene-1"; ID "agat-gene-1"; gbkey "tRNA"; product "tRNA-Met";
BK063639.1  tpg tRNA    1173    1242    .   -   .   gene_id "agat-gene-1"; transcript_id "rna-BK063639.1:1173..1242"; ID "rna-BK063639.1:1173..1242"; Parent "agat-gene-1"; gbkey "tRNA"; product "tRNA-Met";
BK063639.1  tpg exon    1173    1242    .   -   .   gene_id "agat-gene-1"; transcript_id "rna-BK063639.1:1173..1242"; ID "exon-BK063639.1:1173..1242-1"; Parent "rna-BK063639.1:1173..1242"; gbkey "tRNA"; product "tRNA-Met";
BK063639.1  AGAT    gene    1350    1417    .   +   .   gene_id "agat-gene-2"; ID "agat-gene-2"; gbkey "tRNA"; product "tRNA-Ala";
BK063639.1  tpg tRNA    1350    1417    .   +   .   gene_id "agat-gene-2"; transcript_id "rna-BK063639.1:1350..1417"; ID "rna-BK063639.1:1350..1417"; Parent "agat-gene-2"; gbkey "tRNA"; product "tRNA-Ala";
BK063639.1  tpg exon    1350    1417    .   +   .   gene_id "agat-gene-2"; transcript_id "rna-BK063639.1:1350..1417"; ID "exon-BK063639.1:1350..1417-1"; Parent "rna-BK063639.1:1350..1417"; gbkey "tRNA"; product "tRNA-Ala";
BK063639.1  AGAT    gene    1556    1625    .   +   .   gene_id "agat-gene-3"; ID "agat-gene-3"; gbkey "tRNA"; product "tRNA-Met";
BK063639.1  tpg tRNA    1556    1625    .   +   .   gene_id "agat-gene-3"; transcript_id "rna-BK063639.1:1556..1625"; ID "rna-BK063639.1:1556..1625"; Parent "agat-gene-3"; gbkey "tRNA"; product "tRNA-Met";
BK063639.1  tpg exon    1556    1625    .   +   .   gene_id "agat-gene-3"; transcript_id "rna-BK063639.1:1556..1625"; ID "exon-BK063639.1:1556..1625-1"; Parent "rna-BK063639.1:1556..1625"; gbkey "tRNA"; product "tRNA-Met";
BK063639.1  AGAT    gene    1637    2779    .   +   .   gene_id "agat-gene-4"; ID "agat-gene-4"; gbkey "tRNA"; product "tRNA-Ile";
BK063639.1  tpg tRNA    1637    2779    .   +   .   gene_id "agat-gene-4"; transcript_id "rna-BK063639.1:1637..1705"; ID "rna-BK063639.1:1637..1705"; Parent "agat-gene-4"; gbkey "tRNA"; product "tRNA-Ile";
BK063639.1  tpg exon    1637    1705    .   +   .   gene_id "agat-gene-4"; transcript_id "rna-BK063639.1:1637..1705"; ID "exon-BK063639.1:1637..1705-1"; Parent "rna-BK063639.1:1637..1705"; gbkey "tRNA"; product "tRNA-Ile";
BK063639.1  AGAT    exon    1790    2779    .   +   .   gene_id "agat-gene-4"; transcript_id "rna-BK063639.1:1637..1705"; Dbxref "NCBI_GP:DBA43806.1"; ID "agat-exon-3"; Name "DBA43806.1"; Parent "rna-BK063639.1:1637..1705"; gbkey "CDS"; product "ND2"; protein_id "DBA43806.1"; transl_table "5";
BK063639.1  tpg CDS 1790    2779    .   +   0   gene_id "agat-gene-4"; transcript_id "rna-BK063639.1:1637..1705"; Dbxref "NCBI_GP:DBA43806.1"; ID "cds-DBA43806.1"; Name "DBA43806.1"; Parent "rna-BK063639.1:1637..1705"; gbkey "CDS"; product "ND2"; protein_id "DBA43806.1"; transl_table "5";
BK063639.1  AGAT    five_prime_UTR  1637    1705    .   +   .   gene_id "agat-gene-4"; transcript_id "rna-BK063639.1:1637..1705"; ID "agat-five_prime_utr-3"; Parent "rna-BK063639.1:1637..1705"; gbkey "tRNA"; product "tRNA-Ile";
BK063639.1  AGAT    gene    2768    2840    .   -   .   gene_id "agat-gene-5"; ID "agat-gene-5"; gbkey "tRNA"; product "tRNA-Cys";
BK063639.1  tpg tRNA    2768    2840    .   -   .   gene_id "agat-gene-5"; transcript_id "rna-BK063639.1:2768..2840"; ID "rna-BK063639.1:2768..2840"; Parent "agat-gene-5"; gbkey "tRNA"; product "tRNA-Cys";
BK063639.1  tpg exon    2768    2840    .   -   .   gene_id "agat-gene-5"; transcript_id "rna-BK063639.1:2768..2840"; ID "exon-BK063639.1:2768..2840-1"; Parent "rna-BK063639.1:2768..2840"; gbkey "tRNA"; product "tRNA-Cys";
BK063639.1  AGAT    gene    2861    2929    .   -   .   gene_id "agat-gene-6"; ID "agat-gene-6"; gbkey "tRNA"; product "tRNA-Tyr";
BK063639.1  tpg tRNA    2861    2929    .   -   .   gene_id "agat-gene-6"; transcript_id "rna-BK063639.1:2861..2929"; ID "rna-BK063639.1:2861..2929"; Parent "agat-gene-6"; gbkey "tRNA"; product "tRNA-Tyr";
BK063639.1  tpg exon    2861    2929    .   -   .   gene_id "agat-gene-6"; transcript_id "rna-BK063639.1:2861..2929"; ID "exon-BK063639.1:2861..2929-1"; Parent "rna-BK063639.1:2861..2929"; gbkey "tRNA"; product "tRNA-Tyr";
BK063639.1  AGAT    gene    3030    4658    .   -   .   gene_id "agat-gene-7"; ID "agat-gene-7"; gbkey "tRNA"; product "tRNA-Trp";
BK063639.1  tpg tRNA    3030    4658    .   -   .   gene_id "agat-gene-7"; transcript_id "rna-BK063639.1:3030..3098"; ID "rna-BK063639.1:3030..3098"; Parent "agat-gene-7"; gbkey "tRNA"; product "tRNA-Trp";
BK063639.1  tpg exon    3030    3098    .   -   .   gene_id "agat-gene-7"; transcript_id "rna-BK063639.1:3030..3098"; ID "exon-BK063639.1:3030..3098-1"; Parent "rna-BK063639.1:3030..3098"; gbkey "tRNA"; product "tRNA-Trp";
BK063639.1  AGAT    exon    3114    4658    .   +   .   gene_id "agat-gene-7"; transcript_id "rna-BK063639.1:3030..3098"; Dbxref "NCBI_GP:DBA43807.1"; ID "agat-exon-4"; Name "DBA43807.1"; Parent "rna-BK063639.1:3030..3098"; gbkey "CDS"; product "COX1"; protein_id "DBA43807.1"; transl_table "5";
BK063639.1  tpg CDS 3114    4658    .   +   0   gene_id "agat-gene-7"; transcript_id "rna-BK063639.1:3030..3098"; Dbxref "NCBI_GP:DBA43807.1"; ID "cds-DBA43807.1"; Name "DBA43807.1"; Parent "rna-BK063639.1:3030..3098"; gbkey "CDS"; product "COX1"; protein_id "DBA43807.1"; transl_table "5";
BK063639.1  AGAT    three_prime_UTR 3030    3098    .   -   .   gene_id "agat-gene-7"; transcript_id "rna-BK063639.1:3030..3098"; ID "agat-three_prime_utr-1"; Parent "rna-BK063639.1:3030..3098"; gbkey "tRNA"; product "tRNA-Trp";
BK063639.1  AGAT    gene    4708    5510    .   +   .   gene_id "agat-gene-8"; ID "agat-gene-8"; gbkey "tRNA"; product "tRNA-Leu";
BK063639.1  tpg tRNA    4708    5510    .   +   .   gene_id "agat-gene-8"; transcript_id "rna-BK063639.1:4708..4774"; ID "rna-BK063639.1:4708..4774"; Parent "agat-gene-8"; gbkey "tRNA"; product "tRNA-Leu";
BK063639.1  tpg exon    4708    4774    .   +   .   gene_id "agat-gene-8"; transcript_id "rna-BK063639.1:4708..4774"; ID "exon-BK063639.1:4708..4774-1"; Parent "rna-BK063639.1:4708..4774"; gbkey "tRNA"; product "tRNA-Leu";
BK063639.1  AGAT    exon    4835    5510    .   +   .   gene_id "agat-gene-8"; transcript_id "rna-BK063639.1:4708..4774"; Dbxref "NCBI_GP:DBA43808.1"; ID "agat-exon-5"; Name "DBA43808.1"; Note "TAA stop codon is completed by the addition of 3' A residues to the mRNA"; Parent "rna-BK063639.1:4708..4774"; gbkey "CDS"; product "COX2"; protein_id "DBA43808.1"; transl_except "(pos:5510..5510,aa:TERM)"; transl_table "5";
BK063639.1  tpg CDS 4835    5510    .   +   0   gene_id "agat-gene-8"; transcript_id "rna-BK063639.1:4708..4774"; Dbxref "NCBI_GP:DBA43808.1"; ID "cds-DBA43808.1"; Name "DBA43808.1"; Note "TAA stop codon is completed by the addition of 3' A residues to the mRNA"; Parent "rna-BK063639.1:4708..4774"; gbkey "CDS"; product "COX2"; protein_id "DBA43808.1"; transl_except "(pos:5510..5510,aa:TERM)"; transl_table "5";
BK063639.1  AGAT    five_prime_UTR  4708    4774    .   +   .   gene_id "agat-gene-8"; transcript_id "rna-BK063639.1:4708..4774"; ID "agat-five_prime_utr-4"; Parent "rna-BK063639.1:4708..4774"; gbkey "tRNA"; product "tRNA-Leu";
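
To see at a glance which feature types in the input carry no Parent attribute (and will therefore get a generated parent or be attached to a neighbouring one), something like the following can help. This is only a minimal sketch: the function name `count_parentless` is made up for illustration, and it assumes a tab-delimited GFF3 with the attributes in column 9.

```shell
# Hypothetical helper: count, per feature type, the GFF3 records that have
# no Parent attribute. Assumes a tab-delimited file, attributes in column 9.
count_parentless() {
  awk -F'\t' '
    /^#/ || NF < 9 { next }             # skip comments and incomplete lines
    $9 !~ /(^|;)Parent=/ { n[$3]++ }    # attributes contain no Parent=...
    END { for (t in n) print t "\t" n[t] }
  ' "$1" | sort
}

# Usage (file name taken from the report above):
# count_parentless MitoBVos.gff3
```

On the excerpt above this would report tRNA and CDS as parentless, while the exon records all have a Parent.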

But there is a major problem in your file. AGAT is based on the eukaryotic dogma (which is not a problem for prokaryotes anyway):

AGAT performs well for the tRNAs, creating the missing gene Parent, but it attaches each CDS to the parent of the preceding tRNA, because the CDS is defined after that tRNA and has no Parent information. I would suggest doing the following (only because the CDS are defined on their own):

# Deal with everything that is not CDS
awk '{if ($3!="CDS") print $0}' sequence.gff3 > sequence.other.gff3
agat_convert_sp_gff2gtf.pl --gff sequence.other.gff3 -o sequence.other.agat.gff

# Deal with the CDS
awk '{if ($3=="CDS") print $0}' sequence.gff3 > sequence.cds.gff3
# Expose a local AGAT config with locus_tag set to Name
agat config --expose --locus_tag Name
agat_convert_sp_gff2gtf.pl --gff sequence.cds.gff3 -o sequence.cds.agat.gff

# Compile results (tail -n+3 skips the first two lines of the CDS output)
cp sequence.other.agat.gff sequence.all.agat.gff
tail -n+3 sequence.cds.agat.gff >> sequence.all.agat.gff

And you will be good to go. I will see if I can add extra knowledge to AGAT to avoid this complexity and these issues.
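
Before rebuilding the STAR index, it may be worth looking at which gene_id values ended up in the merged file, since those are the names STAR will report. The sketch below only lists them. The function name `list_gene_ids` is hypothetical, and it assumes a tab-delimited GTF with `gene_id "..."` in column 9. What the CDS gene_id values look like depends on how AGAT applies the locus_tag setting, so no expected output is given.

```shell
# Hypothetical helper: list the distinct (feature type, gene_id) pairs in a GTF.
# Assumes a tab-delimited file with gene_id "..." in the attribute column.
list_gene_ids() {
  awk -F'\t' '
    /^#/ || NF < 9 { next }             # skip comments and incomplete lines
    match($9, /gene_id "[^"]+"/) {
      # strip the leading gene_id " (9 characters) and the closing quote
      print $3 "\t" substr($9, RSTART + 9, RLENGTH - 10)
    }
  ' "$1" | sort -u
}

# Usage (file name from the commands above):
# list_gene_ids sequence.all.agat.gff
```

If CDS rows still share a gene_id with a neighbouring tRNA, the split did not take effect and the inputs are worth re-checking.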