NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit
GNU General Public License v3.0

Unknown/additional features in GFF2GTF conversion #467

Closed sivasubramanics closed 1 week ago

sivasubramanics commented 1 week ago

Describe the bug

I am trying to convert a GFF file to GTF, using one of the reference GFF files downloaded from NCBI. The converted GTF contains additional features for the transcripts, and the ID attribute is also changed.

To Reproduce

agat_convert_sp_gff2gtf.pl --gff LSALG_LOCUS11.gff -o LSALG_LOCUS11.gtf

Input file description (downloaded gff for this genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_949927565.1/)

lsal_0  EMBL  gene  1739541 1740699 . - . ID=gene-LSALG_LOCUS11;Name=LSALG_LOCUS11;gbkey=Gene;gene_biotype=protein_coding;locus_tag=LSALG_LOCUS11
lsal_0  EMBL  mRNA  1739541 1740699 . - . ID=rna-LSALG_LOCUS11;Parent=gene-LSALG_LOCUS11;Note=source:maker%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1;gbkey=mRNA;locus_tag=LSALG_LOCUS11
lsal_0  EMBL  exon  1740549 1740699 . - . ID=exon-LSALG_LOCUS11-1;Parent=rna-LSALG_LOCUS11;Note=source:maker%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1;gbkey=mRNA;locus_tag=LSALG_LOCUS11
lsal_0  EMBL  exon  1740169 1740460 . - . ID=exon-LSALG_LOCUS11-2;Parent=rna-LSALG_LOCUS11;Note=source:maker%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1;gbkey=mRNA;locus_tag=LSALG_LOCUS11
lsal_0  EMBL  exon  1740020 1740095 . - . ID=exon-LSALG_LOCUS11-3;Parent=rna-LSALG_LOCUS11;Note=source:maker%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1;gbkey=mRNA;locus_tag=LSALG_LOCUS11
lsal_0  EMBL  exon  1739794 1739918 . - . ID=exon-LSALG_LOCUS11-4;Parent=rna-LSALG_LOCUS11;Note=source:maker%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1;gbkey=mRNA;locus_tag=LSALG_LOCUS11
lsal_0  EMBL  exon  1739541 1739686 . - . ID=exon-LSALG_LOCUS11-5;Parent=rna-LSALG_LOCUS11;Note=source:maker%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1;gbkey=mRNA;locus_tag=LSALG_LOCUS11
lsal_0  EMBL  CDS 1740549 1740692 . - 0 ID=cds-CAI9259098.1;Parent=rna-LSALG_LOCUS11;Dbxref=NCBI_GP:CAI9259098.1;Name=CAI9259098.1;Note=source:maker%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS1%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS2%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS3%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS4%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS5;gbkey=CDS;locus_tag=LSALG_LOCUS11;product=CAI9259098.1;protein_id=CAI9259098.1
lsal_0  EMBL  CDS 1740169 1740460 . - 0 ID=cds-CAI9259098.1;Parent=rna-LSALG_LOCUS11;Dbxref=NCBI_GP:CAI9259098.1;Name=CAI9259098.1;Note=source:maker%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS1%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS2%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS3%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS4%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS5;gbkey=CDS;locus_tag=LSALG_LOCUS11;product=CAI9259098.1;protein_id=CAI9259098.1
lsal_0  EMBL  CDS 1740020 1740095 . - 2 ID=cds-CAI9259098.1;Parent=rna-LSALG_LOCUS11;Dbxref=NCBI_GP:CAI9259098.1;Name=CAI9259098.1;Note=source:maker%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS1%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS2%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS3%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS4%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS5;gbkey=CDS;locus_tag=LSALG_LOCUS11;product=CAI9259098.1;protein_id=CAI9259098.1
lsal_0  EMBL  CDS 1739794 1739918 . - 1 ID=cds-CAI9259098.1;Parent=rna-LSALG_LOCUS11;Dbxref=NCBI_GP:CAI9259098.1;Name=CAI9259098.1;Note=source:maker%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS1%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS2%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS3%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS4%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS5;gbkey=CDS;locus_tag=LSALG_LOCUS11;product=CAI9259098.1;protein_id=CAI9259098.1
lsal_0  EMBL  CDS 1739601 1739686 . - 2 ID=cds-CAI9259098.1;Parent=rna-LSALG_LOCUS11;Dbxref=NCBI_GP:CAI9259098.1;Name=CAI9259098.1;Note=source:maker%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS1%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS2%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS3%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS4%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS5;gbkey=CDS;locus_tag=LSALG_LOCUS11;product=CAI9259098.1;protein_id=CAI9259098.1
lsal_0  EMBL  three_prime_UTR 1739541 1739600 . - . ID=id-LSALG_LOCUS11;Parent=gene-LSALG_LOCUS11;Note=source:maker%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1.exonthree_prime_utr;gbkey=3'UTR;locus_tag=LSALG_LOCUS11
lsal_0  EMBL  five_prime_UTR  1740693 1740699 . - . ID=id-LSALG_LOCUS11-2;Parent=gene-LSALG_LOCUS11;Note=source:maker%3B~ID:Lsal_1_v1_gn_0_00000010.mRNA1.exonfive_prime_utr;gbkey=5'UTR;locus_tag=LSALG_LOCUS11

And this is the output we received:

lsal_0  EMBL  gene  1739541 1740699 . - . gene_id "agat-gene-1"; ID "agat-gene-1"; Name "LSALG_LOCUS11"; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "LSALG_LOCUS11";
lsal_0  AGAT  RNA 1739541 1740699 . - . gene_id "agat-gene-1"; transcript_id "gene-LSALG_LOCUS11"; ID "gene-LSALG_LOCUS11"; Note "source:maker;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.exonfive_prime_utr"; Parent "agat-gene-1"; gbkey "5'UTR"; locus_tag "LSALG_LOCUS11";
lsal_0  AGAT  exon  1739541 1739600 . - . gene_id "agat-gene-1"; transcript_id "gene-LSALG_LOCUS11"; ID "agat-exon-1"; Note "source:maker;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.exonfive_prime_utr"; Parent "gene-LSALG_LOCUS11"; gbkey "5'UTR"; locus_tag "LSALG_LOCUS11";
lsal_0  AGAT  exon  1740693 1740699 . - . gene_id "agat-gene-1"; transcript_id "gene-LSALG_LOCUS11"; ID "agat-exon-2"; Note "source:maker;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.exonfive_prime_utr"; Parent "gene-LSALG_LOCUS11"; gbkey "5'UTR"; locus_tag "LSALG_LOCUS11";
lsal_0  EMBL  five_prime_UTR  1740693 1740699 . - . gene_id "agat-gene-1"; transcript_id "gene-LSALG_LOCUS11"; ID "id-LSALG_LOCUS11-2"; Note "source:maker;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.exonfive_prime_utr"; Parent "gene-LSALG_LOCUS11"; gbkey "5'UTR"; locus_tag "LSALG_LOCUS11";
lsal_0  EMBL  three_prime_UTR 1739541 1739600 . - . gene_id "agat-gene-1"; transcript_id "gene-LSALG_LOCUS11"; ID "id-LSALG_LOCUS11"; Note "source:maker;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.exonthree_prime_utr"; Parent "gene-LSALG_LOCUS11"; gbkey "3'UTR"; locus_tag "LSALG_LOCUS11";
lsal_0  AGAT  gene  1739541 1740699 . - . gene_id "agat-gene-2"; ID "agat-gene-2"; Note "source:maker;~ID:Lsal_1_v1_gn_0_00000010.mRNA1"; gbkey "mRNA"; locus_tag "LSALG_LOCUS11";
lsal_0  EMBL  mRNA  1739541 1740699 . - . gene_id "agat-gene-2"; transcript_id "rna-LSALG_LOCUS11"; ID "rna-LSALG_LOCUS11"; Note "source:maker;~ID:Lsal_1_v1_gn_0_00000010.mRNA1"; Parent "agat-gene-2"; gbkey "mRNA"; locus_tag "LSALG_LOCUS11";
lsal_0  EMBL  exon  1739541 1739686 . - . gene_id "agat-gene-2"; transcript_id "rna-LSALG_LOCUS11"; ID "exon-LSALG_LOCUS11-5"; Note "source:maker;~ID:Lsal_1_v1_gn_0_00000010.mRNA1"; Parent "rna-LSALG_LOCUS11"; gbkey "mRNA"; locus_tag "LSALG_LOCUS11";
lsal_0  EMBL  exon  1739794 1739918 . - . gene_id "agat-gene-2"; transcript_id "rna-LSALG_LOCUS11"; ID "exon-LSALG_LOCUS11-4"; Note "source:maker;~ID:Lsal_1_v1_gn_0_00000010.mRNA1"; Parent "rna-LSALG_LOCUS11"; gbkey "mRNA"; locus_tag "LSALG_LOCUS11";
lsal_0  EMBL  exon  1740020 1740095 . - . gene_id "agat-gene-2"; transcript_id "rna-LSALG_LOCUS11"; ID "exon-LSALG_LOCUS11-3"; Note "source:maker;~ID:Lsal_1_v1_gn_0_00000010.mRNA1"; Parent "rna-LSALG_LOCUS11"; gbkey "mRNA"; locus_tag "LSALG_LOCUS11";
lsal_0  EMBL  exon  1740169 1740460 . - . gene_id "agat-gene-2"; transcript_id "rna-LSALG_LOCUS11"; ID "exon-LSALG_LOCUS11-2"; Note "source:maker;~ID:Lsal_1_v1_gn_0_00000010.mRNA1"; Parent "rna-LSALG_LOCUS11"; gbkey "mRNA"; locus_tag "LSALG_LOCUS11";
lsal_0  EMBL  exon  1740549 1740699 . - . gene_id "agat-gene-2"; transcript_id "rna-LSALG_LOCUS11"; ID "exon-LSALG_LOCUS11-1"; Note "source:maker;~ID:Lsal_1_v1_gn_0_00000010.mRNA1"; Parent "rna-LSALG_LOCUS11"; gbkey "mRNA"; locus_tag "LSALG_LOCUS11";
lsal_0  EMBL  CDS 1739601 1739686 . - 2 gene_id "agat-gene-2"; transcript_id "rna-LSALG_LOCUS11"; Dbxref "NCBI_GP:CAI9259098.1"; ID "cds-CAI9259098.1"; Name "CAI9259098.1"; Note "source:maker;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS1;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS2;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS3;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS4;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS5"; Parent "rna-LSALG_LOCUS11"; gbkey "CDS"; locus_tag "LSALG_LOCUS11"; product "CAI9259098.1"; protein_id "CAI9259098.1";
lsal_0  EMBL  CDS 1739794 1739918 . - 1 gene_id "agat-gene-2"; transcript_id "rna-LSALG_LOCUS11"; Dbxref "NCBI_GP:CAI9259098.1"; ID "cds-CAI9259098.1"; Name "CAI9259098.1"; Note "source:maker;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS1;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS2;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS3;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS4;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS5"; Parent "rna-LSALG_LOCUS11"; gbkey "CDS"; locus_tag "LSALG_LOCUS11"; product "CAI9259098.1"; protein_id "CAI9259098.1";
lsal_0  EMBL  CDS 1740020 1740095 . - 2 gene_id "agat-gene-2"; transcript_id "rna-LSALG_LOCUS11"; Dbxref "NCBI_GP:CAI9259098.1"; ID "cds-CAI9259098.1"; Name "CAI9259098.1"; Note "source:maker;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS1;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS2;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS3;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS4;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS5"; Parent "rna-LSALG_LOCUS11"; gbkey "CDS"; locus_tag "LSALG_LOCUS11"; product "CAI9259098.1"; protein_id "CAI9259098.1";
lsal_0  EMBL  CDS 1740169 1740460 . - 0 gene_id "agat-gene-2"; transcript_id "rna-LSALG_LOCUS11"; Dbxref "NCBI_GP:CAI9259098.1"; ID "cds-CAI9259098.1"; Name "CAI9259098.1"; Note "source:maker;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS1;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS2;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS3;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS4;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS5"; Parent "rna-LSALG_LOCUS11"; gbkey "CDS"; locus_tag "LSALG_LOCUS11"; product "CAI9259098.1"; protein_id "CAI9259098.1";
lsal_0  EMBL  CDS 1740549 1740692 . - 0 gene_id "agat-gene-2"; transcript_id "rna-LSALG_LOCUS11"; Dbxref "NCBI_GP:CAI9259098.1"; ID "cds-CAI9259098.1"; Name "CAI9259098.1"; Note "source:maker;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS1;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS2;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS3;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS4;~ID:Lsal_1_v1_gn_0_00000010.mRNA1.CDS5"; Parent "rna-LSALG_LOCUS11"; gbkey "CDS"; locus_tag "LSALG_LOCUS11"; product "CAI9259098.1"; protein_id "CAI9259098.1";
lsal_0  AGAT  five_prime_UTR  1740693 1740699 . - . gene_id "agat-gene-2"; transcript_id "rna-LSALG_LOCUS11"; ID "agat-five_prime_utr-1"; Note "source:maker;~ID:Lsal_1_v1_gn_0_00000010.mRNA1"; Parent "rna-LSALG_LOCUS11"; gbkey "mRNA"; locus_tag "LSALG_LOCUS11";

ISSUE

I feel the input GFF contains all the necessary features, so the conversion should only need to translate those lines to GTF. Instead, we see additional IDs and RNA features introduced by AGAT, which makes the output misleading and difficult to process.

In short,

  1. Why is there a new gene, "agat-gene-1", in the output?
  2. Why was the ID attribute changed? E.g. ID=gene-LSALG_LOCUS11; was converted to ID "agat-gene-2" in the GTF.

Otherwise, am I missing something in this process?

Juke34 commented 1 week ago

Right, your file is badly structured. The three_prime_UTR and five_prime_UTR features are attached directly to the gene instead of to the mRNA. AGAT tries to follow the information provided, so it creates a new RNA to hold the UTRs, because linking them directly to a gene is not allowed. It then has only one gene feature where it needs two, so it creates a new one. I guess that, since it does not know which gene the original information should be attached to, it gets rid of it.
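To see how widespread the problem is in a file, a small check like the one below may help. It is only a sketch and not part of AGAT: the function name `utrs_parented_to_genes` is made up for illustration, and it assumes a tab-delimited GFF3 file with simple `key=value` attributes (the listing pasted in this issue has had its tabs flattened to spaces).

```python
UTR_TYPES = {"three_prime_UTR", "five_prime_UTR"}


def utrs_parented_to_genes(gff_lines):
    """Return the IDs of UTR features whose Parent is a gene feature.

    Hypothetical helper; assumes tab-delimited GFF3 with 9 columns.
    """
    records = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        # Column 9 holds semicolon-separated key=value pairs.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        records.append((cols[2], attrs))

    gene_ids = {a["ID"] for ftype, a in records if ftype == "gene" and "ID" in a}

    flagged = []
    for ftype, attrs in records:
        if ftype not in UTR_TYPES:
            continue
        # Parent may list several IDs separated by commas.
        parents = attrs.get("Parent", "").split(",")
        if any(parent in gene_ids for parent in parents):
            flagged.append(attrs.get("ID"))
    return flagged
```

Run on the input above, this would be expected to flag both `id-LSALG_LOCUS11` and `id-LSALG_LOCUS11-2`, since their Parent is `gene-LSALG_LOCUS11`.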

If this is consistent throughout the file, you can fix it by removing the Parent attribute from all three_prime_UTR and five_prime_UTR features. When AGAT parses the file, it will then attach those features to the previous mRNA encountered (sequentially).
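A minimal sketch of that fix, assuming a tab-delimited GFF3 file: it drops the `Parent=...` field from column 9 of every UTR line and leaves everything else untouched. The function name `strip_utr_parent` is hypothetical, and the script only edits text; it does not call AGAT.

```python
UTR_TYPES = {"three_prime_UTR", "five_prime_UTR"}


def strip_utr_parent(gff_lines):
    """Yield GFF lines with the Parent attribute removed from UTR features.

    Hypothetical helper; assumes tab-delimited GFF3 with 9 columns.
    """
    for line in gff_lines:
        text = line.rstrip("\n")
        cols = text.split("\t")
        if len(cols) == 9 and cols[2] in UTR_TYPES:
            # Keep every attribute except Parent.
            kept = [
                field
                for field in cols[8].split(";")
                if field and not field.startswith("Parent=")
            ]
            cols[8] = ";".join(kept)
            yield "\t".join(cols)
        else:
            # Comments, directives and all other features pass through.
            yield text
```

Because the UTRs are then attached by position in the file, this only works if each UTR line follows the mRNA it belongs to, as it does in the example input. The result can be written to a new file (any name will do) and passed to `agat_convert_sp_gff2gtf.pl --gff` as before.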

Juke34 commented 1 week ago

You can even remove all the three_prime_UTR and five_prime_UTR features. AGAT will re-create them from the CDS and exon coordinates.
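As a rough sketch of this alternative, the filter below simply skips every UTR line. It assumes a tab-delimited GFF3 file, the name `drop_utr_features` is made up for illustration, and it relies on AGAT rebuilding the UTRs as described above.

```python
UTR_TYPES = {"three_prime_UTR", "five_prime_UTR"}


def drop_utr_features(gff_lines):
    """Yield every GFF line except three_prime_UTR / five_prime_UTR features.

    Hypothetical helper; assumes tab-delimited GFF3 with 9 columns.
    """
    for line in gff_lines:
        text = line.rstrip("\n")
        cols = text.split("\t")
        if len(cols) == 9 and cols[2] in UTR_TYPES:
            continue  # skip the UTR feature entirely
        yield text
```

Any attributes that exist only on the original UTR lines (for example their Note values) are lost with this approach.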

sivasubramanics commented 1 week ago

Yeah, now I understand. The Parent field in the UTRs is causing this issue. Thanks. Maybe you could add this to the docs as well.