NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit
GNU General Public License v3.0

Multiple quoted entries for a single attribute #304

Closed mossconfuse closed 1 year ago

mossconfuse commented 2 years ago

I ran into a minor issue when converting a particular gff3 file to gtf with the agat_convert_sp_gff2gtf.pl script. I downloaded the .gff3 file from NCBI and ran agat as follows:

$ wget -O NC_005087.2.gff3 "ncbi.nlm.nih.gov/sviewer/viewer.cgi?db=nuccore&report=gff3&id=NC_005087.2,NC_005087.2"
$ agat_convert_sp_gff2gtf.pl -i NC_005087.2.gff3 -o NC_005087.2.gtf3

It mostly seems good:

$ head NC_005087.2.gtf3 
##gtf-version 3
##sequence-region NC_005087.2 1 122890
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=3218
NC_005087.2 RefSeq  gene    233 981 .   +   .   gene_id "nbis-gene-201"; Dbxref "GeneID:38831453"; ID "nbis-gene-201"; Name "3' rps12"; gbkey "Gene"; gene "3' rps12"; gene_biotype "other"; is_ordered "true"; locus_tag "PhpapaC_p1"; partial "true";
NC_005087.2 RefSeq  transcript  233 981 .   +   .   gene_id "nbis-gene-201"; transcript_id "gene-PhpapaC_p1"; Dbxref "GeneID:38831453"; ID "gene-PhpapaC_p1"; Name "3' rps12"; Parent "nbis-gene-201"; gbkey "Gene"; gene "3' rps12"; gene_biotype "other"; is_ordered "true"; locus_tag "PhpapaC_p1"; original_biotype "rna"; partial "true";
NC_005087.2 RefSeq  exon    233 464 .   +   .   gene_id "nbis-gene-201"; transcript_id "gene-PhpapaC_p1"; Dbxref "GeneID:38831453"; ID "id-PhpapaC_p1"; Parent "gene-PhpapaC_p1"; gbkey "exon"; gene "3' rps12"; locus_tag "PhpapaC_p1"; product "ribosomal protein S12";
NC_005087.2 RefSeq  exon    956 981 .   +   .   gene_id "nbis-gene-201"; transcript_id "gene-PhpapaC_p1"; Dbxref "GeneID:38831453"; ID "nbis-exon-1"; Parent "gene-PhpapaC_p1"; gbkey "exon"; gene "3' rps12"; locus_tag "PhpapaC_p1"; product "ribosomal protein S12";

The issue is that some of the attributes are followed by two quoted values (note the Dbxref "Genbank:NP_904165.1" "GeneID:2546745"):

$ grep '" "' NC_005087.2.gtf3|head
NC_005087.2 RefSeq  exon    1033    1500    .   +   .   gene_id "nbis-gene-33"; transcript_id "gene-PhpapaCp002"; Dbxref "Genbank:NP_904165.1" "GeneID:2546745"; ID "nbis-exon-9"; Name "NP_904165.1"; Parent "gene-PhpapaCp002"; gbkey "CDS"; gene "rps7"; locus_tag "PhpapaCp002"; product "ribosomal protein S7"; protein_id "NP_904165.1"; transl_table "11";
NC_005087.2 RefSeq  CDS 1033    1500    .   +   0   gene_id "nbis-gene-33"; transcript_id "gene-PhpapaCp002"; Dbxref "Genbank:NP_904165.1" "GeneID:2546745"; ID "cds-NP_904165.1"; Name "NP_904165.1"; Parent "gene-PhpapaCp002"; gbkey "CDS"; gene "rps7"; locus_tag "PhpapaCp002"; product "ribosomal protein S7"; protein_id "NP_904165.1"; transl_table "11";
NC_005087.2 RefSeq  exon    1033    1500    .   +   .   gene_id "nbis-gene-34"; transcript_id "gene-PhpapaCp002-2"; Dbxref "Genbank:NP_904165.1" "GeneID:2546745"; ID "nbis-exon-10"; Name "NP_904165.1"; Parent "gene-PhpapaCp002-2"; gbkey "CDS"; gene "rps7"; locus_tag "PhpapaCp002"; product "ribosomal protein S7"; protein_id "NP_904165.1"; transl_table "11";

While this is easy to fix, it took me a while to track down: it caused problems for cellranger-arc mkref, and the error file was not specific. I suspect the cause lies in the gff3 file rather than in agat, but it would help if agat could catch this early and resolve it. My workaround was:

$ sed -i.bak 's/" "GeneID/"; Dbxref_2 "GeneID/g' NC_005087.2.gtf3

I am using the latest version, v1.0.0.
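
The sed workaround above only covers a second value that starts with GeneID. A more general version might look like the following minimal Python sketch. It assumes a tab-separated, 9-column GTF file whose values contain no embedded double quotes. The function names and the `key_2`, `key_3` naming are illustrative (the suffix simply mirrors the `Dbxref_2` used above) and are not part of agat or of the GTF format.

```python
import re

# A key followed by two or more space-separated quoted values, then ';'.
# Single-valued and unquoted attributes do not match and are left as-is.
MULTI_RE = re.compile(r'(\S+) ((?:"[^"]*" )+"[^"]*");')
VALUE_RE = re.compile(r'"([^"]*)"')


def _split(match):
    """Turn `key "a" "b";` into `key "a"; key_2 "b";`."""
    key = match.group(1)
    values = VALUE_RE.findall(match.group(2))
    parts = [f'{key} "{values[0]}";']
    # Hypothetical naming scheme: later values become key_2, key_3, ...
    for i, value in enumerate(values[1:], start=2):
        parts.append(f'{key}_{i} "{value}";')
    return " ".join(parts)


def split_multi_valued(attr_field):
    """Rewrite an attribute column so every attribute has one value."""
    return MULTI_RE.sub(_split, attr_field)


def fix_gtf_line(line):
    """Apply split_multi_valued to column 9 of a single GTF line."""
    if line.startswith("#"):
        return line  # header and comment lines pass through unchanged
    newline = "\n" if line.endswith("\n") else ""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return line  # not a feature line; leave it alone
    cols[8] = split_multi_valued(cols[8])
    return "\t".join(cols) + newline
```

Applied line by line to NC_005087.2.gtf3, this should leave the single-valued lines untouched and rewrite only the attributes that carry several quoted values. Whether a downstream tool accepts the renamed `Dbxref_2` attribute is a separate question that this sketch does not answer.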

Juke34 commented 2 years ago

Hi, thank you for your question. As with GFF formats, GTF formats accept attributes with multiple values.
See https://agat.readthedocs.io/en/latest/gxf.html#gtf
Attributes must end in a semicolon which must then be separated from the start of any subsequent attribute by exactly one space character (NOT a tab character). Attributes’ values should be surrounded by double quotes.

So, it is not a bug. You had better contact the cellranger developers and ask them to adapt their code so that it can handle attributes with multiple values.
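
To illustrate what handling attributes with multiple values could look like on the consuming side, here is a minimal Python sketch of an attribute parser that collects every value of a key into a list. It is not cellranger's or agat's code. The function name `parse_attributes` is hypothetical, and the sketch assumes that values contain no embedded double quotes.

```python
import re

# Tokens in a GTF attribute column: a quoted string, a ';' separator,
# or a bare word (an attribute key or an unquoted value).
TOKEN_RE = re.compile(r'"([^"]*)"|(;)|([^\s";]+)')


def parse_attributes(attr_field):
    """Map each attribute key to the list of all its values."""
    attrs = {}
    key = None
    for match in TOKEN_RE.finditer(attr_field):
        quoted, semicolon, bare = match.groups()
        if semicolon:
            key = None  # the current attribute is finished
        elif key is None:
            # The first token after a ';' is the attribute key.
            key = bare if bare is not None else quoted
            attrs.setdefault(key, [])
        else:
            # Every further token before the next ';' is a value.
            attrs[key].append(quoted if quoted is not None else bare)
    return attrs
```

With this representation, `Dbxref "Genbank:NP_904165.1" "GeneID:2546745";` becomes a two-element list under the `Dbxref` key, and a key that appears more than once on a line accumulates its values in the same list.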