NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit
GNU General Public License v3.0

agat_sp_manage_features.pl includes empty interpro output #147

Closed Neato-Nick closed 3 years ago

Neato-Nick commented 3 years ago

I noticed that when no InterPro domain is found for an InterProScan hit, it is still added to the Dbxref list as '-'. This is easy enough to 'sed' out of the GFF, and I don't think it invalidates the GFF, but I wanted to report it anyway. Hits to CDD are a common culprit.

example ipr output

PHRA102_6673.1  02e711a4621dd8379a18b3d8eb701f9e        410     CDD     cd06093 PX_domain       302     395     4.74892E-6      T       29-06-2021      -       -
PHRA102_6673.1  02e711a4621dd8379a18b3d8eb701f9e        410     Gene3D  G3DSA:3.30.1520.10      -       298     408     2.8E-8  T       29-06-2021      IPR036871       PX domain superfamily 
PHRA102_6673.1  02e711a4621dd8379a18b3d8eb701f9e        410     SUPERFAMILY     SSF64268        PX domain       302     395     1.96E-7 T       29-06-2021      IPR036871       PX domain superfamily

corresponding gff output (entry following Gene3D hit)

Phyram_PR-102_s0005     AUGUSTUS        mRNA    3335865 3337097 .       +       .       ID=PHRA102_6673.1;Parent=PHRA102_6673;Dbxref=CDD:cd06093,Gene3D:G3DSA:3.30.1520.10,InterPro:-,InterPro:IPR036871,SUPERFAMILY:SSF64268;Name=atl63;Ontology_term=GO:0035091;locus_tag=KRP23_6786;product=RING-H2 finger protein ATL63;uniprot_id=Q9LUZ9

Edit: To remove this from the output I used sed -i -E -e 's/InterPro:-,|,InterPro:-//g' my.gff
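For anyone who prefers to do this cleanup outside of sed, here is a minimal Python sketch of the same idea. It is not part of AGAT; the function name `drop_empty_dbxrefs` is made up for illustration, and it assumes a tab-separated, nine-column GFF3 line whose attributes are separated by `;`. Unlike the sed one-liner, it also drops the whole Dbxref attribute if `InterPro:-` was its only entry.

```python
def drop_empty_dbxrefs(gff_line: str, placeholder: str = "-") -> str:
    """Remove Dbxref entries whose accession is only the placeholder (e.g. 'InterPro:-')."""
    if gff_line.startswith("#"):
        return gff_line  # comment and directive lines pass through untouched
    newline = "\n" if gff_line.endswith("\n") else ""
    columns = gff_line.rstrip("\n").split("\t")
    if len(columns) != 9:
        return gff_line  # not a feature line; leave it alone
    attributes = []
    for attribute in columns[8].split(";"):
        key, sep, value = attribute.partition("=")
        if key == "Dbxref" and sep:
            # Split "DB:accession" at the first colon only, so accessions
            # such as "G3DSA:3.30.1520.10" stay intact.
            kept = [x for x in value.split(",") if x.partition(":")[2] != placeholder]
            if not kept:
                continue  # every entry was a placeholder: drop the attribute
            attribute = "Dbxref=" + ",".join(kept)
        attributes.append(attribute)
    columns[8] = ";".join(attributes)
    return "\t".join(columns) + newline
```

Applied line by line to a GFF file, this leaves all other attributes and all non-feature lines unchanged.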

Neato-Nick commented 3 years ago

I also have a separate question that is still related to parsing of the attributes column.

I noticed database references are added as "Dbxref". Is this distinct from the "db_xref" that GenBank uses, following INSDC standards? This is another thing that is easy for me to fix with a simple string substitution (or with _manage_attributes.pl ;) ).

Juke34 commented 3 years ago

Hi, we can definitely fix the problem and skip the - in the output.

Yes, true: we originally use Dbxref to be compliant with the GFF3 specification and with genome browsers like WebApollo. INSDC uses the tag db_xref instead, but it is exactly the same thing, except that INSDC accepts only information from specific databases in this attribute while GFF3 does not care. When submitting to an INSDC archive we go through ENA (some prefer NCBI) and use the EMBLmyGFF3 tool (https://github.com/NBISweden/EMBLmyGFF3) to prepare the required EMBL file. During the conversion we translate some attributes to match the terms INSDC expects (see https://github.com/NBISweden/EMBLmyGFF3/blob/master/EMBLmyGFF3/modules/translation_gff_attribute_to_embl_qualifier.json); for example, Dbxref is translated into db_xref.
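To make the translation step concrete, here is a rough Python sketch of renaming GFF3 attribute keys to INSDC-style qualifier names. This is not EMBLmyGFF3's actual code: the `ATTRIBUTE_TO_QUALIFIER` table and the `translate_attributes` function are hypothetical, and the sketch only shows the key mapping, not the generation of an EMBL file.

```python
# Hypothetical one-entry mapping, mirroring the Dbxref -> db_xref example.
ATTRIBUTE_TO_QUALIFIER = {"Dbxref": "db_xref"}


def translate_attributes(attribute_column, mapping=ATTRIBUTE_TO_QUALIFIER):
    """Return (qualifier, value) pairs for the attributes that have a mapping."""
    pairs = []
    for attribute in attribute_column.split(";"):
        key, sep, value = attribute.partition("=")
        if not sep or key not in mapping:
            continue  # attributes without a mapping are ignored in this sketch
        # GFF3 allows comma-separated multi-values; emit one pair per value.
        for single in value.split(","):
            pairs.append((mapping[key], single))
    return pairs
```

For the attribute string `ID=x;Dbxref=CDD:cd06093,InterPro:IPR036871`, this yields two `db_xref` pairs and skips `ID`, because `ID` has no entry in the assumed mapping.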

Neato-Nick commented 3 years ago

Ah, ok. The output looked very close to the INSDC standard but slightly different, so that makes sense.

I'm curious, during the EMBLmyGFF3 conversion, do you also move the value of the uniprot_id= tag into the db_xref list?

NCBI now has a table2asn_GFF tool, so the GAG tool you reference in EMBLmyGFF3 will, thankfully, soon no longer be necessary. I've been testing GFFs from AGAT directly through that NCBI tool: https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/

Juke34 commented 3 years ago

I'm curious, during the EMBLmyGFF3 conversion, do you also move the value of the uniprot_id= tag into the db_xref list?

Not by default, but everything is possible within EMBLmyGFF3 ^^ You just need to tune the proper "mapping file". In this case that is the translation_gff_attribute_to_embl_qualifier.json file, which you can access by running EMBLmyGFF3 --expose_translations. Then add the following information:

"uniprot_id": {
    "source description": "uniprot database cross reference.",
    "target": "db_xref",
    "dev comment": "Nothing special to say here"
},
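If you would rather not hand-edit the exposed file, the entry above could also be added with a few lines of Python. This is only a sketch: it assumes the JSON file's top level is an object keyed by GFF attribute name (as the fragment above suggests), and `add_translation` is a made-up helper, not part of EMBLmyGFF3.

```python
import json
from pathlib import Path

# The entry from the comment above, as a Python dict.
UNIPROT_ENTRY = {
    "source description": "uniprot database cross reference.",
    "target": "db_xref",
    "dev comment": "Nothing special to say here",
}


def add_translation(json_path, attribute, entry):
    """Add or overwrite one attribute entry in the mapping file; return the updated mapping."""
    path = Path(json_path)
    translations = json.loads(path.read_text(encoding="utf-8"))
    translations[attribute] = entry  # assumes a top-level JSON object
    path.write_text(json.dumps(translations, indent=4) + "\n", encoding="utf-8")
    return translations
```

Called as `add_translation("translation_gff_attribute_to_embl_qualifier.json", "uniprot_id", UNIPROT_ENTRY)` on the exposed copy, it rewrites the file in place, so keep a backup if the original formatting matters to you.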

Neato-Nick commented 3 years ago

Ok, great! Thanks for answering my questions; this has clarified a lot for me.
