Closed Neato-Nick closed 3 years ago
I also have a separate problem, though it is still related to parsing of the attributes column.
I noticed database references are added as "Dbxref:". Is this distinct from the "db_xref:" that GenBank uses, following INSDC standards? This is another thing that is easy for me to fix with a simple string substitution (or with _manage_attributes.pl ;) )
Hi, we can definitely fix the problem and skip the - in the output.
Yes, true: we use Dbxref, originally to be compliant with the GFF3 specification and genome browsers like Webapollo. INSDC uses the tag db_xref instead, but it is exactly the same thing, except that INSDC only accepts information from specific databases in this attribute, while GFF3 does not care.
When submitting to an INSDC database archive we use the ENA gate (some prefer the NCBI), and use the EMBLmyGFF3 tool (https://github.com/NBISweden/EMBLmyGFF3) to prepare the required EMBL file. During the conversion we translate some attributes to match the terms INSDC expects (see https://github.com/NBISweden/EMBLmyGFF3/blob/master/EMBLmyGFF3/modules/translation_gff_attribute_to_embl_qualifier.json); as an example, Dbxref is translated into db_xref.
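For readers who only need the tag rename in a GFF file (the "simple string substitution" mentioned earlier), here is a minimal Python sketch. It is not how EMBLmyGFF3 works internally, since that tool writes EMBL qualifiers rather than GFF; the function name `rename_attribute_key` is hypothetical and only illustrates renaming a key in the attributes column while leaving the values untouched.

```python
def rename_attribute_key(attributes: str, old: str = "Dbxref", new: str = "db_xref") -> str:
    """Rename one tag in a GFF3 attributes string (column 9), keeping its values."""
    renamed = []
    for field in attributes.strip().split(";"):
        if not field:
            continue  # tolerate a trailing ';'
        key, sep, value = field.partition("=")
        if key == old:
            key = new
        renamed.append(f"{key}{sep}{value}")
    return ";".join(renamed)


example = "ID=gene1;Dbxref=InterPro:IPR000001,Pfam:PF00001;Name=abc"
print(rename_attribute_key(example))
```

Matching on the exact key (rather than a blind text replace) avoids touching a value that happens to contain the string "Dbxref".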
Ah, ok. It was looking like this output was very close to INSDC standard but slightly different, and that makes sense.
I'm curious, during the EMBLmyGFF3 conversion, do you also move the value of the uniprot_id= tag into the db_xref list?
NCBI now has a table2asn_GFF tool, so the GAG tool you reference in EMBLmyGFF3 will, thankfully, soon no longer be necessary. I've been testing GFFs from AGAT directly through that NCBI tool: https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/
I'm curious, during the emblmygff3 conversion, do you also move the value of the uniprot_id= tag into the db_xref list?
Not by default, but everything is possible within EMBLmyGFF3 ^^ You just need to tune the proper "mapping file". In this case it is the translation_gff_attribute_to_embl_qualifier.json file, which you can access by running EMBLmyGFF3 --expose_translations. Then add the following information:
"uniprot_id": {
    "source description": "uniprot database cross reference.",
    "target": "db_xref",
    "dev comment": "Nothing special to say here"
},
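If you would rather script that edit than do it by hand, the following is a rough Python sketch. It assumes the exposed file is a single JSON object keyed by GFF attribute name, which is what the snippet above suggests; the helper `add_translation` and the stand-in file content are hypothetical, so check the real file before relying on it.

```python
import json
import tempfile
from pathlib import Path


def add_translation(json_path, gff_tag, target, description):
    """Add (or overwrite) one GFF-attribute entry in a translation JSON file."""
    path = Path(json_path)
    mapping = json.loads(path.read_text())
    mapping[gff_tag] = {
        "source description": description,
        "target": target,
        "dev comment": "Nothing special to say here",
    }
    path.write_text(json.dumps(mapping, indent=2) + "\n")
    return mapping


# Demo on a simplified stand-in file, NOT the real EMBLmyGFF3 content.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "translation_gff_attribute_to_embl_qualifier.json"
    demo_file.write_text(json.dumps({"Dbxref": {"target": "db_xref"}}))
    updated = add_translation(
        demo_file, "uniprot_id", "db_xref", "uniprot database cross reference."
    )
    print(sorted(updated))
```

Going through `json.loads` and `json.dumps` rather than pasting text keeps the file valid JSON, so a stray trailing comma cannot sneak in.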
Ok, great! Thanks for answering my questions, this has clarified a lot for me
I noticed that when no InterPro domain is found for an InterProScan hit, it is still added to the Dbxref list as '-'. This is easy enough to 'sed' out of the GFF, and I don't think it invalidates the GFF, but I wanted to report it anyway. Hits to CDD are a common culprit of this.
example ipr output
corresponding gff output (entry following Gene3D hit)
Edit: To remove this from the output I used
sed -i -E -e 's/InterPro:-,|,InterPro:-//g' my.gff
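The sed one-liner covers the case where InterPro:- sits next to other entries. As a hedged alternative, here is a small Python sketch that also handles a Dbxref whose only entry is the placeholder. Note that it generalizes beyond the sed command: it drops any cross reference whose identifier is '-', not just InterPro ones, which is an assumption on my part. The function names are hypothetical.

```python
def drop_empty_dbxrefs(attributes: str, tag: str = "Dbxref") -> str:
    """Remove cross references whose identifier is '-' from an attributes string."""
    kept_fields = []
    for field in attributes.strip().split(";"):
        if not field:
            continue  # tolerate a trailing ';'
        key, sep, value = field.partition("=")
        if key == tag:
            # "InterPro:-" splits into ["InterPro", "-"]; keep only real identifiers.
            refs = [r for r in value.split(",") if r and r.split(":", 1)[-1] != "-"]
            if not refs:
                continue  # nothing real left, so drop the whole tag
            value = ",".join(refs)
        kept_fields.append(f"{key}{sep}{value}")
    return ";".join(kept_fields)


def clean_gff_line(line: str) -> str:
    """Apply drop_empty_dbxrefs to column 9 of a feature line; pass other lines through."""
    if line.startswith("#"):
        return line
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        return line
    columns[8] = drop_empty_dbxrefs(columns[8]) or "."  # '.' marks an empty column
    return "\t".join(columns) + "\n"


print(drop_empty_dbxrefs("ID=m1;Dbxref=Gene3D:G3DSA:1.10.10.10,InterPro:-"))
```

To clean a whole file, one could read my.gff line by line and write `clean_gff_line(line)` for each line to a new file.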