Open nathandunn opened 4 years ago
The essence of the task would be built around two programs, table2asn and asnvalidate. Documentation is at: https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/
The binaries are at: https://ftp.ncbi.nlm.nih.gov/asn1-converters/by_program/table2asn_GFF/ https://ftp.ncbi.nlm.nih.gov/asn1-converters/by_program/asnvalidate/
table2asn takes GFF3 (or other formats) + FASTA and converts them to ASN.1; asnvalidate then makes sure the result is sound.
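A minimal sketch of the two-step convert-then-validate flow described above, written as command builders so it can be inspected without the binaries installed. The flags (`-i`, `-f`, `-o`, `-t` for table2asn; `-i`, `-o` for asnvalidate) are as I recall them from the NCBI documentation linked above and should be treated as assumptions to check against each program's `-help` output. The function names and file names are hypothetical.

```python
from typing import List, Optional


def table2asn_cmd(fasta: str, gff3: str, out_sqn: str,
                  template: Optional[str] = None) -> List[str]:
    """Build a table2asn command line (flags are assumptions, verify with -help)."""
    cmd = ["table2asn", "-i", fasta, "-f", gff3, "-o", out_sqn]
    if template is not None:
        # Submission template (.sbt); assumed to be passed with -t.
        cmd += ["-t", template]
    return cmd


def asnvalidate_cmd(sqn: str, report: str) -> List[str]:
    """Build an asnvalidate command line (flags are assumptions, verify with -help)."""
    return ["asnvalidate", "-i", sqn, "-o", report]
```

Each list could then be handed to `subprocess.run(cmd, check=True)`, running the conversion first and the validation second on its output.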
Doing a whole genome takes time, but you could snip out the region of FASTA + its annotation (adjusting coordinates for any 5’ trimming done to the FASTA), convert, and validate just one gene. That would be quick. For actual submission to GenBank, a submitter typically pre-processes to ASN.1 and uploads that to the GenBank submission portal, or I think it can now take GFF3 + FASTA and do all the work for you (but there are typically issues to fix, so I think it’s easier to do it beforehand). Integrating that into Apollo, and/or making sure that it produces submission-compliant GFF3, would be a big plus, too.
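To illustrate the "snip out the region and adjust coordinates" idea, here is a rough sketch in plain Python. It assumes a single sequence held in memory as a string, 1-based inclusive GFF3 coordinates, and that only features lying entirely inside the window are kept. The function name `snip_region` is hypothetical and nothing here is Apollo code.

```python
from typing import Iterable, List, Tuple


def snip_region(seq: str, gff_lines: Iterable[str],
                start: int, end: int) -> Tuple[str, List[str]]:
    """Cut seq to [start, end] (1-based, inclusive) and shift GFF3 features to match."""
    offset = start - 1              # number of bases trimmed from the 5' end
    sub_seq = seq[offset:end]
    shifted: List[str] = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue                # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue                # not a well-formed GFF3 feature line
        f_start, f_end = int(cols[3]), int(cols[4])
        if f_start < start or f_end > end:
            continue                # feature is not fully inside the window
        cols[3] = str(f_start - offset)
        cols[4] = str(f_end - offset)
        shifted.append("\t".join(cols))
    return sub_seq, shifted
```

The trimmed sequence and shifted lines would then be written out as the FASTA and GFF3 inputs for the conversion step. A real implementation would also need to filter on the seqid column and decide how to treat features that straddle the window boundary.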
This would be really useful in principle. Just to clarify, though - the use case here is to do an initial sanity check on individual annotations, right? As opposed to exporting annotations that are 'NCBI-submission ready' and will display what the annotator/submitter expects on NCBI's nucleotide and protein pages after submission. The latter would require a lot more work on Apollo's part, but would certainly also be useful (I'm happy to comment on what the additional work is if needed).
Assuming you're just considering the former:
Name, symbol and description attributes. NCBI handles the concepts that are represented by these fields somewhat differently than Apollo does. If you don't do some initial reformatting, you might get warnings or errors. Also, I think it would be great if Apollo could use this feature as a step towards remodeling the Apollo metadata to be a bit more NCBI-compatible. Here's how I would recommend handling the Name, symbol and description attributes for validation (and I'm happy to stand corrected on these):
description attribute. This will show as gene_desc in asn and /note in the NCBI flatfile.
product attribute on mRNA and corresponding CDS features.
product attribute
gene_synonym or gene attribute.
Note attribute.
description attributes. (That said, annotators often misinterpret this field as other notes...)
Note attribute.
To comment a bit on validation of individual annotations for annotators vs. NCBI-ready export of all annotations for the admin: I think both would be useful. As an admin user, I could imagine asking our annotators to validate their annotations before they sign off on them, and address any error output that they'd receive. That would ease the NCBI submission for me downstream. One aspect that might need refinement is how the output from the validation is presented to users - would your average annotator be able to interpret the output and know how to correct the error (e.g. Warning: valid [SEQ_FEAT.PartialProblemNotSpliceConsensus3Prime] 3' partial is not at end of sequence, gap, or consensus splice site FEATURE: CDS: Putative dual specificity mitogen-activated protein kinase kinase 7-like <1579> [(lcl|Scaffold446.1:c199703-199559, c196763-196612, c190299-190186, c188294-188067, c186634-186455, c185566-<185450)] [lcl|Scaffold446.1: delta, dna len= 478295] -> [gnl|A483|HHAL011855-PA])? That might just require more user training by the admin, though. However, also having the piece where all exported annotations are more NCBI-submission ready would be super helpful, and would go a bit further towards Apollo-NCBI integration.
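On the presentation question, one option is to split each validator line into severity, error code, message and feature context before showing it to annotators. The sketch below assumes every line follows the shape of the single example quoted above (`Severity: valid [CATEGORY.Code] message FEATURE: ...`); the real asnvalidate output may vary, and the names `VAL_LINE` and `parse_validation_line` are hypothetical.

```python
import re
from typing import Dict, Optional

# Pattern inferred from the one example in this thread, not from a format spec.
VAL_LINE = re.compile(
    r"^(?P<severity>\w+): valid "
    r"\[(?P<code>[A-Za-z0-9_]+\.[A-Za-z0-9_]+)\]\s+"
    r"(?P<message>.*?)"
    r"(?:\s+FEATURE:\s+(?P<feature>.*))?$"
)


def parse_validation_line(line: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the parts of one validator line, or None if it does not match."""
    match = VAL_LINE.match(line.strip())
    return match.groupdict() if match else None
```

With the pieces separated, the UI could show the severity and message up front, link the error code to documentation, and keep the raw feature location as expandable detail.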
the use case here is to do an initial sanity check on individual annotations, right? As opposed to exporting annotations that are 'NCBI-submission ready' and will display what the annotator/submitter expects on NCBI's nucleotide and protein pages after submission.
I would like to do both. I think doing it at the level of individual annotations would be ideal (on-demand). Doing a final export version might be easier, but like you said, flagging potential problems will be the more difficult part.
https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/asnval_8cpp_source.html
The problem is that this is C++ and we have to convert these to ASN. Distributing the C++ binaries prebuilt might be an option (especially for Docker), though.