GMOD / Apollo

Genome annotation editor with a Java Server backend and a Javascript client that runs in a web browser as a JBrowse plugin.
http://genomearchitect.readthedocs.io/

asnvalidate could be used to provide instant validation on changes #2348

Open nathandunn opened 4 years ago

nathandunn commented 4 years ago

https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/asnval_8cpp_source.html

The problem is that asnvalidate is written in C++, and we would first have to convert our annotations to ASN.1. Distributing prebuilt C++ binaries might be an option, though (especially for Docker).

nathandunn commented 4 years ago

The essence of the task would be built around two programs, table2asn and asnvalidate. Documentation is at: https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/

The binaries are at:
https://ftp.ncbi.nlm.nih.gov/asn1-converters/by_program/table2asn_GFF/
https://ftp.ncbi.nlm.nih.gov/asn1-converters/by_program/asnvalidate/

table2asn takes GFF3 (or other formats) + FASTA and converts them to ASN.1.
asnvalidate then makes sure the result is sound.
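A minimal sketch of chaining the two programs from a script, assuming both binaries are on the PATH. The flag names (`-i`, `-f`, `-o`) are as I recall them from the NCBI documentation linked above and should be checked against each program's `-help` output. The function names and output file names are hypothetical, and a real submission would need more options (a template file, for example).

```python
import subprocess
from pathlib import Path


def build_commands(fasta, gff3, workdir,
                   table2asn="table2asn", asnvalidate="asnvalidate"):
    """Build the two command lines without running them.

    ASSUMPTION: -i / -f / -o are the input, annotation, and output flags;
    verify against the installed binaries before relying on this.
    """
    workdir = Path(workdir)
    sqn = workdir / "annotation.sqn"   # hypothetical ASN.1 output name
    report = workdir / "annotation.val"  # hypothetical report name
    convert = [table2asn, "-i", str(fasta), "-f", str(gff3), "-o", str(sqn)]
    validate = [asnvalidate, "-i", str(sqn), "-o", str(report)]
    return convert, validate, report


def run_validation(fasta, gff3, workdir):
    """Convert GFF3 + FASTA to ASN.1, validate it, and return the report text."""
    convert, validate, report = build_commands(fasta, gff3, workdir)
    subprocess.run(convert, check=True)
    # The exit-code convention of asnvalidate is not confirmed here,
    # so the return code is not treated as fatal.
    subprocess.run(validate, check=False)
    return report.read_text() if report.exists() else ""
```

Keeping command construction separate from execution makes it possible to log or inspect the exact command lines before anything is run.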

Doing a whole genome takes time, but you could snip out a region of the FASTA plus its annotation (adjusting coordinates for any 5’ trimming done to the FASTA), then convert and validate just one gene. That would be quick.

For actual submission to GenBank, a submitter typically pre-processes to ASN.1 and uploads that to the GenBank submission portal. I think the portal can now also take GFF3 + FASTA and do all the work for you, but there are typically issues to fix, so I think it’s easier to do the conversion beforehand. Integrating that into Apollo, and/or making sure that Apollo produces submission-compliant GFF3, would be a big plus, too.
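The single-gene idea (snipping out a region and adjusting coordinates for the 5’ trimming) could look roughly like the sketch below. It assumes the sequence is already loaded as a string and the GFF3 as a list of lines; `snip_region` is a hypothetical helper, not an existing Apollo function, and it keeps only features that lie entirely inside the region.

```python
def snip_region(sequence, gff3_lines, seqid, start, end):
    """Cut out [start, end] (1-based, inclusive) and shift GFF3 coordinates.

    Returns the sub-sequence and the GFF3 feature lines that fall entirely
    inside the region, renumbered so the region starts at position 1.
    """
    offset = start - 1                 # bases trimmed from the 5' end
    subseq = sequence[offset:end]
    shifted = []
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue                   # skip directives, comments, blanks
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[0] != seqid:
            continue                   # malformed line or other sequence
        feat_start, feat_end = int(cols[3]), int(cols[4])
        if feat_start < start or feat_end > end:
            continue                   # feature extends outside the region
        cols[3] = str(feat_start - offset)
        cols[4] = str(feat_end - offset)
        shifted.append("\t".join(cols))
    return subseq, shifted
```

The sub-sequence and shifted lines can then be written to temporary FASTA and GFF3 files and passed through the converter and validator, which keeps each check small enough to run on demand.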

mpoelchau commented 4 years ago

This would be really useful in principle. Just to clarify, though - the use case here is to do an initial sanity check on individual annotations, right? As opposed to exporting annotations that are 'NCBI-submission ready' and will display what the annotator/submitter expects on NCBI's nucleotide and protein pages after submission. The latter would require a lot more work on Apollo's part, but would certainly also be useful (I'm happy to comment on what the additional work is if needed).

Assuming you're just considering the former:

To comment a bit on validation of individual annotations for annotators vs. NCBI-ready export of all annotations for the admin: I think both would be useful. As an admin user, I could imagine asking our annotators to validate their annotations before they sign off on them, and to address any error output that they receive. That would ease the NCBI submission for me downstream.

One aspect that might need refinement is how the output from the validation is presented to users. Would your average annotator be able to interpret output like the following and know how to correct the error?

Warning: valid [SEQ_FEAT.PartialProblemNotSpliceConsensus3Prime] 3' partial is not at end of sequence, gap, or consensus splice site FEATURE: CDS: Putative dual specificity mitogen-activated protein kinase kinase 7-like <1579> [(lcl|Scaffold446.1:c199703-199559, c196763-196612, c190299-190186, c188294-188067, c186634-186455, c185566-<185450)] [lcl|Scaffold446.1: delta, dna len= 478295] -> [gnl|A483|HHAL011855-PA]

That might just require more user training by the admin, though. However, also having the piece where all exported annotations are more NCBI-submission ready would be super helpful, and would go a bit further towards Apollo-NCBI integration.
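One way to make a message like the one quoted above easier to present is to split it into severity, error code, description, and feature context before showing it to the annotator. The sketch below is based only on the layout of that single example, so the pattern is an assumption and may not cover every line asnvalidate emits; `parse_validator_line` is a hypothetical name.

```python
import re

# Pattern inferred from one example message; treat it as an assumption.
MESSAGE_RE = re.compile(
    r"^(?P<severity>\w+): valid "
    r"\[(?P<code>[\w.]+)\] "
    r"(?P<message>.*?)"
    r"(?: FEATURE: (?P<feature>.*))?$"
)


def parse_validator_line(line):
    """Split one validator message into its parts, or return None."""
    match = MESSAGE_RE.match(line.strip())
    if match is None:
        return None
    parts = match.groupdict()
    # The code looks like MODULE.ErrorName; keep both halves for grouping.
    module, _, name = parts["code"].partition(".")
    parts["module"] = module
    parts["name"] = name
    return parts
```

With the pieces separated, a client could show the short description next to the affected feature and keep the raw error code available for users who want to look it up.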

nathandunn commented 4 years ago

> the use case here is to do an initial sanity check on individual annotations, right? As opposed to exporting annotations that are 'NCBI-submission ready' and will display what the annotator/submitter expects on NCBI's nucleotide and protein pages after submission.

I would like to do both. I think doing it at the level of individual annotations (on demand) would be ideal. Doing a final export version might be easier, but as you said, flagging potential problems will be the more difficult part.