asnvalidate could be used to provide instant validation on changes

nathandunn commented 4 years ago

https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/asnval_8cpp_source.html

The problem is that this is C++, and we would first have to convert our annotations to ASN.1. Distributing the prebuilt C++ binaries might be an option (especially for Docker), though.

nathandunn commented 4 years ago

The essence of the task would be built around two programs, table2asn and asnvalidate. Documentation is at: https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/

The binaries are at:
- https://ftp.ncbi.nlm.nih.gov/asn1-converters/by_program/table2asn_GFF/
- https://ftp.ncbi.nlm.nih.gov/asn1-converters/by_program/asnvalidate/

table2asn takes GFF3 (or other formats) + FASTA and converts them to ASN.1; asnvalidate then makes sure the result is sound.
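A minimal sketch of how the two steps could be chained from a script. The flags used here (`-i`, `-f`, `-t`, `-o` for table2asn and `-i`, `-o` for asnvalidate) are the ones I recall from NCBI's usage examples and should be checked against the installed binaries. The function names, the output file names and the use of a submission template are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def build_commands(fasta, gff3, workdir, template=None):
    """Return (table2asn_cmd, asnvalidate_cmd) as argument lists.

    File names under workdir are hypothetical; the flags are assumptions
    to verify against `table2asn -help` and `asnvalidate -help`.
    """
    workdir = Path(workdir)
    sqn = workdir / "annot.sqn"   # ASN.1 written by table2asn
    val = workdir / "annot.val"   # report written by asnvalidate

    table2asn_cmd = ["table2asn", "-i", str(fasta), "-f", str(gff3), "-o", str(sqn)]
    if template is not None:
        # Optional submission template (.sbt)
        table2asn_cmd += ["-t", str(template)]

    asnvalidate_cmd = ["asnvalidate", "-i", str(sqn), "-o", str(val)]
    return table2asn_cmd, asnvalidate_cmd


def convert_and_validate(fasta, gff3, workdir, template=None):
    """Run both steps in order; raises CalledProcessError if either fails."""
    for cmd in build_commands(fasta, gff3, workdir, template):
        subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution means the argument lists can be inspected or logged without the binaries being installed.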

Doing a whole genome takes time, but you could snip out the region of FASTA + its annotation (adjusting coordinates for any 5’ trimming done to the FASTA), convert, and validate just one gene. That would be quick. For actual submission to GenBank, a submitter typically pre-processes to ASN.1 and uploads that to the GenBank submission portal, or I think it can now take GFF3 + FASTA and do all the work for you (but there are typically issues to fix, so I think it’s easier to do it beforehand). Integrating that into Apollo, and/or making sure that it produces submission-compliant GFF3, would be a big plus, too.
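The "snip out the region and adjust coordinates" idea above could look roughly like this. It is only a sketch: `snip_region` is a hypothetical helper, it assumes the sequence is already loaded as a string, and it simply drops features that are not fully inside the region rather than clipping them.

```python
def snip_region(seq, gff_lines, seqid, start, end):
    """Cut seq to [start, end] (1-based, inclusive) and shift GFF3 features.

    Returns (sub_sequence, shifted_gff_lines). Features on other sequences,
    or extending beyond the region, are left out.
    """
    offset = start - 1            # number of bases trimmed from the 5' end
    sub_seq = seq[offset:end]
    shifted = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue              # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[0] != seqid:
            continue
        f_start, f_end = int(cols[3]), int(cols[4])
        if f_start < start or f_end > end:
            continue              # not fully contained in the region
        cols[3] = str(f_start - offset)
        cols[4] = str(f_end - offset)
        shifted.append("\t".join(cols))
    return sub_seq, shifted
```

The snipped FASTA and shifted GFF3 could then be passed through table2asn and asnvalidate for that one gene only.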

mpoelchau commented 4 years ago

This would be really useful in principle. Just to clarify, though - the use case here is to do an initial sanity check on individual annotations, right? As opposed to exporting annotations that are 'NCBI-submission ready' and will display what the annotator/submitter expects on NCBI's nucleotide and protein pages after submission. The latter would require a lot more work on Apollo's part, but would certainly also be useful (I'm happy to comment on what the additional work is if needed).

Assuming you're just considering the former:

- table2asn_GFF also does validation (at least previous versions did), but if NCBI recommends converting to ASN.1 first and then using asnvalidate, then do what they suggest.
- One of the issues that the validator might run into is names.
  - The main issue for 'instant validation' is reformatting the Name, symbol and description attributes. NCBI handles the concepts that are represented by these fields somewhat differently than Apollo does. If you don't do some initial reformatting, you might get warnings or errors. Also, I think it would be great if Apollo could use this feature as a step towards remodeling the Apollo metadata to be a bit more NCBI-compatible. Here's how I would recommend handling the Name, symbol and description attributes for validation (and I'm happy to stand corrected on these):
    - gene name attributes: This means a short name, but not an abbreviation (e.g. ultraspiracle). Move to description attribute. This will show as gene_desc in ASN.1 and /note in the NCBI flatfile.
    - mRNA name attributes: Move to product attribute on mRNA and corresponding CDS features.
    - other name attributes: I think other transcript types should also move the name to the product attribute.
    - gene symbol attributes: Gene symbols are short abbreviations for a name (e.g. usp). Move to gene_synonym or gene attribute.
    - mRNA and other symbol attributes: NCBI does not accept these for eukaryotes afaik (but possibly for prokaryotes). Not sure how Apollo would want to handle this. Internally, what I do is move them to a Note attribute.
    - gene description attributes: These should represent the 'gene full name'. Leave as description attributes. (That said, annotators often misinterpret this field as other notes...)
    - mRNA and other description attributes: Move to Note attribute.
  - There's the broader issue of whether the names conform to the INSDC's naming rules (https://www.ncbi.nlm.nih.gov/genome/doc/internatprot_nomenguide/) - in the context of an instant validator, though, this is probably out of scope.
- Pseudogenes may also become an issue because they're handled quite differently in NCBI. See http://www.insdc.org/documents/pseudogene-qualifier-vocabulary. For most of the pseudogenes that we have at the i5k Workspace, the annotators don't necessarily know why they're pseudogenes (e.g. are they actually pseudogenes or is the assembly bad). We model these as gene -> mRNA -> exon (no CDS and polypeptides) and attach the qualifier pseudogene=unknown. Having Apollo treat pseudogenes differently might be a separate issue though.
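The Name/symbol/description remapping suggested in the list above could be sketched as a small lookup. Several things here are assumptions made for illustration: the function name, the set of transcript types, choosing `gene` rather than `gene_synonym` for gene symbols, and merging values when two source attributes land on the same target. Copying the mRNA product onto its CDS features needs parent/child links and is left out.

```python
# Per-feature rules taken from the recommendations above.
GENE_RULES = {"Name": "description", "symbol": "gene", "description": "description"}
TRANSCRIPT_RULES = {"Name": "product", "symbol": "Note", "description": "Note"}

# Illustrative set only; the real list of transcript types is an assumption.
TRANSCRIPT_TYPES = {"mRNA", "ncRNA", "tRNA", "rRNA", "transcript"}


def remap_attributes(feature_type, attrs):
    """Return a new attribute dict (key -> list of values) with NCBI-style keys.

    Attributes without a rule, and feature types without rules, pass through
    unchanged. Values that collide on one target key are concatenated.
    """
    if feature_type == "gene":
        rules = GENE_RULES
    elif feature_type in TRANSCRIPT_TYPES:
        rules = TRANSCRIPT_RULES
    else:
        rules = {}
    out = {}
    for key, values in attrs.items():
        target = rules.get(key, key)
        out.setdefault(target, []).extend(values)
    return out
```

For example, a gene with `Name=ultraspiracle` and `symbol=usp` would come out with `description=ultraspiracle` and `gene=usp`.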

To comment a bit on validation of individual annotations for annotators vs. NCBI-ready export of all annotations for the admin: I think both would be useful. As an admin user, I could imagine asking our annotators to validate their annotations before they sign off on them, and address any error output that they'd receive. That would ease the NCBI submission for me downstream. One aspect that might need refinement is how the output from the validation is presented to users - would your average annotator be able to interpret the output and know how to correct the error (e.g. Warning: valid [SEQ_FEAT.PartialProblemNotSpliceConsensus3Prime] 3' partial is not at end of sequence, gap, or consensus splice site FEATURE: CDS: Putative dual specificity mitogen-activated protein kinase kinase 7-like <1579> [(lcl|Scaffold446.1:c199703-199559, c196763-196612, c190299-190186, c188294-188067, c186634-186455, c185566-<185450)] [lcl|Scaffold446.1: delta, dna len= 478295] -> [gnl|A483|HHAL011855-PA])? That might just require more user training by the admin, though. However, also having the piece where all exported annotations are more NCBI-submission ready would be super helpful, and would go a bit further towards Apollo-NCBI integration.
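On presenting validator output to annotators: one option is to split each message into severity, error code, short text and feature context, so the UI can show the short text first. The sketch below infers the line format from the single example quoted above, so the pattern is an assumption and may not cover every message asnvalidate emits. `ValidatorMessage` and `parse_validator_line` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class ValidatorMessage(NamedTuple):
    severity: str   # e.g. "Warning"
    category: str   # e.g. "SEQ_FEAT"
    code: str       # e.g. "PartialProblemNotSpliceConsensus3Prime"
    text: str       # human-readable explanation
    context: str    # feature description and location, may be empty


# Pattern inferred from one example message; treat it as a starting point.
_LINE = re.compile(
    r"^(?P<severity>\w+): valid \[(?P<category>[A-Z_]+)\.(?P<code>\w+)\] (?P<rest>.*)$"
)


def parse_validator_line(line: str) -> Optional[ValidatorMessage]:
    """Parse one validator line, or return None if it does not match."""
    m = _LINE.match(line.strip())
    if m is None:
        return None
    # Everything after " FEATURE: " describes the affected feature.
    text, _, context = m.group("rest").partition(" FEATURE: ")
    return ValidatorMessage(m.group("severity"), m.group("category"),
                            m.group("code"), text, context)
```

The `category.code` pair could also serve as a key into admin-written help text explaining how to fix each kind of problem.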

nathandunn commented 4 years ago

> the use case here is to do an initial sanity check on individual annotations, right? As opposed to exporting annotations that are 'NCBI-submission ready' and will display what the annotator/submitter expects on NCBI's nucleotide and protein pages after submission.

I would like to do both. I think doing it at the level of individual annotations (on demand) would be ideal. Doing a final export version might be easier, but like you said, flagging potential problems will be the more difficult part.

GMOD / Apollo

asnvalidate could be used to provide instant validation on changes #2348