Specific annotation requirements for viral TPA submissions

taltman commented 3 years ago

From the Handbook: https://www.ncbi.nlm.nih.gov/books/NBK53714/#gbankquickstart.i_have_viral_sequence_da

[ ] CDS feature(s) with product name(s), nucleotide locations, and amino acid translation(s) of all coding regions (showing start and stop codons, if present)
[ ] Gene symbol(s), if known

The information listed above should be applied to any virus submission.

If no coding region is present, provide another description of the sequence.

If any of this information is not known, inform us at the time of your submission.

See an online example of viral sequence submission annotation.

Furthermore, the FASTA deflines should clearly indicate the primary sequence identifier:

>SEQ1 [org=coronavirus ABC123] [SRA=SRRXXXXXX1,SRRXXXXXX2]
ATGGTGTTTATAACACACACCTTAACCTACGACCTGGCAATCTTCTTGGCCACCTTAATAACGGCCTTTG
TAATTTACATAAAATGGGTGTACACATACTGGCAAAGAAAAGGTCTTGCTACAGAACCAACAGTCGTCCC
...
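
A minimal sketch of how such a defline could be checked before submission, assuming the bracketed `[key=value]` modifier layout shown above. The function name `parse_defline` is hypothetical, and the script only splits the line; it does not know which modifiers GenBank actually requires.

```python
import re

# Matches one bracketed modifier such as [org=coronavirus ABC123].
MODIFIER_RE = re.compile(r"\[([^=\[\]]+)=([^\[\]]*)\]")


def parse_defline(defline):
    """Split a FASTA defline into (sequence identifier, modifier dict)."""
    if not defline.startswith(">"):
        raise ValueError("a FASTA defline must start with '>'")
    body = defline[1:].strip()
    # The identifier is the first run of characters before whitespace or '['.
    id_match = re.match(r"[^\s\[]+", body)
    seq_id = id_match.group(0) if id_match else ""
    modifiers = {key.strip(): value.strip() for key, value in MODIFIER_RE.findall(body)}
    return seq_id, modifiers


seq_id, modifiers = parse_defline(
    ">SEQ1 [org=coronavirus ABC123] [SRA=SRRXXXXXX1,SRRXXXXXX2]"
)
print(seq_id, modifiers)
```

An empty `seq_id` or a missing modifier is then easy to flag across a whole batch of files.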

Double check that the files fulfill the following requirements:

https://www.ncbi.nlm.nih.gov/books/NBK53702/#gbankquickstart.can_you_give_me_stepbyst_1

https://www.ncbi.nlm.nih.gov/books/NBK53711/#gbankquickstart.what_do_you_mean_by_feat

rcedgar commented 3 years ago

"all coding regions (showing start and stop codons, if present)": Finding start and stop codons is difficult with CoV; this is the main reason I gave up trying to do automated annotation myself. Finding a known gene is relatively easy with a local protein alignment (say, BLAST or an HMM), but extending the alignment out to the start or stop is hard unless the genome is very close to something which is already very well annotated -- at most a few SNPs in the gene. This is further complicated by frameshifts in some CDSs due to ribosomal slippage (programmed -1 frameshifting). This is a very tricky genome to annotate.

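To make the extension problem concrete, here is a naive sketch of walking an in-frame alignment hit out to the flanking start and stop codons. It assumes a forward-strand hit given as 0-based, end-exclusive coordinates that does not already include the stop codon; `extend_to_orf` is a hypothetical helper, not part of BLAST or any annotation tool. Picking the most upstream ATG is only a heuristic, and the sketch ignores frameshifts and cleavage products entirely, which is exactly where it breaks down on CoV genomes.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def extend_to_orf(genome, hit_start, hit_end):
    """Extend an in-frame hit to the surrounding ORF.

    Returns (start, end, has_start, has_stop) in 0-based, end-exclusive
    coordinates. The flags report whether a start/stop codon was found.
    """
    genome = genome.upper()
    if (hit_end - hit_start) % 3 != 0:
        raise ValueError("hit length must be a multiple of 3")

    # Walk downstream codon by codon until the first in-frame stop codon.
    has_stop = False
    pos = hit_end
    while pos + 3 <= len(genome):
        if genome[pos:pos + 3] in STOP_CODONS:
            pos += 3  # include the stop codon in the ORF
            has_stop = True
            break
        pos += 3
    end = pos

    # Walk upstream until an in-frame stop codon or the sequence start,
    # remembering the most upstream ATG seen (longest-ORF heuristic).
    start = hit_start
    has_start = genome[hit_start:hit_start + 3] == START_CODON
    pos = hit_start - 3
    while pos >= 0:
        codon = genome[pos:pos + 3]
        if codon in STOP_CODONS:
            break
        if codon == START_CODON:
            start = pos
            has_start = True
        pos -= 3
    return start, end, has_start, has_stop


# Toy sequence: CC TAA ATG GCT GCT GCT TGA CC, with the hit on the middle GCT.
print(extend_to_orf("CCTAAATGGCTGCTGCTTGACC", 11, 14))
```

When `has_start` or `has_stop` comes back false, the CDS would have to be reported as partial rather than given guessed coordinates.
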
rcedgar commented 3 years ago

Figuring out CDS and gene symbols is also tricky because of the polyprotein which is cleaved into multiple mature proteins. In these cases, both the polyprotein before cleavage and the products after cleavage should be annotated (I think...). I'm assuming cleaved products lack start and/or stop codons; instead they have a cleavage site which should also be annotated. Not sure, I never fully figured out how these things are represented in GB records.
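
For reference, GenBank records generally represent this as one CDS feature for the whole polyprotein plus one mat_peptide feature per cleavage product, with the cleavage sites implied by the mat_peptide boundaries. Below is a rough sketch that writes such features in the tab-delimited five-column feature table layout; the helper name `feature_table`, the coordinates, and the product names are all made up for illustration.

```python
def feature_table(seq_id, features):
    """Render features as a five-column, tab-delimited feature table.

    Each feature is a dict with a "key", a list of 1-based inclusive
    "intervals", and an optional list of (qualifier, value) pairs.
    """
    lines = [f">Feature {seq_id}"]
    for feat in features:
        intervals = feat["intervals"]
        first_start, first_end = intervals[0]
        # The first interval carries the feature key in column 3.
        lines.append(f"{first_start}\t{first_end}\t{feat['key']}")
        # Further intervals of a joined feature go on their own lines.
        for start, end in intervals[1:]:
            lines.append(f"{start}\t{end}")
        # Qualifiers sit in columns 4 and 5.
        for name, value in feat.get("qualifiers", []):
            lines.append(f"\t\t\t{name}\t{value}")
    return "\n".join(lines) + "\n"


# Hypothetical coordinates: a two-segment polyprotein CDS and its first product.
features = [
    {"key": "CDS", "intervals": [(1, 100), (100, 400)],
     "qualifiers": [("product", "hypothetical polyprotein")]},
    {"key": "mat_peptide", "intervals": [(1, 60)],
     "qualifiers": [("product", "hypothetical mature peptide 1")]},
]
print(feature_table("SEQ1", features))
```

Note that a mat_peptide in the middle of the polyprotein has neither a start nor a stop codon of its own, which matches the assumption above.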

rcedgar commented 3 years ago

Here is a nice figure showing the complexity of an example CoV genome (SARS-CoV-1). Note the multiple levels of overlapping and nested ORFs and CDSs with a frameshift in one of the most important genes (RdRp). The figure shows ~14 cleavage sites which must be identified. When I saw stuff like this, I figured it would be impossible to automate annotation unless there was an existing CoV-specific tool. Now I suspect that such a tool is impossible anyway because there is too much variation in genome structure. CoV-2 has suspected leaky scanning towards the 3' end which was not present in CoV-1 AFAIK, to add one more complication to getting the translations.

https://viralzone.expasy.org/30
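
To illustrate why the frameshift matters for the translations, here is a small sketch that translates a CDS given as joined intervals, where a -1 slip is written as a second interval that starts on the last base of the first. It assumes the standard genetic code and 1-based inclusive coordinates; `translate_join` and the toy sequence are invented for the example and are not real CoV coordinates.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_join(genome, intervals):
    """Translate the concatenation of 1-based inclusive intervals.

    Codons containing ambiguous bases become 'X'; a trailing partial
    codon is dropped.
    """
    genome = genome.upper()
    coding = "".join(genome[start - 1:end] for start, end in intervals)
    usable = len(coding) - len(coding) % 3
    return "".join(
        CODON_TABLE.get(coding[i:i + 3], "X") for i in range(0, usable, 3)
    )


toy = "ATGGCTGGTAA"
print(translate_join(toy, [(1, 11)]))          # read straight through
print(translate_join(toy, [(1, 6), (6, 11)]))  # base 6 is read twice (-1 slip)
```

The two calls give different proteins from the same nucleotides, so a slip site placed even one base off changes everything downstream of it.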

ababaian commented 3 years ago

We would need to find protease cut sites specifically to go sub-ORF. I think ORF1a and ORF1b would be a good starting place in this respect, and protease cut sites, which should be conserved (I hope), would give us the info on RdRP etc. This data is not annotated well in GenBank records as it is, so examples are hard to find.

rcedgar commented 3 years ago

Is cleavage well enough understood to know how well conserved the sites are, or whether approximate sequence conservation necessarily implies cleavage? If the genome has diverged 1%, 2%, ... 5% ... 10% from a genome with a known cleavage site, at what point do you stop believing the site is conserved?

taltman commented 3 years ago

@rcedgar I've posted screenshots of the VADR annotation to Slack a few times. We're getting comparable annotations to the standard NCBI annotations for the SARS-CoV-2 reference sequence. It won't be as accurate as someone hand-annotating, but it's the best we can do in a high-throughput fashion.

taltman commented 3 years ago

Especially for distantly-related CoVs.

rcedgar commented 3 years ago

Sure, I saw the screenshots and it looks like the best-known genes are in roughly the right place, but I don't know if GenBank will find this acceptable. I don't see how any automated method can reliably meet the requirements per their documentation -- we don't know the start, stop, cleavage sites etc. etc. and we don't have reliable translations for any of the genes AFAICS.

taltman commented 3 years ago

This all might be academic because GenBank might show us the door.

VADR is a tool used by NCBI for evaluating viral genome submissions, so I think we have a better chance that it will produce the content that they are looking for. I don't think the annotation has to be flawless to pass muster; just demonstrate "due diligence" in trying to do a reasonable job.

ababaian / serratus

Specific annotation requirements for viral TPA submissions #187