ababaian commented 4 years ago

We are now generating novel CoV sequences that are of high quality (complete assembled genomes or near-complete genomes). High quality sequences like Frank (Fr4NK?) and Ginger need to be deposited into the public GenBank repository ASAP.

As we expand analysis/assembly, the volume of data we generate is going to explode, and we will need to automate this process.

1) Collect the best version of Frank and Ginger and initiate a GenBank submission for these sequences.
2) Create an inventory of the annotations and meta-data which we will need to attach (and how we can automate this process).
3) With our meta-data 'inventory' we can build an 'annotation' pipeline to generate specifically this data as a "deliverable" sequence. For our own use we can have more annotations, but we need a core set required by GenBank.

Examples of good CoV Annotation

Some questions I had on this.

Across distant CoV (i.e. Alpha vs. Delta), are all the proteins more or less conserved? If so, then we need a classifier tuned specifically for each of the ~25 ORFs in CoV.

taltman commented 4 years ago

Of course, we need to follow these best-practices:

https://www.nature.com/articles/nbt.4306

I think the fastest bioinformatic path is to use Prokka on virus mode, and generate all of the files necessary for GenBank submission.
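
A minimal sketch of how the Prokka call could be scripted per genome, assuming Prokka is installed and on the PATH. The `--kingdom Viruses`, `--compliant`, `--outdir` and `--prefix` options are Prokka's own; the function name, the `annotation` output directory and the `frank.fa` file name are hypothetical placeholders, not something we have agreed on.

```python
import shlex
from pathlib import Path


def prokka_command(fasta: str, sample: str, outroot: str = "annotation") -> list[str]:
    """Build a Prokka command line for one assembled genome (virus mode).

    `outroot` and the per-sample directory layout are assumptions.
    """
    return [
        "prokka",
        "--kingdom", "Viruses",   # virus annotation mode
        "--compliant",            # ask Prokka to enforce GenBank-style compliance
        "--outdir", str(Path(outroot) / sample),
        "--prefix", sample,
        fasta,
    ]


if __name__ == "__main__":
    cmd = prokka_command("frank.fa", "frank")  # hypothetical input file
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list keeps it easy to loop over many assemblies later without shell-quoting problems.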

If we want to have a quality annotation to go along with it (and I strongly advise for this), then we should look to virus-specific annotation resources, as posted on the Wiki.

If we want to knock it out of the park, then we should lean on Robert's MUSCLE(s) when it comes to HMM design and search, to build HMMs for all coronavirus conserved proteins, and use that to annotate the novel coronavirus genomes. Of course, to build the HMMs, we need to have a basic systematic annotation of the known coronavirus genomes.

Genome annotations can be improved and resubmitted to GenBank, but in reality, unless it is a funded model organism database, it doesn't happen too often. I'd say let's agree on a minimal quality level that we can all be happy with, and then get it done.

taltman commented 4 years ago

Where is the image from, BTW?

ababaian commented 4 years ago

Good old UCSC Genome Browser

ababaian commented 4 years ago

We'll hopefully have HMM models for Pol and Spike soon (as we needed them badly). Once that procedure is hmmered out, we can hand it off and have them made for all the other proteins.

rcedgar commented 4 years ago

Edit: Deleted premature / uninformed comment by me.

rcedgar commented 4 years ago

Edit: RTFM (me). The Prokka tool mentioned by @taltman looks at first glance to be capable of high-throughput annotation with output in GenBank format. My bad. Would be fantastic if someone could volunteer to set up Prokka for this...

ababaian commented 4 years ago

Meta-data Required

[ ] Primary Contact Information
[ ] Sequence Author List
[ ] Reference for publication (if avail) - Unpublished
[ ] Sequence Technology - Illumina
[ ] Assembled Sequence OR unassembled sequence
[ ] Assembly Program Name
[ ] Assembly Program Version
[ ] Assembly Name
[ ] Coverage
[ ] Molecule Type - genomic RNA
[ ] Topology - Linear
[ ] Is the sequence complete
[ ] Fasta File
[ ] Submission Category - TPA (see below)
[ ] TPA - Evidence
[ ] TPA - GenBank Accessions
[ ] Source - Host
[ ] Source - Note (SRA Accession)
[ ] Source - Strain/Isolate *
[ ] Source - Country *
[ ] Source - Collection Date *
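
One way the checklist above could become a machine-checkable 'inventory' is sketched below. This is only an illustration: the class name, field names and defaults are hypothetical, it covers only part of the checklist, and the defaults simply mirror the values already filled in above (Illumina, genomic RNA, linear).

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SubmissionMetadata:
    """Per-sequence meta-data record; field names are illustrative only."""
    contact: Optional[str] = None
    authors: Optional[str] = None
    sequencing_technology: Optional[str] = "Illumina"
    assembly_program: Optional[str] = None
    assembly_version: Optional[str] = None
    coverage: Optional[str] = None
    molecule_type: Optional[str] = "genomic RNA"
    topology: Optional[str] = "linear"
    host: Optional[str] = None
    sra_accession: Optional[str] = None
    isolate: Optional[str] = None
    country: Optional[str] = None
    collection_date: Optional[str] = None


def missing_fields(meta: SubmissionMetadata) -> list[str]:
    """Return the names of fields that are still empty."""
    return [f.name for f in fields(meta) if not getattr(meta, f.name)]


if __name__ == "__main__":
    # Placeholder values, not real submission data.
    meta = SubmissionMetadata(contact="someone@example.org", sra_accession="SRRXXXXXXX")
    print("Still missing:", ", ".join(missing_fields(meta)))
```

Running this per genome would give a quick report of which items still need to be filled in before a submission is attempted.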

Submission Category

The category of submission we fall under would be "TPA:Inferential" See: https://www.ncbi.nlm.nih.gov/genbank/tpa/

Annotation Features

Feature annotation follows INSDC
Use 5-column Feature Table

This method is more suitable for adding many different features on a single sequence or on multiple sequences. It uses the five-column, tab-delimited feature table format, which is also used in Sequin. Each table in the feature table file applies to only one sequence; if multiple sequences have been uploaded in your nucleotide fasta file, each corresponding table must be labeled with that sequence's Sequence ID. Multiple tables can be uploaded in a single file.
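
As a rough sketch of what a pipeline would need to emit, the snippet below writes one table in the five-column layout: a `>Feature SeqID` header, a `start<TAB>stop<TAB>feature key` line per feature, and qualifier lines with three leading tabs followed by the qualifier name and value. The `Feature` type, the function name, the sequence ID and the coordinates are all made up for illustration; they are not real annotations of Frank or Ginger.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    start: int
    stop: int
    key: str            # INSDC feature key, e.g. "CDS"
    qualifiers: dict    # qualifier name -> value


def feature_table(seq_id: str, features: list) -> str:
    """Render one five-column feature table for a single sequence."""
    lines = [f">Feature {seq_id}"]
    for feat in features:
        # Columns 1-3: start, stop, feature key
        lines.append(f"{feat.start}\t{feat.stop}\t{feat.key}")
        for name, value in feat.qualifiers.items():
            # Columns 4-5: qualifier name and value, after three empty columns
            lines.append(f"\t\t\t{name}\t{value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder sequence ID and coordinates.
    example = [Feature(100, 400, "CDS", {"product": "hypothetical protein"})]
    print(feature_table("seq1", example), end="")
```

Tables for several sequences can be concatenated into one file, as long as each `>Feature` line carries the matching Sequence ID from the fasta file.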

We can officially submit sequences without annotation, so there is no lower requirement. We can do a first-pass annotation, add the obvious/easy meta-data, and note entries where we are not satisfied and that will require better annotation. Improving those entries is likely to be manual and time-intensive work, so I suggest that if this ends up LWIA we opt to 'crowd source' it to virologists qualified to do so. We should still aim for a good high-throughput annotation pipeline.

taltman commented 4 years ago

Note, the TPA page says:

Note: It is required that all new annotations will be experimentally determined to exist, directly or indirectly.

From their FAQ page:

Computational studies on their own do not constitute experimental evidence and must be accompanied by biological experiments that support the new annotation.

Our workflow is complex, and doesn't fit any of their neat bins exactly. Will reach out to my contacts at NCBI for guidance.

rcedgar commented 4 years ago

Our annotations will be TPA inferential: "A database of sequences annotated by inference, where the source molecule or its product(s) have not been the subject of direct experimentation."

https://www.ncbi.nlm.nih.gov/genbank/tpa-inf/

taltman commented 4 years ago

Emails sent, will update as I get more guidance.

taltman commented 4 years ago

I have received an initial email from the GenBank team. They have asserted the following:

The new annotation/assembly must be supported by experimental or inferential evidence. Sequence similarity, computational, or bioinformatic studies alone are not sufficient as supporting evidence.

I'm not clear what is meant by inferential evidence that is neither experimental nor computational. They pointed to the following webpage, which gives a bunch of examples of "TPA:inferential" scenarios, but I'm still unclear about the actual definition:

https://www.ncbi.nlm.nih.gov/genbank/tpa-inf/

I've sent a quick reply asking for more description of what constitutes inferential evidence. My rough idea is that it involves indirect experimental evidence for a sequence or the annotation of the sequence.

taltman commented 4 years ago

Upon further reading of: https://www.ncbi.nlm.nih.gov/genbank/tpafaq/

What is the difference between TPA:experimental and TPA:inferential? Sequence records in the TPA:experimental database are supported directly by experimental evidence while sequence data and annotation in the TPA:inferential database is indirectly supported by experimental evidence.

So however you slice it, our sequences and annotations seem to need experimental evidence of some flavor in order to submit these TPA:inferential submissions to GenBank.

ababaian commented 3 years ago

This issue encompasses a set of submission issues which can be merged here to close this issue.

[ ] #186
[ ] #187
[ ] #188
[ ] #189
[ ] #190
[ ] #191

ababaian / serratus

Create a checklist for GenBank submission -- begin to automate for high throughput #106
