Open ababaian opened 4 years ago
Of course, we need to follow these best practices:
https://www.nature.com/articles/nbt.4306
I think the fastest bioinformatic path is to use Prokka in virus mode, and generate all of the files necessary for GenBank submission.
If we want to have a quality annotation to go along with it (and I strongly advise this), then we should look to virus-specific annotation resources, as posted on the Wiki.
If we want to knock it out of the park, then we should lean on Robert's MUSCLE(s) when it comes to HMM design and search, to build HMMs for all coronavirus conserved proteins, and use that to annotate the novel coronavirus genomes. Of course, to build the HMMs, we need to have a basic systematic annotation of the known coronavirus genomes.
Genome annotations can be improved and resubmitted to GenBank, but in reality, unless it is a funded model organism database, it doesn't happen too often. I'd say let's agree on a minimal quality level that we can all be happy with, and then get it done.
Where is the image from, BTW?
We'll have HMM models for Pol and Spike hopefully soon (as we needed them badly), once that procedure is hmmered out we can hand it off and have them made for all the other proteins.
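For whoever picks up the HMM hand-off, here is a rough Python sketch of the build-and-search loop, assuming HMMER3 (`hmmbuild`, `hmmsearch`) is on the PATH and that a protein alignment already exists for each conserved protein. The file names and the E-value cutoff are placeholders I made up, not anything the project has settled on, and the search target has to be protein sequences (e.g. translated ORFs) because these are protein HMMs.

```python
import shlex


def hmmbuild_command(alignment, hmm_out, name=None):
    """Argument list for: hmmbuild [-n name] <hmmfile_out> <msafile>."""
    cmd = ["hmmbuild"]
    if name:
        cmd += ["-n", name]
    cmd += [hmm_out, alignment]
    return cmd


def hmmsearch_command(hmm, seqdb, tblout, evalue=1e-5):
    """Argument list for: hmmsearch --tblout <f> -E <x> <hmmfile> <seqdb>.

    The 1e-5 default is an arbitrary placeholder threshold.
    """
    return ["hmmsearch", "--tblout", tblout, "-E", str(evalue), hmm, seqdb]


def parse_tblout(text):
    """Parse hmmsearch --tblout text into a list of hit dicts.

    Only the leading whitespace-separated columns are used:
    target name (1st), query name (3rd), full-sequence E-value (5th)
    and full-sequence score (6th). Lines starting with '#' are comments.
    """
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        hits.append(
            {
                "target": fields[0],
                "query": fields[2],
                "evalue": float(fields[4]),
                "score": float(fields[5]),
            }
        )
    return hits


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(shlex.join(hmmbuild_command("spike.afa", "spike.hmm", name="Spike")))
    print(shlex.join(hmmsearch_command("spike.hmm", "orfs.faa", "spike_hits.tbl")))
```

The commands are only built and printed here; pass the lists to `subprocess.run(cmd, check=True)` to actually run them. The hits from `parse_tblout` could then be turned into CDS features for the novel genomes.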
Edit: Deleted premature / uninformed comment by me.
Edit: RTFM (me). The Prokka tool mentioned by @taltman looks at first glance to be capable of high-throughput annotation with output in GenBank format. My bad. Would be fantastic if someone could volunteer to set up Prokka for this...
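As a starting point for that setup, a minimal Python sketch of a Prokka call in virus mode might look like the following. It assumes Prokka is installed and on the PATH; the input FASTA, output directory, prefix and locus tag are hypothetical names, and whether `--compliant` is the right choice for our submissions is something to confirm rather than a decision made here.

```python
import shlex


def prokka_command(fasta, outdir, prefix, locustag=None, compliant=True):
    """Build the argument list for one Prokka run in virus mode.

    --kingdom Viruses selects the viral annotation mode.
    --compliant asks Prokka to enforce GenBank/ENA/DDBJ compliance.
    """
    cmd = ["prokka", "--kingdom", "Viruses", "--outdir", outdir, "--prefix", prefix]
    if compliant:
        cmd.append("--compliant")
    if locustag:
        cmd += ["--locustag", locustag]
    cmd.append(fasta)  # the contigs FASTA is the positional argument
    return cmd


if __name__ == "__main__":
    # Hypothetical paths and locus tag, for illustration only.
    cmd = prokka_command("frank.fasta", "annotation/frank", "frank", locustag="FRNK")
    print(shlex.join(cmd))
```

Running the resulting command (for example with `subprocess.run(cmd, check=True)`) should leave GenBank (`.gbk`), feature table (`.tbl`) and Sequin (`.sqn`) files in the output directory, among others. Looping this over a directory of assemblies would give the high-throughput first pass.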
The category of submission we fall under would be "TPA:Inferential". See: https://www.ncbi.nlm.nih.gov/genbank/tpa/
This method is more suitable for: adding many different features on a single sequence or on multiple sequences
- uses the five-column, tab-delimited feature table format, which is also used in Sequin
- each table in the feature table file applies to only one sequence; if multiple sequences have been uploaded in your nucleotide fasta file, each corresponding table must be labeled with that sequence's Sequence ID
- multiple tables can be uploaded in a single file
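To make the five-column format concrete, here is a small Python sketch that writes one table per sequence and concatenates several tables into a single file, as described above. The function names, sequence IDs, coordinates and product names are invented for illustration; real values would come from whatever annotation pipeline we settle on.

```python
def format_feature_table(seq_id, features):
    """Render one five-column, tab-delimited feature table for one sequence.

    Each feature is a dict with:
      key        - feature key, e.g. "gene" or "CDS"
      intervals  - list of (start, stop) tuples, 1-based
      qualifiers - optional list of (qualifier, value) tuples
    """
    lines = [f">Feature {seq_id}"]  # table header carries the Sequence ID
    for feat in features:
        intervals = feat["intervals"]
        start, stop = intervals[0]
        # Columns 1-3: start, stop, feature key (key only on the first interval).
        lines.append(f"{start}\t{stop}\t{feat['key']}")
        for start, stop in intervals[1:]:
            lines.append(f"{start}\t{stop}")
        # Columns 4-5: qualifier key and value, after three empty columns.
        for qualifier, value in feat.get("qualifiers", []):
            lines.append(f"\t\t\t{qualifier}\t{value}")
    return "\n".join(lines) + "\n"


def format_feature_file(tables):
    """Concatenate tables for several sequences into one file body.

    `tables` maps each Sequence ID to its list of features.
    """
    return "".join(format_feature_table(sid, feats) for sid, feats in tables.items())


if __name__ == "__main__":
    # Placeholder IDs, coordinates and product names.
    example = {
        "Seq1": [
            {
                "key": "CDS",
                "intervals": [(100, 900)],
                "qualifiers": [("product", "hypothetical protein")],
            }
        ],
        "Seq2": [{"key": "gene", "intervals": [(50, 400)], "qualifiers": [("gene", "S")]}],
    }
    print(format_feature_file(example), end="")
```

The Sequence IDs in the `>Feature` headers have to match the IDs in the nucleotide FASTA that is uploaded alongside the table file.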
We can officially submit sequences without annotation, so there is no lower requirement. We can do a first pass annotation and add the obvious/easy meta-data and note entries where we are not satisfied and that will require better annotation. This is likely to be manual and time-intensive work so I suggest if this ends up LWIA we opt to 'crowd source' it to virologists qualified to do so. We should still aim for a good high-throughput annotation pipeline.
Note, the TPA page says:
Note: It is required that all new annotations will be experimentally determined to exist, directly or indirectly.
From their FAQ page:
Computational studies on their own do not constitute experimental evidence and must be accompanied by biological experiments that support the new annotation.
Our workflow is complex, and doesn't fit any of their neat bins exactly. Will reach out to my contacts at NCBI for guidance.
Our annotations will be TPA inferential: "A database of sequences annotated by inference, where the source molecule or its product(s) have not been the subject of direct experimentation."
Emails sent, will update as I get more guidance.
I have received an initial email from the GenBank team. They have asserted the following:
The new annotation/assembly must be supported by experimental or inferential evidence. Sequence similarity, computational, or bioinformatic studies alone are not sufficient as supporting evidence.
I'm not clear what is meant by inferential evidence that is neither experimental nor computational. They point to the following webpage, which gives a bunch of examples of "TPA:inferential" scenarios, but I'm still unclear about the actual definition:
https://www.ncbi.nlm.nih.gov/genbank/tpa-inf/
I've sent a quick reply asking for more description of what constitutes inferential evidence. My rough idea is that it involves indirect experimental evidence for a sequence or the annotation of the sequence.
Upon further reading of: https://www.ncbi.nlm.nih.gov/genbank/tpafaq/
What is the difference between TPA:experimental and TPA:inferential? Sequence records in the TPA:experimental database are supported directly by experimental evidence while sequence data and annotation in the TPA:inferential database is indirectly supported by experimental evidence.
So however you slice it, our sequences and annotations seem to need experimental evidence of some flavor in order to be submitted to GenBank as TPA:inferential records.
This issue encompasses a set of submission issues which can be merged here to close this issue.
We are now generating novel CoV sequences that are of high quality (complete assembled genomes or near-complete genomes). High quality sequences like Frank (Fr4NK?) and Ginger need to be deposited into the public GenBank repository ASAP.
As we expand analysis/assembly the volume of data we generate is going to explode and we will need to automate this process.
1) Collect the best version of Frank and Ginger and initiate a GenBank submission for these sequences.
2) Create an inventory of the annotations and meta-data which we will need to attach (and how we can automate this process).
3) With our meta-data 'inventory' we can build an 'annotation' pipeline to generate specifically this data as a "deliverable" sequence. For our own use we can have more annotations, but we need a core set required by GenBank.
Examples of good CoV Annotation
Some questions I had on this.