Deliver uniform annotation of reference coronavirus genomes

taltman commented 4 years ago

To better evaluate read hits to the reference genomes, we need to have a quality, consistent annotation of those genomes, to see where the reads are hitting, and whether they are significant.

taltman commented 4 years ago

@ababaian , could you please help me describe the problem here, and the list of requirements that would make the Bowtie2 hit analysis easier?

rcedgar commented 4 years ago

It would be very helpful to know the coordinates of biologically important segments such as "spike protein" and "polymerase" in every reference sequence. That way, we can automate detection and visualization of coverage on a gene-by-gene basis, and it would also enable biologically informative analyses of the Cov family itself such as gene gain and loss which AFAIK is not published though I haven't tried very hard to find it -- reference solicited! I've taken a stab at cleaning up the annotations, and it looks to be a lot of work because annotations in the Genbank files have variable quality, nomenclature etc. Looks to me like a lot of manual effort would be needed to clean this up. If in fact this work has not been done already, it would be a useful contribution / preprint / paper in its own right and doesn't require much / any understanding of Serratus which can be a bit daunting to a newcomer.

Edit: to be clear, the annotations should be harmonized so that there is a single system of identifiers and meta-data so that "spike protein" is always "spike protiein" rather than ORF1, unidentified protein etc.

rcedgar commented 4 years ago

Here's an idea for generating uniform annotations with less effort. Build HMMs from the well-annotated proteins and other features in the Genbank records. Use these to find homologs in the other reference sequences. Use these to make expanded MSAs and better HMMs. Might take a couple of iterations, but this is the kind of thing you can do in a few hours. Do this with both nt and aa sequences. The HMMs can be used to annotate contigs directly without needing a reference alignment.

ababaian commented 4 years ago

Sequence Annotation Fields

I'd like to have all the coronavirus metadata in a single, uniform, flat TSV file. This need not contain the sequence data directly, only a uniform and clean meta-data describing each sequence. (see also: #45 and Genbank Parsing for a start to this and the near complete genbank parser https://github.com/ababaian/serratus/pull/61).

We'll then want this metadata generated for every 'new genome' that we find.

Desired Fields

For every sequence in cov0

[ ] Accession ID
[ ] CoV Species Taxonomy ID
[ ] CoV Species description
[ ] Host Species Taxonomy ID
[ ] Inferred Host Species Taxonomy ID. When a host species is not known, use an 'inferred' host species from the closest known virus.
[ ] Host / Inferred Host Description
[ ] Whole genome or partial
[ ] Sequence length
[ ] Sequence coverage (how many NNNN are there)
[ ] open reading frames in sequence (comma seperated). Process with a pan-CoV and uniform nomenclature.
[ ] Coordinates of each ORF
[ ] Is this a blacklisted sequence?

Motivation: Why do we need this right now? We are currently using cov2m which contains both full-length genomes and sequence fragments from highly diverged CoV (i.e. cloned pol gene). We want to allow the 'summarizer' to bin hits onto a theoretical/rough "pan-genome" such that a divergent virus like Frankie, when the hits are distributed across 10+ genomes and fragments, all of those hits can be uniformly placed onto the 'pan-genome' or the 'pan-pol' or 'pan-spike'.

This is similar to the 'liftOver` operation between genomes, but in this instance we want all genomes/fragments to be 'liftOver' to a uniform pan-genome.

With this meta-data, we can then align proteins in sequences/fragments and create a true 'pan-genome'. For instance, if we only have the Spike gene in a fragment, this belongs to coordinates ~21,000-23,000 on the pan-genome.

edit: Then of course for the assembly/contigs that are generated in new libraries. We can quickly infer meta-data about what should be in a contig, by knowing approximately where it sits on the pan-CoV genome.

rcedgar commented 4 years ago

I'm assigning myself to develop a system for annotating a Cov nt sequence with coordinates of functional domains (ORFs, regulatory sequences etc.). Something like PFAM HMMs I'm thinking.

rcedgar commented 4 years ago

@ababaian said "this belongs to coordinates ~21,000-23,000 on the pan-genome" This kind of system works well for SSU genes and immunoglobulins, would be ideal here. It may not be possible, or a good idea, if there is gene gain or loss, or mosaic / chimeric viruses in the family. First step is to parse into functional units, then ask if domain organization is well conserved.

rcedgar commented 4 years ago

Related note: naively projecting onto a pan-genome with 32 bins (summarizer strategy) is robust against the complications mentioned in my last comment.

taltman commented 4 years ago

@rcedgar If you want to take this on, awesome! I might suggest some tools here: Prokka, for performing virus-oriented gene calling and a basic annotation using UniProt EggNOG-Mapper, which has viral HMMs and DIAMOND protein databases built already. Also, check out these resources from Steven Hallam: https://github.com/ababaian/serratus/wiki/Serratus-Annotation

taltman commented 4 years ago

We now have VADR for annotating CoVs. Do we still want to annotate the reference CoV genomes?

rcedgar commented 4 years ago

If VADR is the tool we're going to use in production, then would be great validation to provide VADR annotations of the ~800 full-length Cov genomes currently in GenBank. For the RefSeqs especially, this will show the strengths and weaknesses of the method. Edit -- Actually I guess not the RefSeqs because they are used as references by VADR IIRC.

ababaian / serratus

Deliver uniform annotation of reference coronavirus genomes #101

Sequence Annotation Fields

Desired Fields