Closed taltman closed 3 years ago
@ababaian , could you please help me describe the problem here, and the list of requirements that would make the Bowtie2 hit analysis easier?
It would be very helpful to know the coordinates of biologically important segments such as "spike protein" and "polymerase" in every reference sequence. That way, we can automate detection and visualization of coverage on a gene-by-gene basis, and it would also enable biologically informative analyses of the Cov family itself such as gene gain and loss which AFAIK is not published though I haven't tried very hard to find it -- reference solicited! I've taken a stab at cleaning up the annotations, and it looks to be a lot of work because annotations in the Genbank files have variable quality, nomenclature etc. Looks to me like a lot of manual effort would be needed to clean this up. If in fact this work has not been done already, it would be a useful contribution / preprint / paper in its own right and doesn't require much / any understanding of Serratus which can be a bit daunting to a newcomer.
Edit: to be clear, the annotations should be harmonized so that there is a single system of identifiers and meta-data so that "spike protein" is always "spike protiein" rather than ORF1, unidentified protein etc.
Here's an idea for generating uniform annotations with less effort. Build HMMs from the well-annotated proteins and other features in the Genbank records. Use these to find homologs in the other reference sequences. Use these to make expanded MSAs and better HMMs. Might take a couple of iterations, but this is the kind of thing you can do in a few hours. Do this with both nt and aa sequences. The HMMs can be used to annotate contigs directly without needing a reference alignment.
I'd like to have all the coronavirus metadata in a single, uniform, flat TSV file. This need not contain the sequence data directly, only a uniform and clean meta-data describing each sequence. (see also: #45 and Genbank Parsing for a start to this and the near complete genbank parser https://github.com/ababaian/serratus/pull/61).
We'll then want this metadata generated for every 'new genome' that we find.
For every sequence in cov0
NNNN
are there)Motivation: Why do we need this right now? We are currently using cov2m
which contains both full-length genomes and sequence fragments from highly diverged CoV (i.e. cloned pol gene). We want to allow the 'summarizer' to bin hits onto a theoretical/rough "pan-genome" such that a divergent virus like Frankie, when the hits are distributed across 10+ genomes and fragments, all of those hits can be uniformly placed onto the 'pan-genome' or the 'pan-pol' or 'pan-spike'.
This is similar to the 'liftOver` operation between genomes, but in this instance we want all genomes/fragments to be 'liftOver' to a uniform pan-genome.
With this meta-data, we can then align proteins in sequences/fragments and create a true 'pan-genome'. For instance, if we only have the Spike gene in a fragment, this belongs to coordinates ~21,000-23,000 on the pan-genome.
edit: Then of course for the assembly/contigs that are generated in new libraries. We can quickly infer meta-data about what should be in a contig, by knowing approximately where it sits on the pan-CoV genome.
I'm assigning myself to develop a system for annotating a Cov nt sequence with coordinates of functional domains (ORFs, regulatory sequences etc.). Something like PFAM HMMs I'm thinking.
@ababaian said "this belongs to coordinates ~21,000-23,000 on the pan-genome" This kind of system works well for SSU genes and immunoglobulins, would be ideal here. It may not be possible, or a good idea, if there is gene gain or loss, or mosaic / chimeric viruses in the family. First step is to parse into functional units, then ask if domain organization is well conserved.
Related note: naively projecting onto a pan-genome with 32 bins (summarizer strategy) is robust against the complications mentioned in my last comment.
@rcedgar If you want to take this on, awesome! I might suggest some tools here: Prokka, for performing virus-oriented gene calling and a basic annotation using UniProt EggNOG-Mapper, which has viral HMMs and DIAMOND protein databases built already. Also, check out these resources from Steven Hallam: https://github.com/ababaian/serratus/wiki/Serratus-Annotation
We now have VADR for annotating CoVs. Do we still want to annotate the reference CoV genomes?
If VADR is the tool we're going to use in production, then would be great validation to provide VADR annotations of the ~800 full-length Cov genomes currently in GenBank. For the RefSeqs especially, this will show the strengths and weaknesses of the method. Edit -- Actually I guess not the RefSeqs because they are used as references by VADR IIRC.
To better evaluate read hits to the reference genomes, we need to have a quality, consistent annotation of those genomes, to see where the reads are hitting, and whether they are significant.