Create model of Cov leader sequence to support annotation

rcedgar commented 4 years ago

Per @ababaian "At the 5' end we should be able to identify a 'Leader Sequence', possibly via a nucleotide HMM of the 5' UTR from known CoV complete genomes. This would be up to the first ATG or the first 100 nucleotides.... I don't know if we need this for GB or not, but for our own metrics this will be important as we need to classify 'complete' versus new [sic, should be 'partial'?]".

rcedgar commented 4 years ago

@ababaian The 5' ends of full-length genomes align with terminal gaps so a leader model looks to be impossible. Below is an alignment of the start of a few RefSeqs to illustrate the problem. Shall I close this issue?

NC_014470       ---------------------T TTAAA ATCTGTGTAG 
NC_034972       ------AGAAACAAGTAGTGTT TTAAA AACCTTCAAA 
NC_009988       -------------------GAC TTAAA GATATAATCT 
NC_010437       -------------------GAC TTAAA GATATAATCC

ababaian commented 4 years ago

Well we might not have absolute certainty, but this information is still better then nothing. Especially if we have the variable terminus from many genomes. It's not perfect, but it's better then no data. Unless I'm misunderstanding something.

rcedgar commented 4 years ago

Ok, then I think @taltman's annotation pipeline should blastn contigs against the first ~100nt of all known full-length sequences as a check it is approximately complete, agreed? If so I'll post the sequences to S3 and open an issue for Tomer.

rcedgar commented 4 years ago

What do the reads look like when you run off the end of a molecule -- presumably you get into the opposite adapter & in principle, the assembler could track this? Here I'm over my head with the biochemistry & algorithms, maybe we should check with Anton if this can be used as a test.

taltman commented 4 years ago

I've put in a query to Eric regarding VADR's ability to do so: https://github.com/nawrockie/vadr/issues/16

taltman commented 4 years ago

@nawrockie has pointed out that there are two "clans" in Rfam for the 5' and 3' UTRs, across different sub-taxa of CoV. Once I figure out how to download the HMM or CM models, I can add it to the pipeline.

taltman commented 4 years ago

https://xfam.wordpress.com/2020/04/27/rfam-coronavirus-release/

rcedgar commented 4 years ago

I think blastn to a database with the first ~100nt of the known full-length genomes will work better than RFAM. I will post the reference to s3 later today.

rcedgar commented 4 years ago

@taltman UTR sequences are utr5_id97.fa and utr3_id97.fa in s3://serratus-public/rce/utr/

They are technically not UTRs because I'm not sure where the CDS boundaries are, but the RFAM HMMs won't know either, The start of utr5 is the start of a genome and the end of utr3 is the end of a genome, so using these sequences will get us closer to the 5' and 3' terminals.

taltman commented 4 years ago

Hi @rcedgar ,

I'm not following your rationale. Why do you think that this blastn approach will be more sensitive than the handful of 5' and 3' UTR models made by experts at the Rfam project? My understanding is that HMMs are more sensitive than sequence homology search like Blast, and stochastic context free grammar covariance models are more sensitive than HMMs for profiles where the secondary structure will be more conserved than the primary sequence (as in the case of the 5' and 3' UTRs). Are you worried about local vs. global matching?

It will only take a minute to scan the entire CoV genome for these CM profiles, so I don't see a need to reinvent the wheel. Please let me know if I'm overlooking something.

rcedgar commented 4 years ago

I am worried about local vs. global matching. The RFAM HMMs are not gobal models of UTRs, they are local models of the most conserved sequence in the UTRs. There are two different kinds of sensitivity to consider here: (a) sensitivity to detecting the presence of a UTR in a highly diverged sequence, and (b) how many of the UTR bases are identified. HMMs are much better than blastn at (a) but will identify only a small fraction of the UTR bases, while blastn is much better at (b) when the divergence is low. Almost all of our assemblies have high identity (>90% to a known genome) and in these cases blastn will identify more of the UTR bases than the HMMs. In practice, it probably doesn't matter much, we can't identify the true 3' or 5' end either way, so I would be fine using either method; I figured it was probably just as easy to use blastn in which case it would be better.

rcedgar commented 4 years ago

Solved by blastn and/or RFAM HMMs, closing issue.

ababaian / serratus

Create model of Cov leader sequence to support annotation #176