ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Detect mRNA leaders (and trailers?) #147

Closed rcedgar closed 4 years ago

rcedgar commented 4 years ago

@asl's team drew our attention to the issue of mRNA leaders. Transcripts of active Cov viruses look like this:

image

Fig 1 in this paper. The leader is the pink rectangle at the beginning of each type of transcript.

See this preprint for a comprehensive analysis of the Cov-2 transcriptome showing a similar pattern.

@ababaian says this is very important because it indicates an active infection as opposed to (what? a passive virus which isn't making transcripts so we wouldn't see it at all? I don't understand this part).

As I understand it, the leader sequence originates at the start of the Cov genome, and is spliced onto the beginning of ~7 different types of mRNA, each of which starts at the beginning of a CDS and terminates at the 3' end of the genome. The leader sequence is ~70nt long.

Leaders can cause loops in the assembly graph. This suggests that active viruses can be detected from the graph topology.

Another possible method would be to look at coverage and soft-clipping in the bowtie2 alignments. The leader sequence should have higher coverage than the start of the CDS because the leader appears in all mRNAs. The alignments should be soft-clipped at the splice junction. If I'm understanding this well enough, then I believe leaders should induce a strong signal with a soft-clip boundary and drop in coverage.
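To make the coverage and soft-clip idea concrete, here is a minimal sketch, not a tested pipeline. It assumes alignments have already been pulled out of the SAM output as `(pos, cigar)` pairs (1-based leftmost position and CIGAR string), and that bowtie2 was run in `--local` mode, since end-to-end mode does not soft-clip. The function names, the `min_clip` threshold and the toy reads are all hypothetical.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR ops that advance the reference position


def parse_cigar(cigar):
    """Split a CIGAR string into (length, op) pairs."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def clip_boundaries(alignments, min_clip=8):
    """Tally reference positions where soft clips start or end.

    Keys are (side, ref_pos): 'left' means clipped bases precede the first
    aligned base at ref_pos, 'right' means clipped bases follow the last
    aligned base at ref_pos. min_clip is an arbitrary illustrative cutoff.
    """
    tally = Counter()
    for pos, cigar in alignments:
        ops = [(n, op) for n, op in parse_cigar(cigar) if op != "H"]
        ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
        if ref_len == 0:
            continue  # nothing aligned to the reference
        if ops[0][1] == "S" and ops[0][0] >= min_clip:
            tally[("left", pos)] += 1
        if ops[-1][1] == "S" and ops[-1][0] >= min_clip:
            tally[("right", pos + ref_len - 1)] += 1
    return tally


def coverage(alignments, ref_length):
    """Per-base depth, 1-based (index 0 is unused)."""
    depth = [0] * (ref_length + 1)
    for pos, cigar in alignments:
        ref_pos = pos
        for n, op in parse_cigar(cigar):
            if op in "M=X":
                for p in range(ref_pos, min(ref_pos + n, ref_length + 1)):
                    depth[p] += 1
            if op in REF_CONSUMING:
                ref_pos += n
    return depth


# Toy reads (made up): three leader-containing reads clipped at position 70,
# plus two reads that align without clipping.
toy = [(1, "70M30S"), (1, "70M30S"), (5, "66M34S"), (1, "100M"), (60, "100M")]
tally = clip_boundaries(toy)
depth = coverage(toy, 200)
```

A pile-up of `("right", p)` clips at one position together with a drop between `depth[p]` and `depth[p + 1]` would be the signal described above. Real data would need thresholds tuned on the positive and negative test cases.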

It seems an automated method for detecting leaders would be useful. If so, the next step should be to identify a few positive and negative test cases as a benchmark for algorithm development.

Are trailers biologically informative or problematic (e.g. for assembly)? At first glance it doesn't look like it for Covs, but perhaps for other viruses?

ababaian commented 4 years ago

In the case of CoV the trailer is the polyA tail. It's important in the sense that by reaching it we can be reasonably certain we have reached the 3' end of the virus. The same goes for the leader and the 5' end.

Thus a complete genome for our search can be broadly defined as Leader + ORF1ab + Spike + Nucleoprotein + polyA.
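As a rough illustration of the 3' end check only, a contig can be tested for a terminal polyA run. This is a sketch under assumptions: the function names and the `min_len` cutoff are hypothetical, no mismatches are tolerated, and a leading polyT run is also accepted on the assumption that the contig may have been assembled as the reverse complement.

```python
def polya_tail_length(contig):
    """Length of the uninterrupted run of A at the 3' end of the contig."""
    seq = contig.upper()
    return len(seq) - len(seq.rstrip("A"))


def polyt_head_length(contig):
    """Length of the uninterrupted run of T at the 5' end (reverse-complement case)."""
    seq = contig.upper()
    return len(seq) - len(seq.lstrip("T"))


def reaches_3prime_end(contig, min_len=10):
    """True if the contig ends in polyA or starts with polyT of at least min_len bases.

    min_len=10 is an arbitrary illustrative cutoff, not a validated value.
    """
    return (polya_tail_length(contig) >= min_len
            or polyt_head_length(contig) >= min_len)
```

The leader, ORF1ab, Spike and Nucleoprotein parts of the definition would need alignment or annotation against a reference and are not covered here.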

The 'non-biologically active' version of this (and this is not a perfect analysis by any means) would be the case in which we detect an RNA viral genome in a viral particle but there is no productive infection (i.e. it's in the gut of a pig, but it is not entering the pig epithelial cells; or it enters the cells but is not able to undergo transcription due to host incompatibility, etc.). Not perfect, but it does give us a layer of information which other approaches (i.e. k-mer based analyses) will certainly miss.

rcedgar commented 4 years ago

Apparently there are also several short post-transcriptional RNA modifications in the mRNAs. So our assemblies would be more accurately described as a consensus mRNA rather than a genome. From my outsider perspective I would say this is close enough to a genome sequence, but possibly GenBank will care?

rcedgar commented 4 years ago

Apparently the assemblers generally don't insert the leader sequence except at the 5' end, so this is not an issue in practice. Closing issue.