ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Align poly-A tail reads against assembly to identify putative 3' of genome #222

Closed taltman closed 3 years ago

taltman commented 3 years ago

If we could clearly identify the start and the end of a CoV genome, we could label a given assembly as complete or partial with more confidence, and it would also help disentangle very complicated CoV assembly graphs. At first, we thought that we could use 5' & 3' UTR models, but these regions lack adequate conservation. While the rest of the 5' end is full of well-defined and predictable protein domains, the 3' end of the genome has a more complicated arrangement of smaller CDSs in varying order.

@asl and I discussed the following procedure:

  1. Identify reads with a poly-A tail
  2. Trim off the poly-A tail portion of the read to avoid non-specific alignment
  3. Align the reads against the assembly
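Steps 1 and 2 could be sketched roughly as below. This is only an illustration, not anything in the serratus codebase: the function names, the `min_tail` threshold of 10 A's, and the `min_body` cutoff of 20 bases are all assumptions, and it only looks for a run of A's at the 3' end of the read as sequenced (reverse-complemented reads with a leading poly-T are ignored).

```python
import re

# Matches an uninterrupted run of A's at the very end of a read.
POLYA_RE = re.compile(r"A+$")

def polya_cut(seq, min_tail=10):
    """Return the index where the poly-A tail starts, or None if the
    read does not end in at least `min_tail` A's (assumed threshold)."""
    m = POLYA_RE.search(seq.upper())
    if m is None or m.end() - m.start() < min_tail:
        return None
    return m.start()

def polya_reads(fastq_lines, min_tail=10, min_body=20):
    """Yield (read_id, trimmed_seq, trimmed_qual) for reads with a poly-A tail.

    `fastq_lines` is any iterable of lines in plain 4-line FASTQ layout.
    Reads with fewer than `min_body` bases left after trimming are
    dropped, since they would align non-specifically.
    """
    lines = [ln.rstrip("\n") for ln in fastq_lines]
    for i in range(0, len(lines) - 3, 4):
        header, seq, _, qual = lines[i:i + 4]
        cut = polya_cut(seq, min_tail)
        if cut is None or cut < min_body:
            continue
        read_id = header[1:].split()[0]
        yield read_id, seq[:cut], qual[:cut]
```

The trimmed records can then be written back out as FASTQ and handed to whichever aligner is used for step 3.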

When clusters of these reads all align to the end of the same contig in an assembly, we can infer that this contig end adjoins the poly-A tail. This could tell us where the 3' end of the genome is.
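To make the "clusters at the end of a contig" idea concrete, here is a minimal sketch that counts trimmed reads whose tail-side end lands near a contig end. Everything in it is an assumption for illustration: alignments are taken to be already parsed into `(contig, start, end, strand)` tuples with 0-based, half-open coordinates, and the `window` and `min_reads` thresholds are arbitrary. Because a contig may be assembled in either orientation, forward-strand hits are checked against the right end and reverse-strand hits against the left end.

```python
from collections import Counter

def end_supported_contigs(alignments, contig_lengths, window=20, min_reads=3):
    """Return {(contig, side): read_count} for contig ends supported by
    at least `min_reads` poly-A-trimmed reads.

    For a '+' alignment, the trimmed tail sat just after `end`, so the
    read supports the right end if `end` is within `window` of the
    contig length. For a '-' alignment, the tail sat just before
    `start`, so the read supports the left end if `start <= window`.
    """
    support = Counter()
    for contig, start, end, strand in alignments:
        if strand == "+" and contig_lengths[contig] - end <= window:
            support[(contig, "right")] += 1
        elif strand == "-" and start <= window:
            support[(contig, "left")] += 1
    return {key: n for key, n in support.items() if n >= min_reads}
```

A contig end that shows up in the result is a candidate for the 3' end of the genome; a contig with no supported end is more likely partial or an internal piece of the graph.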

@rcedgar @rchikhi @ababaian, clue me in on the biology that I'm butchering here.

taltman commented 3 years ago

... could this also identify the 3' end of alternative viral transcripts?

rchikhi commented 3 years ago

Note: all the reads I downloaded were immediately poly-A-trimmed by fastp for assembly, so unfortunately the alignments in darth.tar.gz can't be used to get those reads.

taltman commented 3 years ago

Going forward, we could record those reads' IDs before trimming and retain them for this sort of analysis. We could probably also re-download the reads for the handful of tough cases where we need some help disentangling the assembly graph.
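A rough sketch of that bookkeeping, assuming plain 4-line FASTQ input with well-formed headers and a hypothetical `min_tail` threshold of 10 A's: the first function collects the IDs of poly-A reads before any trimming happens, and the second pulls those same reads back out of a re-downloaded, untrimmed file.

```python
def read_fastq(lines):
    """Group an iterable of FASTQ lines into (header, seq, plus, qual) tuples."""
    rec = []
    for ln in lines:
        rec.append(ln.rstrip("\n"))
        if len(rec) == 4:
            yield tuple(rec)
            rec = []

def read_id(header):
    """Read ID is the first whitespace-delimited token after the '@'."""
    return header[1:].split()[0]

def polya_read_ids(fastq_lines, min_tail=10):
    """IDs of reads ending in at least `min_tail` A's (assumed threshold)."""
    tail = "A" * min_tail
    return {
        read_id(header)
        for header, seq, _, _ in read_fastq(fastq_lines)
        if seq.upper().endswith(tail)
    }

def select_reads(fastq_lines, wanted_ids):
    """Yield the untrimmed records whose ID is in `wanted_ids`."""
    for rec in read_fastq(fastq_lines):
        if read_id(rec[0]) in wanted_ids:
            yield rec
```

The ID set is small enough to store alongside each assembly as a text file, one ID per line.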