taltman closed 3 years ago
... could this also identify the 3' end of alternative viral transcripts?
Note: all the reads I downloaded were immediately poly-A-trimmed by fastp for assembly, so unfortunately we can't use the alignments in darth.tar.gz to get those reads.
Going forward, we could record those reads' IDs before trimming and retain them for this sort of analysis. We could probably also redownload the reads for the handful of tough cases where we need some help disentangling the assembly graph.
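A rough sketch of what recording those IDs before trimming might look like, in plain Python. The function names, the `min_tail` threshold of 10, and the choice to also flag reads starting with a poly-T run (the reverse complement of a poly-A tail) are all my assumptions, not something we have agreed on or that fastp does for us:

```python
import re
from typing import Iterable, Iterator, List, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (read_id, sequence) pairs from FASTQ lines (4 lines per record)."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate stray blank lines between records
        seq = next(it).strip()
        next(it)  # '+' separator line
        next(it)  # quality line (unused here)
        # The read ID is the first whitespace-delimited token after '@'.
        yield header.strip()[1:].split()[0], seq


def polya_read_ids(records: Iterable[Tuple[str, str]], min_tail: int = 10) -> List[str]:
    """Return IDs of reads ending in >= min_tail A's or starting with >= min_tail T's.

    min_tail is an arbitrary illustrative threshold (assumption).
    """
    tail = re.compile(r"A{%d,}$" % min_tail)
    head = re.compile(r"^T{%d,}" % min_tail)
    return [rid for rid, seq in records if tail.search(seq) or head.match(seq)]
```

The resulting ID list could be written to a file next to the trimmed reads, so it can be joined against alignments later.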
If we can clearly identify the start and the end of a CoV genome, we could be more confident in labeling a given assembly as complete or partial, and it would also help disentangle very complicated CoV assembly graphs. At first, we thought that we could use 5' and 3' UTR models, but these regions lack adequate conservation. While the rest of the 5' end is full of well-defined and predictable protein domains, the 3' end of the genome has a more complicated arrangement of smaller CDSs in varying order.
@asl and I discussed the following procedure:
When clusters of these poly-A-containing reads all align to the end of the same contig in an assembly, we can infer that this is the poly-A tail region. This could tell us where the 3' end of the genome is.
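To make the idea above concrete, here is a minimal sketch of the clustering step, assuming we already have a set of poly-A read IDs and their alignments as `(read_id, contig, start, end)` tuples in 0-based, half-open contig coordinates. The function name, the `window` of 50 bp, and the `min_reads` cutoff of 5 are hypothetical placeholders, and this ignores strand information that a real implementation would want to check:

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

Alignment = Tuple[str, str, int, int]  # (read_id, contig, start, end)


def candidate_three_prime_ends(
    alignments: Iterable[Alignment],
    contig_lengths: Dict[str, int],
    polya_ids: Set[str],
    window: int = 50,
    min_reads: int = 5,
) -> List[Tuple[str, str, int]]:
    """Report contig ends supported by a cluster of poly-A reads.

    Returns (contig, side, read_count) tuples, where side is "right" if the
    reads pile up at the end of the contig and "left" if they pile up at the
    start (which would suggest the contig is reverse-complemented).
    """
    support: Counter = Counter()
    for read_id, contig, start, end in alignments:
        if read_id not in polya_ids:
            continue  # only reads flagged as poly-A before trimming count
        if contig_lengths[contig] - end <= window:
            support[(contig, "right")] += 1
        elif start <= window:
            support[(contig, "left")] += 1
    return sorted(
        (contig, side, n) for (contig, side), n in support.items() if n >= min_reads
    )
```

A contig with a well-supported "right" or "left" end would be a candidate for carrying the genome's 3' end; contigs with no supported end are more likely partial or internal pieces of the graph.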
@rcedgar @rchikhi @ababaian, clue me in on the biology that I'm butchering here.