Closed schorlton closed 1 year ago
The sum isn't the same because each base of a given read can be assigned to more than one contig, especially when it aligns to the overlap between two contigs.
For example, the +++++ region of the following read would be double-counted:

               read
          -------+++++--------
======================
                 ================
      contig 1        contig 2
This double-counted region could be fractionally assigned to the two contigs so that the c value can be more reflective of the "true" read count. It is definitely something for me to work on in the future, but it is not a priority for me at the moment.
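To make the idea of fractional assignment concrete, here is a minimal sketch of how the read in the diagram above could be split between the two contigs. It is not how RNA-Bloom computes coverage; the function name `fractional_bases` and the tuple layout are hypothetical, and the intervals are read off the diagram (7 bases unique to contig 1, 5 shared bases, 8 bases unique to contig 2).

```python
from collections import defaultdict

def fractional_bases(read_len, alignments):
    """Split each read base evenly among the contigs that cover it.

    alignments: list of (contig, start, end) in read coordinates,
    0-based with an exclusive end. Hypothetical layout for illustration.
    """
    # For every base of the read, record which contigs it aligns to.
    covering = [set() for _ in range(read_len)]
    for contig, start, end in alignments:
        for i in range(max(0, start), min(read_len, end)):
            covering[i].add(contig)

    # A base shared by k contigs gives 1/k of a base to each of them.
    credit = defaultdict(float)
    for contigs in covering:
        for contig in contigs:
            credit[contig] += 1.0 / len(contigs)
    return dict(credit)

# The 20-base read from the diagram: bases 7-11 overlap both contigs.
alignments = [("contig 1", 0, 12), ("contig 2", 7, 20)]
naive = sum(end - start for _, start, end in alignments)
result = fractional_bases(20, alignments)

print(naive)   # 25 bases when the overlap is counted twice
print(result)  # {'contig 1': 9.5, 'contig 2': 10.5}
```

With this scheme the per-contig credits add up to the read length (20) instead of 25, which is the property the original question was expecting.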
Thanks. Would this situation only arise if we fail to assemble a full transcript? With the above diagram, I'd expect it to optimally merge these contigs. I can see how if there was only a single read bridging these transcripts, that may be weak evidence to merge them, but I'm wondering when else you would encounter this situation. The double counting of bases seems to be quite significant for the number of transcript overlaps I expected.
In the above diagram, contigs 1 and 2 can still be joined together into a transcript if there is sufficient overlap at the contig edges. Double-counting can also happen when reads map to more than one target sequence. For example, shorter reads representing exons shared between isoforms would likely multi-map.
Oh yeah, that makes sense. Then you get into the purpose of RNA-seq quantification tools. I suppose you're right, fractional coverage would be one solution to properly calculate coverage. Thanks for your insight and I hope you can bump this up your priority list just a bit!
Thanks again for your help!
When using RNA-Bloom, I expected (perhaps naively) that the sum of (coverage of each contig × length of the same contig) would not exceed the number of bases in my input dataset, or would at least be close to it, allowing for some margin of error (i.e., each read would contribute to the coverage of only a single transcript, since it originated from only one transcript). However, this does not seem to be the case.
Here is my calculation of input bases from the transcripts FASTA: assembly_info.xls

Sum of (coverage of each contig × length of same contig) = 354,613,891 bp

So: is a single read having its bases assigned to multiple transcripts, and therefore increasing the coverage of multiple transcripts?
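For anyone wanting to reproduce this kind of sum, a rough sketch follows. It assumes each FASTA header carries the contig length as `l=<int>` and the coverage as `c=<float>`; that header layout, the function name `total_assigned_bases`, and the example values are assumptions for illustration, not figures from this issue.

```python
import re

def total_assigned_bases(fasta_lines):
    """Sum coverage x length over all FASTA headers.

    Assumes headers such as '>name l=1000 c=2.5' (assumed layout).
    Headers missing either field are skipped.
    """
    total = 0.0
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        length = re.search(r"\bl=(\d+)", line)
        coverage = re.search(r"\bc=([0-9.]+)", line)
        if length and coverage:
            total += int(length.group(1)) * float(coverage.group(1))
    return total

# Made-up records, only to show the arithmetic: 1000*2.5 + 500*4.0
example = [">tx1 l=1000 c=2.5", "ACGT", ">tx2 l=500 c=4.0", "ACGT"]
total = total_assigned_bases(example)
print(total)  # 4500.0
```

Comparing this total against the number of bases in the input reads gives the discrepancy described above.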
Please report
Command and version is the same as in this issue: https://github.com/bcgsc/RNA-Bloom/issues/18