Open vals opened 8 years ago
Hi Valentine,
This is very interesting. I think the answer to your question is "yes", but I'm not quite sure how one would use this information yet. Ideally, the inferred NPM (nucleotides per million) distribution should match the electropherogram (perhaps minimizing some metric like KL-divergence or JS-divergence). The challenge is that this is still rather coarse-grained information, in that there are likely many different solutions that would match this distribution well. That being said, it certainly seems like one could use this distribution to inform oneself when divergent solutions are being inferred. At the very least, one could imagine placing a prior (at the start of inference) on transcripts according to the mass of their corresponding length bin in the electropherogram — this might initialize the inference in a manner more likely to concord with the observed distribution. There may, of course, be other, better ways to make use of this information as well!
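To make the two ideas above concrete, here is a minimal sketch, not anything that exists in the quantification tool. It assumes the electropherogram has already been reduced to a share of signal per length bin, and that the inferred abundances are available as TPMs. All names (`length_mass_distribution`, `js_divergence`, `prior_from_electropherogram`) and the toy numbers are hypothetical.

```python
import numpy as np

def length_mass_distribution(lengths, tpm, bin_edges):
    """Share of nucleotide mass per length bin (abundance x length, NPM-like)."""
    lengths = np.asarray(lengths, dtype=float)
    mass = np.asarray(tpm, dtype=float) * lengths
    hist, _ = np.histogram(lengths, bins=bin_edges, weights=mass)
    return hist / hist.sum()

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence in bits; 0 for identical distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def prior_from_electropherogram(lengths, bin_edges, electro_mass):
    """Spread each bin's observed mass evenly over the transcripts in that bin."""
    lengths = np.asarray(lengths, dtype=float)
    edges = np.asarray(bin_edges, dtype=float)
    # Interior edges give bin indices 0..n_bins-1, matching np.histogram.
    idx = np.digitize(lengths, edges[1:-1])
    counts = np.bincount(idx, minlength=len(edges) - 1)
    prior = np.asarray(electro_mass, dtype=float)[idx] / counts[idx]
    return prior / prior.sum()

# Toy data (made up): five transcripts, four length bins.
lengths = [500, 800, 1500, 2500, 4000]
tpm = [4e5, 3e5, 2e5, 0.5e5, 0.5e5]
bin_edges = [0, 1000, 2000, 3000, 5000]
electro = [0.4, 0.3, 0.1, 0.2]  # assumed binned electropherogram signal

inferred = length_mass_distribution(lengths, tpm, bin_edges)
score = js_divergence(inferred, electro)
prior = prior_from_electropherogram(lengths, bin_edges, electro)
```

The divergence `score` could serve as a diagnostic for a solution that drifts away from the observed distribution. The `prior` is a per-transcript share of nucleotide mass, so it would need converting to whatever parameterization the inference actually uses.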
A lot of the time, when we assess our samples before moving on to fragmenting the cDNA, we look at the distribution of full-length cDNA using a Bioanalyzer.
See for example panel a of this figure
With the reference transcriptome, we know the distribution of transcript lengths.
We can view the reference transcript length distribution as an unweighted distribution of lengths, and the electropherogram as the distribution obtained when weighting transcript lengths by their relative abundances.
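As a rough illustration of the two views, the sketch below bins a toy set of transcript lengths once without weights and once weighted by TPM. The numbers are invented, and the variable names are only for illustration. It is also an assumption on my part that the Bioanalyzer trace behaves like an abundance-weighted histogram; if the signal tracks nucleotide mass instead, the weights would be TPM times length.

```python
import numpy as np

# Toy reference: transcript lengths and made-up relative abundances (TPM).
ref_lengths = np.array([500, 800, 1500, 2500, 4000], dtype=float)
ref_tpm = np.array([4e5, 3e5, 2e5, 0.5e5, 0.5e5])
length_bins = [0, 1000, 2000, 3000, 5000]

# Unweighted: every reference transcript counts once.
unweighted, _ = np.histogram(ref_lengths, bins=length_bins)
unweighted = unweighted / unweighted.sum()

# Weighted: each transcript counts in proportion to its abundance.
weighted, _ = np.histogram(ref_lengths, bins=length_bins, weights=ref_tpm)
weighted = weighted / weighted.sum()

print("unweighted:", unweighted)
print("weighted:  ", weighted)
```

With these toy values the weighted histogram shifts toward the short bins because the short transcripts are the abundant ones, which is the kind of signal the electropherogram would carry.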
Thus it seems the distribution of full-length cDNA could be informative when inferring the TPMs (relative abundances) in a sample.
Do you think it would be possible to integrate this into the quantification model?