Motivation for using consensus

rorymaizels commented 2 years ago

Hi guys,

This is a follow-up to your responses in my previous issue #9 about the new consensus step, so it might not be totally appropriate for the issues section. I've read through your technical information update on readthedocs, and I find it very surprising that there are substantially different sequences with the same UMI. I had some questions:

1. Do you know if this occurs across multiple sequencing protocols (i.e. does this also happen in 10x data or sci-RNA-seq data), and how widespread is it across different transcripts?
2. Do you have any idea of the mechanism of how this might be happening?
3. Do you see empirically that using consensus significantly improves the detection of conversion sites compared to just using align?

I'm sure that the answers to these questions are all in the works, I just wondered if you could give some rough ideas of your thinking, so I know whether to use consensus for my current analysis. If you want to email me to discuss feel free, my email is rory.maizels@crick.ac.uk

Thanks, Rory

rorymaizels commented 2 years ago

Further to the above, could you share the code used to produce extended figure 1B, which shows the different transcript mappings and conversion coverages for a particular transcript?

Lioscro commented 2 years ago

Hi @rorymaizels, sorry for the delayed reply. I've discussed this issue briefly with my colleagues as well, and it seems like this is a common issue, especially for very deeply-sequenced libraries. However, I don't have any real data to show for this, since I did not work on the scNT-seq paper. I was merely citing their findings in the documentation. For your first two questions, it may be useful to reach out to the authors of that paper.

For your third question, it really depends on what method you use for UMI deduplication. Initially, UMI deduplication in dynast was performed as conversion-agnostically as possible. What I mean by this is that for a given set of reads with the same UMI, a single read was selected as the true RNA molecule based on the following criteria.

1. The read that maps to the transcriptome.
2. The read that has the highest alignment score.
3. Finally, the read that has the most of the conversion(s) of interest.
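
To make the selection concrete, here is a minimal sketch of it, assuming the three criteria are applied in order, with each later criterion only breaking ties left by the earlier ones. The `Read` record, its field names, and `select_representatives` are hypothetical names for illustration, not dynast's actual data model or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    # Hypothetical record; the field names are illustrative only.
    umi: str
    in_transcriptome: bool  # read maps to the transcriptome
    alignment_score: int    # aligner-reported score, higher is better
    n_conversions: int      # count of the conversion(s) of interest


def select_representatives(reads):
    """Pick one read per UMI using the three criteria as ordered tie-breakers."""
    groups = defaultdict(list)
    for read in reads:
        groups[read.umi].append(read)
    # Tuples compare element by element, so the key encodes the priority order.
    return {
        umi: max(
            group,
            key=lambda r: (r.in_transcriptome, r.alignment_score, r.n_conversions),
        )
        for umi, group in groups.items()
    }


reads = [
    Read("AAAC", True, 90, 0),
    Read("AAAC", True, 98, 0),
    Read("AAAC", False, 99, 2),  # has conversions, but is not in the transcriptome
    Read("GGTT", True, 95, 1),
    Read("GGTT", True, 95, 3),
]
chosen = select_representatives(reads)
```

In this toy input, the read chosen for UMI `AAAC` has no conversions even though another read with the same UMI does, because conversions are only considered last.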

However, we found that this led to a nearly 20% decrease in the number of labeled UMIs, compared to the data in the scNT-seq paper. The data also seemed far noisier in downstream analyses (@Xiaojieqiu should be able to explain more about this). This difference was due to the fact that in that paper, the authors explicitly took into account these "same UMI, different sequence" reads by identifying a UMI as labeled if any of its constituent reads had a conversion. This way of UMI deduplication, however, isn't perfect. You could imagine that if you sequence a library infinitely deeply, then just by chance you will get sequencing errors that match your conversion(s) of interest, and therefore every UMI will be called as labeled.
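
The depth argument can be illustrated with a rough sketch. It assumes each read of an unlabeled molecule independently has a small probability `p_error` of carrying a sequencing error that mimics a conversion. Both `p_error` and the independence assumption are simplifications introduced here for illustration, not measured values; under them, the chance that an unlabeled UMI is called labeled by the "any read" rule is 1 - (1 - p_error)^n for n reads.

```python
def is_labeled_any_read(conversions_per_read):
    """'Any read' rule: a UMI is labeled if any of its reads has a conversion."""
    return any(n > 0 for n in conversions_per_read)


def prob_false_label(n_reads, p_error):
    """Chance that at least one of n_reads shows a conversion-like error.

    Assumes errors are independent across reads (an illustrative simplification).
    """
    return 1.0 - (1.0 - p_error) ** n_reads


# Hypothetical per-read error probability, chosen only for illustration.
p_error = 0.01
false_label_by_depth = {n: prob_false_label(n, p_error) for n in (1, 10, 100, 1000)}
for n, p in false_label_by_depth.items():
    print(f"{n:>5} reads per UMI -> P(falsely labeled) = {p:.4f}")
```

The probability increases with the number of reads per UMI and approaches 1 at very high depth, which is the failure mode described above.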

That was a long-winded way of saying: yes, we did observe a significant difference in detection of labeled UMIs. In the data that I've seen so far, consensus matches very closely to the scNT-seq pipeline implementation (difference is <1% in the number of UMIs), but has the benefit that it deals with the problem explicitly and does not suffer from the same issue were the library to be sequenced infinitely deeply.


aristoteleo / dynast-release

Motivation for using consensus #11