COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0

Question on UMI deduplication / quantification of 3' tag data from bulk samples #306

Open tomsing1 opened 5 years ago

tomsing1 commented 5 years ago

tl;dr: 3' tag sequencing methods for bulk RNA samples contain known sample indices and UMIs and thus resemble scRNA-seq read formats. Do you have a recommendation on how to use Salmon and / or Alevin to quantify gene expression for this data type?

Congratulations on the recent alevin preprint! The new algorithm to deduplicate UMIs looks awesome. I am wondering if you had a recommendation on how to leverage it for 3' tag sequencing of bulk samples.

There are a number of protocols that focus on the 3' ends of transcripts to allow for cheap quantification of gene expression, e.g.

These methods combine conventional (known) sample-indices to label samples (or wells) with unique molecular identifiers (UMIs). (I found one question on this topic in the salmon issue tracker from back in 2016)

Here is the Drug-seq approach, for example:

[Image: Drug-seq]

The resulting read data resembles that of single-cell approaches and requires deduplication of UMIs and quantification based on reads with a strong 3' bias. It seems analysis of this data could benefit a lot from the algorithms implemented in Alevin.

Can this data be analyzed with Salmon and / or Alevin? Are there any pitfalls that I should be aware of?

Many thanks for any feedback - and thanks again for making these great tools available to the community.

k3yavi commented 5 years ago

Hi @tomsing1, apologies for the slow response; I was out of the country for a while.

Thanks for your kind words and for raising a very interesting suggestion. It's fascinating to see how methods used in single-cell RNA-seq are coming full circle back to bulk RNA-seq experiments. We have to do some more digging before we can state the caveats of using Alevin with the mentioned 3' bulk RNA-seq experiments clearly, but based on our understanding of the shared image we don't see any obvious show stoppers. The following concerns should nevertheless be kept in mind when using Alevin for bulk data deduplication:

Alevin solves the problem well for protocols in which the cDNA molecule is fragmented after PCR amplification. If fragmentation happens before amplification, there may be concerns about over-deduplication of UMIs.

In its current form, Alevin can take the Illumina sample indices as an external whitelist, but the user should be aware that Alevin performs a sequence-correction step before starting any optimization.

Alevin is designed for droplet-based protocols, in which one end of a paired-end read contains only the CB/UMI (i.e. no read sequence). It therefore can't make optimal use of the full paired-end information of a bulk 3' protocol when both ends contain read sequence, for example to resolve ambiguous mappings using a previously or empirically known approximate fragment length.
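To picture why a sequence-correction step matters when sample indices are passed as a whitelist, here is a minimal Python sketch. It is an illustration only and not Alevin's actual implementation: the function names, the Hamming-distance threshold of 1, and the example index sequences are all assumptions made for this example.

```python
from typing import Iterable, Optional


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def correct_to_whitelist(
    observed: str, whitelist: Iterable[str], max_dist: int = 1
) -> Optional[str]:
    """Map an observed barcode to a whitelist entry (hypothetical helper).

    Returns the barcode itself on an exact match, the single whitelist
    entry within `max_dist` mismatches otherwise, and None when there is
    no candidate or the choice is ambiguous.
    """
    entries = set(whitelist)
    if observed in entries:
        return observed
    candidates = [
        w for w in entries
        if len(w) == len(observed) and hamming(observed, w) <= max_dist
    ]
    # Only correct when exactly one whitelist entry is close enough.
    return candidates[0] if len(candidates) == 1 else None


# Made-up sample indices, purely for illustration.
sample_indices = {"ACGTACGT", "TTTTCCCC"}
print(correct_to_whitelist("ACGTACGA", sample_indices))
print(correct_to_whitelist("GGGGGGGG", sample_indices))
```

Under this simplified scheme, a read whose index is one mismatch away from two different sample indices cannot be assigned, so sample indices that are close to each other in sequence deserve a second look before they are used as a whitelist.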

We would be more than happy to help and to discuss how the results look for bulk 3' tagged protocols, or to hear any particular suggestions for improving Alevin.

antgomo commented 5 years ago

I am also interested in this approach. I have paired-end bulk RNA-seq data with UMIs to avoid duplicates. I have three FASTQ files per sample: one with the UMIs (1) and two with the paired-end reads (2 and 3). My question is whether I can use alevin in this way:

salmon alevin -l ISR -1 UMI.fq.gz -2 Sample_read_1.fq.gz Sample_read_2.fq.gz

Thanks in advance

ChenfuShi commented 5 years ago

Is there any plan to support this in salmon? We also have data generated using Quant-seq with UMIs, and we have quite a few duplicates. What would you do? Thanks!

nsmackler commented 4 years ago

I second this. Any chance this will be possible? All it requires is passing a UMI FASTQ and an R1 (or R2) FASTQ from the 3' sequence. The additional bells and whistles for cellular barcodes can be dropped, so basically it's like a combination of salmon align and alevin to remove duplicate UMIs from reads mapped to the same gene/transcript.
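The simplest form of what is being asked for can be sketched in a few lines of Python. This is a naive exact-match illustration, not how salmon or alevin works: the function name and the input format (one `(gene, umi)` pair per mapped read) are assumptions for this example, and it ignores both sequencing errors in UMIs and reads that map to more than one gene.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def count_unique_umis(records: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count distinct UMIs per gene (hypothetical helper).

    `records` holds one (gene, umi) pair per mapped read. Reads that share
    both gene and UMI are treated as PCR duplicates and counted once.
    """
    umis_per_gene: Dict[str, Set[str]] = defaultdict(set)
    for gene, umi in records:
        umis_per_gene[gene].add(umi)
    return {gene: len(umis) for gene, umis in umis_per_gene.items()}


# Made-up reads: two of the geneA reads carry the same UMI.
reads = [
    ("geneA", "AACC"),
    ("geneA", "AACC"),
    ("geneA", "GGTT"),
    ("geneB", "AACC"),
]
print(count_unique_umis(reads))
```

Here the two `geneA` reads with UMI `AACC` collapse into a single molecule, while the same UMI observed on `geneB` is counted separately.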

karl616 commented 3 years ago

I would also be interested in a feature like this.