I've previously been processing my data using Salmon Alevin. We've just got our results back from a SevenBridges analysis of our single-cell sequencing data, and there's a substantial difference in the PCR duplicate rate between the two platforms. Here are the numbers for SevenBridges (the three samples were analysed separately):
Sample,Aligned_Reads,Molecules,RSEC_Molecules,Raw_Depth,RSEC_Depth,Saturation
S1,1152169399,251097626,226949623,4.59,5.08,93.82
S2,1021623258,258701917,238156437,3.95,4.29,92.49
S3,1232889654,268962282,242417971,4.58,5.09,93.72
In other words, a PCR duplication rate of 4-5X (the Raw_Depth and RSEC_Depth columns), depending on which statistic is chosen.
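As a sanity check on my reading of those columns (assuming Raw_Depth = Aligned_Reads / Molecules and RSEC_Depth = Aligned_Reads / RSEC_Molecules), the depth values can be reproduced from the counts:

```python
# Reproduce the SevenBridges depth columns from the read/molecule counts,
# assuming depth = reads per (deduplicated) molecule.
samples = {
    "S1": (1152169399, 251097626, 226949623),
    "S2": (1021623258, 258701917, 238156437),
    "S3": (1232889654, 268962282, 242417971),
}
for name, (reads, molecules, rsec_molecules) in samples.items():
    print(name, round(reads / molecules, 2), round(reads / rsec_molecules, 2))
# S1 4.59 5.08
# S2 3.95 4.29
# S3 4.58 5.09
```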
For salmon alevin, here's our metadata* for the whole run:
* total reads have been corrected by adding 2^32.
From my understanding of the numbers, this suggests a PCR duplication rate of either 8.95X (comparing "used reads" with "deduplicated umis") or 6.16X (comparing "reads in eqclasses" with "deduplicated umis").
I had a go at "manually" counting (via awk/grep/sort/uniq/etc.) the cell barcodes and UMIs associated with the MT-ND1 gene:
ND1-matching reads with good cell barcode QC (barcodes linkable to expected cell barcodes), and good mapping (>60bp matching): 5292134
Unique cell/UMI pairs (subset from the above data): 1590232
=> PCR duplication rate ~= 3.32 reads per unique molecule
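The counting logic amounts to something like the following (a minimal Python sketch of my awk/grep/sort/uniq pipeline; the input file name and the tab-separated barcode/UMI format are assumptions for illustration):

```python
# Count reads and unique (cell barcode, UMI) pairs for the ND1 subset.
# Assumes nd1_pairs.tsv (hypothetical) holds one "<cell barcode>\t<UMI>"
# line per ND1-matching read that passed the barcode and mapping filters.
reads = 0
pairs = set()
with open("nd1_pairs.tsv") as handle:
    for line in handle:
        barcode, umi = line.rstrip("\n").split("\t")
        reads += 1
        pairs.add((barcode, umi))

print(f"reads: {reads}")
print(f"unique cell/UMI pairs: {len(pairs)}")
print(f"duplication rate: {reads / len(pairs):.2f} reads per unique molecule")
```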
So it seems like the duplication rate might be closer to the SevenBridges estimate than the Alevin estimate.
The Alevin documentation seems to suggest that cell/transcript/UMI triplets are being used for UMI collapsing ("identify potential PCR/sequencing errors in the UMIs, and performs hybrid de-duplication while accounting for UMI collisions"), so I'm at a loss as to why the difference is this large.
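For what it's worth, if the triplet keying is taken at face value, deduplicating per transcript should produce more molecules (and so a lower duplication rate) than my per-gene cell/UMI count, which is the opposite direction to what I'm seeing. A hypothetical illustration (made-up UMIs and transcript IDs):

```python
# One cell's reads as (UMI, transcript, gene) records (hypothetical data).
reads = [
    ("AACGT", "tx1", "MT-ND1"),
    ("AACGT", "tx2", "MT-ND1"),  # same cell/UMI, different transcript
    ("TTTGC", "tx1", "MT-ND1"),
]

by_gene = {(umi, gene) for umi, tx, gene in reads}
by_transcript = {(umi, tx) for umi, tx, gene in reads}

print(len(by_gene))        # 2 molecules keyed on (UMI, gene)
print(len(by_transcript))  # 3 molecules keyed on (UMI, transcript)
```

So the triplet keying alone doesn't seem able to explain a higher alevin duplication estimate; presumably the error-aware collapsing matters more.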
I'm not too worried about numbers being different, but the UMI proportion is a concern to me, because sequencing is expensive, and it'd be good to have an idea of whether we need to further improve our library prep (which would certainly be the case for 9X duplication, but less important with 3-5X duplication).
Is anyone able to explain how the UMI deduplication is applied in Salmon Alevin, or why it might be different from other analysis platforms?
Which version of salmon was used? v1.9.0

How was salmon installed (compiled, downloaded executable, through bioconda)? Binary download from github

Which reference (e.g. transcriptome) was used? Gencode Human v42

Expected behavior
I would expect the Salmon Alevin PCR duplication estimate to be similar to that of SevenBridges, or my semi-manual attempt (e.g. 3-5X).
Desktop OS/Version:

Linux musculus 5.18.0-4-amd64 #1 SMP PREEMPT_DYNAMIC Debian 5.18.16-1 (2022-08-10) x86_64 GNU/Linux
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux bookworm/sid
Release: unstable
Codename: sid
Additional Information

The SevenBridges Rhapsody pipeline uses STAR v2.5.2b for annotating R2 reads, which are combined with R1 reads using custom python scripts (possibly involving calls to HTSeq and Picard), then collapsed down to molecules using further custom python scripts (apparently to "apply rsec and dbec correction to the (sorted) valid annotation dataset").
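For comparison, my understanding of the RSEC step (an assumption based on BD's description of "recursive substitution error correction", not their actual code) is that a UMI gets folded into a more abundant UMI one substitution away, roughly:

```python
from collections import Counter

def hamming1(a: str, b: str) -> bool:
    """True if equal-length strings a and b differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

# Hypothetical UMIs observed for one cell/gene combination.
umis = ["AACGT", "AACGT", "AACGT", "AACGA", "TTTGC", "TTTGC"]

counts = Counter(umis)
kept = set(counts)
for umi in sorted(counts, key=counts.get):  # least abundant first
    if any(counts[other] > counts[umi] and hamming1(umi, other) for other in counts):
        kept.discard(umi)  # treat as an error copy of the more abundant UMI

print(f"exact-match molecules: {len(counts)}")      # 3
print(f"after one-mismatch collapse: {len(kept)}")  # 2
```

If alevin's hybrid de-duplication collapses UMIs more aggressively than this (e.g. by also using the equivalence-class structure), that could produce a lower deduplicated-UMI count and hence a higher apparent duplication rate; but that's speculation on my part.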