COMBINE-lab / salmon


UMI deduplication inconsistencies compared to SevenBridges analysis #808

Open gringer opened 2 years ago

gringer commented 2 years ago

I've previously been processing my data with salmon alevin. We've just got results back from a SevenBridges analysis of our single-cell sequencing data, and there's a substantial difference in PCR duplicate rate between the two platforms. Here are the numbers from SevenBridges (three samples were analysed separately):

| Sample | Aligned_Reads | Molecules | RSEC_Molecules | Raw_Depth | RSEC_Depth | Saturation |
|--------|---------------|-----------|----------------|-----------|------------|------------|
| S1 | 1152169399 | 251097626 | 226949623 | 4.59 | 5.08 | 93.82 |
| S2 | 1021623258 | 258701917 | 238156437 | 3.95 | 4.29 | 92.49 |
| S3 | 1232889654 | 268962282 | 242417971 | 4.58 | 5.09 | 93.72 |

In other words, a PCR duplication rate of 4-5X (the Raw_Depth and RSEC_Depth columns), depending on which statistic is chosen.
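(Those depth columns appear to be simple reads-per-molecule ratios; a quick sanity check under that assumption, which is my reading of the table rather than anything documented here:)

```python
# Assumption: Raw_Depth and RSEC_Depth are Aligned_Reads divided by the
# raw / RSEC-corrected molecule counts respectively.
samples = {
    "S1": (1152169399, 251097626, 226949623),
    "S2": (1021623258, 258701917, 238156437),
    "S3": (1232889654, 268962282, 242417971),
}
for name, (reads, molecules, rsec_molecules) in samples.items():
    print(f"{name}: raw depth {reads / molecules:.2f}, "
          f"RSEC depth {reads / rsec_molecules:.2f}")
# S1: raw depth 4.59, RSEC depth 5.08  -- matches the reported columns
```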

For salmon alevin, here's our metadata* for the whole run:

```json
{
    "total_reads": 4579183639,
    "reads_with_N": 59361,
    "noisy_cb_reads": 977033360,
    "noisy_umi_reads": 152773,
    "used_reads": 3601938145,
    "mapping_rate": 54.155827599470509,
    "reads_in_eqclasses": 2479894797,
    "total_cbs": 37916487,
    "used_cbs": 137766,
    "no_read_mapping_cbs": 1,
    "final_num_cbs": 136196,
    "deduplicated_umis": 402386555,
    "mean_umis_per_cell": 2954,
    "mean_genes_per_cell": 896
}
```

* total reads have been corrected by adding 2^32.

From my understanding of these numbers, they suggest a PCR duplication rate of either 8.95X (`used_reads` / `deduplicated_umis`) or 6.16X (`reads_in_eqclasses` / `deduplicated_umis`).
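Spelling out that arithmetic, with the values taken from the JSON above:

```python
# Duplication-rate estimates from the alevin metadata.
used_reads = 3601938145
reads_in_eqclasses = 2479894797
deduplicated_umis = 402386555

print(f"{used_reads / deduplicated_umis:.2f}")          # 8.95
print(f"{reads_in_eqclasses / deduplicated_umis:.2f}")  # 6.16
```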

I had a go at trying to "manually" count (via awk/grep/sort/uniq/etc.) the cell barcodes and UMIs associated with the MT-ND1 gene:

- ND1-matching reads with good cell barcode QC (barcodes linkable to expected cell barcodes) and good mapping (>60 bp matching): 5292134
- Unique cell/UMI pairs (subset of the above reads): 1590232
- => PCR duplication rate ≈ 5292134 / 1590232 ≈ 3.33 reads per unique molecule
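(For reference, a minimal Python equivalent of that shell pipeline, assuming a hypothetical two-column TSV `nd1_pairs.tsv` with one `<cell_barcode>\t<UMI>` line per ND1-matching read. Note that, like `sort | uniq`, this counts exact-match pairs only, with no error correction:)

```python
# Count reads and exact-match (cell barcode, UMI) pairs.
reads = 0
pairs = set()
with open("nd1_pairs.tsv") as handle:  # hypothetical extract of the reads
    for line in handle:
        barcode, umi = line.rstrip("\n").split("\t")
        reads += 1
        pairs.add((barcode, umi))

print(f"reads: {reads}, unique pairs: {len(pairs)}, "
      f"duplication rate: {reads / len(pairs):.2f}")
```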

So it seems like the duplication rate might be closer to the SevenBridges estimate than the Alevin estimate.

The Alevin documentation seems to suggest that cell/transcript/UMI triplets are being used for UMI collapsing ("identify potential PCR/sequencing errors in the UMIs, and performs hybrid de-duplication while accounting for UMI collisions"), so I'm at a loss as to why the difference is this large.
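One thing I can see that might account for part of the gap: exact `sort | uniq` counting treats every distinct UMI sequence as a separate molecule, whereas error-aware deduplication merges UMIs within a small edit distance, which always yields fewer molecules (and hence a higher apparent duplication rate). As I understand it Alevin's actual procedure is more involved (per cell/gene UMI graphs with parsimonious resolution of ambiguous reads), but here's a minimal sketch of the general idea, in the spirit of a one-mismatch greedy collapse rather than Alevin's exact algorithm:

```python
from collections import Counter

def collapse_umis(umi_counts: Counter) -> int:
    """Greedy one-mismatch UMI collapse for a single cell/gene: visit UMIs
    from most to least frequent and absorb any later UMI within Hamming
    distance 1 of one already kept. Assumes equal-length UMIs.
    Returns the number of molecules after collapsing."""
    def hamming1(a: str, b: str) -> bool:
        return sum(x != y for x, y in zip(a, b)) == 1

    survivors = []
    for umi, _ in umi_counts.most_common():
        if any(hamming1(umi, kept) for kept in survivors):
            continue  # treated as a sequencing error of a kept UMI
        survivors.append(umi)
    return len(survivors)

# Toy example: four distinct UMI sequences, two of which look like
# 1 bp errors of the abundant ones.
counts = Counter({"ACGTACGT": 90, "ACGTACGA": 3, "TTTTCCCC": 40, "TTTTCCCG": 2})
print(collapse_umis(counts))  # 2 molecules, versus 4 by exact matching
```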

I'm not too worried about the numbers simply being different, but the UMI proportion is a concern: sequencing is expensive, and it would be good to know whether we need to further improve our library prep (which would certainly be warranted at 9X duplication, but is much less pressing at 3-5X duplication).

Is anyone able to explain how the UMI deduplication is applied in Salmon Alevin, or why it might be different from other analysis platforms?

To Reproduce

```bash
salmon alevin -l ISR \
  -1 $(ls demultiplexed/squished_${machineID}*_R1_001.fastq.gz | sort) \
  -2 $(ls demultiplexed/${machineID}*_R2_001.fastq.gz | sort) \
  --mrna ${indexDir}/mt_genes.txt --rrna ${indexDir}/rRNA_genes.txt \
  -i ${indexDir}/${indexName} --expectCells ${expectCellCount} --whitelist goodSampleTagCells.txt \
  -p 10 -o salmon_1.9_cbc_whitelist_${projectID}_combined --tgMap ${indexDir}/txp2gene_${targetName}.txt \
  --umi-geometry '1[28-35]' --bc-geometry '1[1-27]' --read-geometry '2[1-end]'
```
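(For anyone unfamiliar with the geometry flags: under this layout R1 carries a 27 bp cell barcode followed by an 8 bp UMI, and R2 (`2[1-end]`) is the cDNA read. A tiny sketch of the slicing implied by those flags, my illustration rather than alevin code:)

```python
# Split an R1 sequence according to the flags above:
#   --bc-geometry '1[1-27]'   -> cell barcode = R1 bases 1-27 (1-based, inclusive)
#   --umi-geometry '1[28-35]' -> UMI = R1 bases 28-35
def split_r1(seq: str) -> tuple[str, str]:
    barcode = seq[:27]   # bases 1-27
    umi = seq[27:35]     # bases 28-35
    return barcode, umi

# e.g. a 60 bp R1 read yields a 27 bp barcode and an 8 bp UMI;
# anything past base 35 of R1 is ignored for barcode/UMI purposes.
```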

Expected behavior

I would expect the Salmon Alevin PCR duplication estimate to be similar to that of SevenBridges, or my semi-manual attempt (e.g. 3-5X).

Desktop OS/Version:

```
Linux musculus 5.18.0-4-amd64 #1 SMP PREEMPT_DYNAMIC Debian 5.18.16-1 (2022-08-10) x86_64 GNU/Linux

No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux bookworm/sid
Release:        unstable
Codename:       sid
```

Additional Information

The SevenBridges Rhapsody pipeline uses STAR v2.5.2b for annotating R2 reads, which are combined with R1 reads using custom Python scripts (possibly involving HTSeq and Picard), then collapsed down to molecules using further custom Python scripts (described as "apply rsec and dbec correction to the (sorted) valid annotation dataset").