COMBINE-lab / simpleaf

A rust framework to make using alevin-fry even simpler
BSD 3-Clause "New" or "Revised" License
44 stars 3 forks source link

sequencing saturation #157

Closed wangjiawen2013 closed 1 month ago

wangjiawen2013 commented 1 month ago

Hi, I can get sequencing saturation curve using cellranger from 10x genomics single cell data. How to get it using simpleaf ? This is the definition of sequencing saturation (https://kb.10xgenomics.com/hc/en-us/articles/115003646912-How-is-sequencing-saturation-calculated):

How is sequencing saturation calculated?
Question: How is "Sequencing Saturation" calculated?

Answer: The web_summary.html output from cellranger count includes a metric called "Sequencing Saturation". This metric quantifies the fraction of reads originating from an already-observed UMI. More specifically, this is the fraction of confidently mapped, valid cell-barcode, valid UMI reads that are non-unique (match an existing cell-barcode, UMI, gene combination).

The formula for calculating this metric is as follows:

Sequencing Saturation = 1 - (n_deduped_reads / n_reads)

where

n_deduped_reads = Number of unique (valid cell-barcode, valid UMI, gene) combinations among confidently mapped reads. 

n_reads = Total number of confidently mapped, valid cell-barcode, valid UMI reads.

Note that the numerator of the fraction is n_deduped_reads, not the non-unique reads that are mentioned in the definition. n_deduped_reads is a degree of uniqueness, not a degree of duplication/saturation. Therefore we take the complement of (n_deduped_reads / n_reads) to measure saturation.

And this is an example plot: image

wangjiawen2013 commented 1 month ago

And this is the way to generate the plot:

This plot shows the Sequencing Saturation metric as a function of downsampled sequencing depth in mean reads per cell, up to the observed sequencing depth. Sequencing Saturation is a measure of the observed library complexity, and approaches 1.0 (100%) when all converted mRNA transcripts have been sequenced. The slope of the curve near the endpoint can be interpreted as an upper bound to the benefit to be gained from increasing the sequencing depth beyond this point. The dotted line is drawn at a value reasonably approximating the saturation point.

Could we subsample the RAD file outputed by simpleaf ?

rob-p commented 1 month ago

Hi @wangjiawen2013,

The necessary values to compute these plots are in the set of output features of alevin-fry. I'm cc'ing @ygao61 and @csoneson who have created such plots from the alevin-fry output before and who can instruct you as to the best way to generate such a plot.

--Rob

ygao61 commented 1 month ago

Hi @wangjiawen2013,

Below are the steps you can follow to generate the sequence saturation plot:

--Yuan

wangjiawen2013 commented 1 month ago

Got it. Thanks!