Sequence saturation UMI index & barcode rank features

danmoore1987 commented 5 years ago

Hi @Hoohm, thank you again for the easy-to-use package!

Just a suggested enhancement that I think a lot of people might be interested in. It would be cool if the CITE-seq-Count report workflow also generated barcode rank and UMI saturation index plots/CSVs! :)

Hoohm commented 5 years ago

Hello @danmoore1987, why not, would you have any specific example to link here so that I can take a look?

danmoore1987 commented 5 years ago

Hi @Hoohm, Thanks for the reply!

Essentially some of the Cell Ranger outputs. The first is a barcode rank plot, to measure how many UMIs were assigned per barcode.
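As a rough illustration of the barcode rank idea, the sketch below counts UMIs per cell barcode and orders barcodes from most to fewest UMIs. The function name `barcode_rank` and the input shape (an iterable of already deduplicated `(cell_barcode, umi)` pairs) are assumptions made for this example; they are not part of CITE-seq-Count or Cell Ranger.

```python
from collections import Counter


def barcode_rank(umi_records):
    """Return (rank, barcode, n_umis) tuples, highest UMI count first.

    umi_records: iterable of (cell_barcode, umi) pairs that are assumed
    to be deduplicated already (hypothetical input shape).
    """
    per_barcode = Counter(barcode for barcode, _umi in umi_records)
    # Sort by descending count; break ties alphabetically for stable output.
    ordered = sorted(per_barcode.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, bc, n) for rank, (bc, n) in enumerate(ordered, start=1)]


records = [
    ("AAA", "u1"), ("AAA", "u2"),
    ("CCC", "u1"),
    ("GGG", "u1"), ("GGG", "u2"), ("GGG", "u3"),
]
ranks = barcode_rank(records)
```

Plotting rank against UMI count on log-log axes gives the usual "knee" plot; writing `ranks` to a CSV would cover the tabular output requested here.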

The second is this. It's so we know if we need to sequence the CITE library deeper. Capture_saturation index

Cell Ranger uses this formula for it:

Sequencing Saturation = 1 - (n_deduped_reads / n_reads)

where:

n_deduped_reads = Number of unique (valid cell-barcode, valid UMI, gene) combinations among confidently mapped reads.

n_reads = Total number of confidently mapped, valid cell-barcode, valid UMI reads.
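The formula can be sketched in a few lines of Python. This is only an illustration under assumptions: `sequencing_saturation` is a hypothetical helper, and each read is assumed to be already filtered (confidently mapped, valid barcode, valid UMI) and represented as a `(cell_barcode, umi, feature)` tuple.

```python
def sequencing_saturation(reads):
    """Compute 1 - (n_deduped_reads / n_reads).

    reads: iterable of (cell_barcode, umi, feature) tuples, one per read.
    Filtering to valid, confidently mapped reads is assumed to have
    happened upstream (an assumption of this sketch).
    """
    reads = list(reads)
    n_reads = len(reads)
    if n_reads == 0:
        raise ValueError("saturation is undefined for zero reads")
    # Unique (barcode, UMI, feature) combinations.
    n_deduped_reads = len(set(reads))
    return 1 - n_deduped_reads / n_reads


example_reads = [
    ("AAA", "u1", "HTO1"),
    ("AAA", "u1", "HTO1"),  # duplicate of the read above
    ("AAA", "u2", "HTO1"),
    ("AAA", "u2", "HTO1"),  # duplicate of the read above
]
saturation = sequencing_saturation(example_reads)  # 1 - 2/4 = 0.5
```

A value near 0 means most reads are new molecules, so deeper sequencing would likely find more; a value near 1 means additional reads would mostly be duplicates.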

Cheers!

bbimber commented 3 years ago

Just want to echo: this would be quite useful to know. We're actively trying to sort out whether some of our lesser libraries would benefit from more sequencing depth.

I believe the run_report gives data to calculate saturation, at least globally, right? I think it would be valid to use 'Reads processed' as n_reads (we could adjust by percentage mapped?), and 'UMIs corrected' as n_deduped_reads, correct?

The per-cell plot above is informative. Presumably one could read/merge the 'read_count' folder and 'umi_count' folders to accomplish this, right?
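A minimal sketch of that merge, assuming the read and UMI counts have already been loaded into nested dictionaries of the form `{barcode: {tag: count}}`. Loading the actual 'read_count' and 'umi_count' folders is left out, and `per_cell_saturation` is a hypothetical name; the cellhashR R code linked later in this thread is the real implementation.

```python
def per_cell_saturation(read_counts, umi_counts):
    """Return {barcode: saturation} using 1 - (UMIs / reads) per cell.

    read_counts, umi_counts: {barcode: {tag: count}} dictionaries
    (assumed shape, not a CITE-seq-Count data structure).
    Cells with zero reads get None because the ratio is undefined.
    """
    result = {}
    for barcode, tag_reads in read_counts.items():
        n_reads = sum(tag_reads.values())
        n_umis = sum(umi_counts.get(barcode, {}).values())
        result[barcode] = None if n_reads == 0 else 1 - n_umis / n_reads
    return result


read_counts = {"AAA": {"HTO1": 8, "HTO2": 2}, "CCC": {"HTO1": 4}}
umi_counts = {"AAA": {"HTO1": 4, "HTO2": 1}, "CCC": {"HTO1": 4}}
per_cell = per_cell_saturation(read_counts, umi_counts)
# AAA: 1 - 5/10 = 0.5, CCC: 1 - 4/4 = 0.0
```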

bbimber commented 3 years ago

I need to sanity-check the data, but this is derived from combining the umi_count and read_count folders:

the R code is here: https://github.com/BimberLab/cellhashR/blob/2211878b792d7c0c5ff48e4183cdcd7a44dec8b8/R/Preprocessing.R#L278

danmoore1987 commented 3 years ago

This is great @bbimber !

I also checked out the rest of your cellhashR package for post-processing QC of libraries. Can't wait to give it a go! :)

bbimber commented 3 years ago

@danmoore1987 yes, I'm still surprised there aren't more tools that do what we're trying to do in cellhashR. We'd welcome any feedback. Part of my goal with cellhashR is specifically to compare across different calling algorithms, since we find some do better or worse with different inputs.

With respect to saturation in particular, it would be great if you could confirm the tool is giving you believable values. I was surprised how non-saturated our libraries often were, but this wasn't something I had been tracking.

Hoohm commented 3 years ago

Ok, folks, I'm on holiday!!!

Let me take a look since I'm gonna work on this damn 1.5 release!!!

I'll keep you posted :)

Thanks for the code!

bbimber commented 3 years ago

@Hoohm No worries - I actually think we implemented this in cellhashR; however, I'd love to figure out features that make this work synergistically with Cite-Seq-Count.

Hoohm commented 3 years ago

Yes! That would be amazing. Can you send me an email so we can have a quick chat these days maybe?

Hoohm commented 3 years ago

Ok, so 1.5.0 is nearly finished. Running some tests on datasets to see how it matches the older version.

For your specific needs, here is a non-exhaustive list of changes that affect your code:

MTX format has changed. The first column is now the TAG sequence, the second column the feature name. This means that Read10X runs by default on the right column (gene.column=2).
UMI MTX counts as well as the dense matrix have dropped the unmapped feature.
For technologies such as 10x v3, which uses two different barcodes for each cell when running mRNA and protein data, you can now provide the translation reference: first column, cell barcode in the mRNA data; second column, cell barcode in the protein data. The MTX outputs will then have two columns in the barcodes.tsv file: the first (default) is the mRNA barcode, the second is the protein barcode.

I think these are the only ones affecting your code, but I might be missing something. Let me know :)

bbimber commented 3 years ago

@Hoohm is there a heuristic that code can use to determine what format of input it's getting? For example, if we have a function like processCiteSeqCount(outputFolder), can this code automatically figure out what format it was passed?

Hoohm commented 3 years ago

Not sure which format you are referring to.

If you are talking about the translated version, then yes, the barcodes.tsv will hold two columns instead of one.
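A check along these lines could count the tab-separated columns in barcodes.tsv. The sketch below works on lines that have already been read from the file (decompression and file paths are left out), and `is_translated_barcodes` is a hypothetical helper, not a CITE-seq-Count function.

```python
def is_translated_barcodes(lines):
    """Return True if every non-empty barcodes.tsv line has two columns.

    lines: iterable of text lines from barcodes.tsv (already read and
    decompressed by the caller; an assumption of this sketch).
    """
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    if not rows:
        raise ValueError("barcodes file is empty")
    return all(len(row) == 2 for row in rows)


plain = ["AAACCTG\n", "AAACGGG\n"]
translated = ["AAACCTG\tTTTGGAC\n", "AAACGGG\tTTTGCCC\n"]
```

Here `is_translated_barcodes(plain)` is False and `is_translated_barcodes(translated)` is True; the barcode strings are made up for the example.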

bbimber commented 3 years ago

Maybe I misunderstood, but in your prior post didn't you say the MTX format is changing in version 1.5.0? Ideally, I would like cellhashR::ProcessCountMatrix() to just work with either the output from Cite-Seq-Count 1.5.0 or prior versions. I suppose I could read the matrix into memory with gene.column=1, test for the presence of 'unmapped', and if it's not present re-read using gene.column=2?
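The heuristic proposed here could look roughly like the sketch below, written in Python for illustration rather than in cellhashR's R. It operates on feature rows that have already been split into columns; `pick_feature_column` is a hypothetical name, the returned index is 1-based to mirror Read10X's gene.column, and the single-column guard is an extra assumption not discussed in this thread.

```python
def pick_feature_column(feature_rows):
    """Guess which 1-based feature column to use.

    feature_rows: list of lists, one per row of the features file.
    Returns 1 if 'unmapped' appears in the first column (taken to
    indicate pre-1.5.0 output), otherwise 2.
    """
    if not feature_rows:
        raise ValueError("no feature rows given")
    # Extra guard (assumption): with a single column there is no column 2.
    if any(len(row) < 2 for row in feature_rows):
        return 1
    first_column = {row[0] for row in feature_rows}
    return 1 if "unmapped" in first_column else 2


old_style = [["HTO1-ACGTACGT", "x"], ["unmapped", "x"]]
new_style = [["ACGTACGT", "HTO1"], ["TTGCATGC", "HTO2"]]
```

With these made-up rows, `pick_feature_column(old_style)` returns 1 and `pick_feature_column(new_style)` returns 2. The heuristic would misfire if a pre-1.5.0 run happened to contain no 'unmapped' row, so it should be checked against real outputs from both versions.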

Hoohm commented 3 years ago

It's not changing that much.

I would really love to have a chat on Zoom with you. It would be interesting to have a back-and-forth about this, since I'm not completely fixed on everything.

bbimber commented 3 years ago

Sure, I would be happy to. I didn't realize you worked at 10x until I googled your name just now. My email is bimber@ohsu.edu

Hoohm commented 1 month ago

Closing this for now.
