broadinstitute / CellBender

CellBender is a software package for eliminating technical artifacts from high-throughput single-cell RNA sequencing (scRNA-seq) data.
https://cellbender.rtfd.io
BSD 3-Clause "New" or "Revised" License

Cellbender on multiplexed chemistries #373

Open LinearParadox opened 4 months ago

LinearParadox commented 4 months ago

I have a question about running CellBender on multiplexed chemistries such as 10x Flex and CITE-seq. With 10x Flex, at least, you have a pool of probes that can barcode several samples; the probe pools come in 4-sample and 16-sample sizes. I was curious whether you should run CellBender on the demultiplexed outputs (meaning cellbender on sample1-Pool1, cellbender on sample2-Pool1, etc.), or on the file containing all the probes in the pool (meaning cellbender on Pool1.h5ad) and then separate the samples afterwards. Intuitively, I would expect the second option to perform better, because the entire probe pool is loaded onto the machine together and is essentially one run. However, I get some weird results when I do this:

On the one hand, I get a run that looks like it went OK training-wise, but seems to call too many cells:

[image] [image]

On the other hand, I get runs that seem to have gone pretty poorly, with steep drops in the training curve. I'm wondering whether this is due to parameters that need to be adjusted to account for a larger sample, or whether we should run it on demultiplexed samples only.

JThomasWatson commented 6 days ago

Hi. I'm not a member of the CellBender team, but I recently had the same issue with scFRP data. I do think it wants to operate on a per-sample basis. This plot is from running one of our pools all at once: [image: Untitled] After changing to running CellBender on one sample at a time, the plots look much more like what you'd expect. The dips in the training curve you describe were also present in our pooled run, but went away after switching to single-sample runs. [image: Untitled-1] I suspect the distributions of endogenous and exogenous counts get blurry when using a pool's aggregated data, due to between-sample variance.
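
For anyone who wants to try the one-sample-at-a-time approach, here is a minimal sketch of how the runs could be scripted. It assumes you already have one raw (unfiltered) count matrix per demultiplexed sample; the sample names, file paths, and the helper `build_cellbender_commands` are hypothetical and only for illustration. It uses only the `--input`, `--output`, and `--cuda` options of `cellbender remove-background` and builds the commands without running them, so you can inspect them first.

```python
from pathlib import Path
import shlex


def build_cellbender_commands(sample_inputs, out_dir, use_cuda=True):
    """Build one `cellbender remove-background` command per sample.

    sample_inputs: dict mapping a sample name to the path of that sample's
                   raw count matrix (hypothetical paths, supplied by you).
    out_dir:       directory where the per-sample outputs should be written.
    """
    out_dir = Path(out_dir)
    commands = []
    # Sort by sample name so the command order is deterministic.
    for sample, input_path in sorted(sample_inputs.items()):
        cmd = [
            "cellbender", "remove-background",
            "--input", str(input_path),
            "--output", str(out_dir / f"{sample}_cellbender.h5"),
        ]
        if use_cuda:
            cmd.append("--cuda")
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Hypothetical per-sample inputs from one demultiplexed pool.
    samples = {
        "sample1": "Pool1/sample1_raw.h5",
        "sample2": "Pool1/sample2_raw.h5",
    }
    for cmd in build_cellbender_commands(samples, "cellbender_out"):
        # Print each command; pass `cmd` to subprocess.run(cmd, check=True)
        # to actually execute it.
        print(shlex.join(cmd))
```

Running each sample as its own job also means each run gets its own training curve and cell-probability plot, which makes it easier to compare against the pooled run.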