broadinstitute / CellBender

CellBender is a software package for eliminating technical artifacts from high-throughput single-cell RNA sequencing (scRNA-seq) data.
https://cellbender.rtfd.io
BSD 3-Clause "New" or "Revised" License
297 stars 54 forks source link

Cell probabilities remain close to one #220

Closed leprohonmalo closed 1 year ago

leprohonmalo commented 1 year ago

Hi,

Thank you for developing this tool ! I am trying CellBender (using latest docker image) on several human samples containing mostly T cells. We loaded 20 000 cells on the chip aiming to recover 7000-10 000.

newplot

For many samples (3/4) I get similar results where all droplets have a cell probabilities close to one.

learning

cell-prob

pca

I runned my sample with the following code :

cellbender remove-background \
    --input raw_feature_bc_matrix.h5 \
    --cellbender_default.h5 \
    --expected-cells 10000 \
    --total-droplets-included 35000 \
    --low-count-threshold 15 \
    --empty-drop-training-fraction 0.5 \
    --z-dim 100 \
    --z-layer 500 \
    --fpr 0.01 \
    --epochs 150 \
    --cuda

I saw similar issues suggesting to adjust parameters such as :
--expected-cells (tested 3000; 5000; 10 000) --total-droplets-included (tested from 10 000 to 40 000) --low-count-threshold (tested 5; 15) --empty-drop-training-fraction (tested 0.3; 0.5; 0.7) --z-dim (tested 50; 100) --z-layers (tested 250; 500)

However I could not manage to make it work. What could I do to improve the results ? Is there any issue that I am not seeing (for this specific sample the barcode rank plot maybe has an unusual shape and I recover a low amount of cells but that is not applicable to all of my samples) ? Thank you in advance for your help. Tell me if I can add any useful information.

aodainic7 commented 1 year ago

I am not a developer, but based on the graph you show, it seems like the algorithm does has not enough droplets to reach the empty plateau, it stops at 25k total droplets. Try going higher with the --total-droplets-included, like 70k.

sjfleming commented 1 year ago

Hi @leprohonmalo , Sorry you've been struggling with this: I am aiming to make the user experience better in v0.3.0... It seems like you have already tried quite a bit of parameter tweaking.

So, from looking at this sample's UMI curve (the top plot), I have kind of a hard time knowing exactly what's going on with the experiment. There are two possible scenarios:

  1. There are like 3k cells, and then there is an empty droplet plateau that goes out until approximately droplet 10k. The empty droplets have something like 500 UMI counts.
  2. There are like 10k cells, and they fall into two very different groups: those with 5k+ UMI counts, and those with about 500 UMI counts. Then the empty droplet plateau is after droplet 10k. The empty droplets have something like 10 UMI counts.

Evidence for scenario (1):

Evidence against scenario (1):

Evidence for scenario (2):

Evidence against scenario (2):

Are you measuring just RNA here, or are there antibody features or ATAC features as well?

leprohonmalo commented 1 year ago

Hi @sjfleming,

Thank you very much for your kind help. To answer some of your questions :

Did you not use all the 10x GEM barcoded beads ? We used all the 10x GEM barcoded beads. We runned this sample with a standard protocol.

Do you expect to see two populations of cells, maybe 3k of one type and 7k of another type, where the first cell type has 5k+ UMI counts and the second cell type has only 500 UMI counts? No, I would expect only one population of cells (T cells). I already analyzed single-cell data from sorted T cells in other projects and almost each time I had only one population. However once (in a previous experiment) a specific sample had a population with around 500 UMI. I ended up discarding it because other samples did not display that population and it was not clustering very well with the rest of the data. I also have 2-3 other samples that display a similar barcode rank plot in this experiment.

Is this all identical T cells, or are there other things in here like granulocytes, which can have very low UMI counts like that? No, it should be only T cells. However in some samples there is a minority of contaminating monocytes, erythrocytes or dendridtic cells. In this sample there is around 5 % of cells that I labeled as erythrocytes or platelets in downstream analysis.

The amount of ambient RNA would be very low, suggesting this was a very clean experiment. Is that consistent with your expectations? This is a bit hard to say. This sample is coming from frozen human blood. The samples were thawed, magneticly sorted with CD3, labeled with CITEseq antibody and loaded onto the chip. We made sure to have good cell viability (above 80 %). We loaded 20 000 cells into the 10X chip, hoping to get 10 000 cells downstream, but we only got 3000 for this specific sample. We processed 16 samples in this way and got high variability in number of cells (3000-10000). The cell counting was not done in an optimal way (manually) so it is not easy to judge from were the variability comes from.

Are you measuring just RNA here, or are there antibody features or ATAC features as well? There is both RNA and antibody features in this experiment. I did try --exclude-antibody-capture on a different sample (on which cellbender seemed to work better) but I got this error :

cellbender:remove-background: Inference procedure terminated early due to a NaN value in: mu, alpha, lam

The suggested fix is to reduce the learning rate.

I did not go further.

One thing I though was there could be around 3000 viable cells then around 7000 droplets with cell debris and then an empty droplet plateau after 10k. However I would maybe expect more than 10 UMI in empty droplets if that was the case. I also have no idea how much stress a cell get while going through thawing, magnetic sorting, labeling... Is it enough to trigger some apoptosis after we checked the viability ?

After what you said I would lean more towards scenario two. Comparing with mice samples I have with obvious background and an empty droplet plateau just below 100 I do not see a lot of background expression when looking at genes like HBB (although different organism is not optimal to compare). I also think it is more likely to have 2 different populations rather than few empty droplets. I will try to see if I can isolate this low UMI count population and see how it looks. Considering the two populations also make the number of cells closer to what we should expect (although I have also other samples with low amounts of cells).

Thank you again for you help. The documentation for cellbender is already quite helpfull (although I could suggest to add some example from github issues to help with troubleshooting) and you have been very active on github issues which helps a lot in trying to understand what is happenning. I really appreciate that.

sjfleming commented 1 year ago

Hi @leprohonmalo , thanks for your comments, glad to hear those kind things.

I think I tend to agree with your assessment after hearing your answers. If this was a "normal" 10x run with PBMCs, then I would expect to see an empty droplet plateau going out to around 70k droplets or so (which is what you see), and from blood with PBMC isolation, I'd also expect the experiment to be pretty clean: I would not expect to see 500 UMI counts in empty droplets (which is what I usually see from very difficult snRNA-seq preps from tough human tissues like heart or blood vessel walls).

(Another thing you might be able to check is this: in the raw data, if you look at droplets from the 500 UMI count plateau, do they express new genes that are NOT present in the droplets with 2000+ counts? If the 500 count plateau was really empty droplets, you would not expect to see new genes that didn't come from the cells in the experiment.)

So if scenario 2 is the more likely, then I would recommend trying to run cellbender as follows:

cellbender remove-background \
        --cuda \
    --input raw_feature_bc_matrix.h5 \
    --output cellbender_default.h5 \
    --expected-cells <number before the true empty droplet plateau> \
    --total-droplets-included <number after the empty droplet plateau> \
    --low-count-threshold 5 \
    --fpr 0.01

where <number before the true empty droplet plateau> would be like 10000 for the CellRanger UMI curve you show at the top, and where <number after the empty droplet plateau> would be like 35000. However, for that sample you show the cellbender output for, you would probably need larger values for both of those. It looks like there really are 25k+ cells, and the true empty droplets are not included in the analysis, because I do not see the UMI curve fall off to the ambient RNA level of around 10 or 20 or 30 counts.

The --low-count-threshold value of 5 will be important for your experiments, since the default value is 15 is too large for your experiments where the true empty droplets have only 10 counts or so.

leprohonmalo commented 1 year ago

Hi @sjfleming,

I looked at the 500 UMI population and it is definitly not a second population of cells as it displays mostly expression of mitochondrial genes (around 50 %). So I was wrong and it seems that scenario 1 is much more likely. What would you advise in this case ?

sjfleming commented 1 year ago

Well, I guess I would not be so sure that the high mitochondrial reads is evidence for scenario 1. In fact, if 50% mitochondrial reads is higher than the fraction in the really good cells, then I would suspect that the 500 UMI plateau is NOT empty droplets... precisely because there's something different going on in those 500 UMI count droplets than in the cells.

It's kind of like this sort of scenario https://github.com/broadinstitute/CellBender/issues/121#issuecomment-1379425406

It's hard to say what those droplets actually represent. In my mind, they could be either dying cells (which sometimes have a high mitochondrial read fraction) or they could be some kind of debris-containing droplets that for some reason contain a lot of mitochondria. But I think, for the purposes of CellBender's model, they are not empty. So I would think scenario 2 still applies.

leprohonmalo commented 1 year ago

Thanks for the explanation. I tested a few more runs with an increase in the parameters values you suggested and I am starting to get somewhere ! (also thanks for your suggestion @aodainic7. At first I thought it was too much and that 40000 already looked like in the empty droplet plateau from cellranger plot, but it proved quite relevant) Parameters :

cellbender remove-background \
        --input raw_matrix/CART_P88_raw_feature_bc_matrix.h5 \
        --output results/CART_P88_cellbender_50k_25k_3.h5 \
        --expected-cells 25000 \
        --total-droplets-included 50000 \
        --low-count-threshold 3 \
        --fpr 0.01 \
        --epochs 150 \
        --cuda

image

image

However the ELBO plot is not perfect, do you think that I can improve it with parameters like learning-rate ? Is it fine to go on with it ? I would say that previous epochs still converged.

sjfleming commented 1 year ago

So I can tell you what's happening here... and it's something I'm currently struggling to fix.

The turnaround in the learning curve is due to those initial few droplets with the highest counts getting called "empty" by cellbender. This is something that ideally should NOT happen. And I usually see a kind of dip in the learning curve when it does happen. I'm working on incorporating some more regularization to make sure it doesn't happen in future versions.

(The output -- apart from those few high count droplets -- should be fine though. ...apart from the handful of high-count droplets that are actually being called "empty": those droplets will have zero counts in the output, which is probably not desirable... though those droplets probably represent doublets anyway... but still it would be better if cellbender would keep them.)

For now, the options would be (1) to reduce the learning rate, maybe --learning-rate 2e-5, or (2) to reduce the number of epochs of training and just stop earlier, maybe --epochs 100. (Fewer epochs won't necessarily be equivalent to stopping early on the previous learning curve though... we use a learning rate scheduler, so changing the number of epochs changes the learning rate schedule, and it can be a little less predictable than you'd think.)

leprohonmalo commented 1 year ago

Hi, I tried what you suggested and it worked like a charm !

CART_P88_cellbender_50k_25k_5_2e5.pdf

Thank you very much for your help and your explanations. I hope I will be able to run the rest of my samples now. I think I'm good with closing the issue.