No cells found when excluding the ADT

aodainic7 commented 1 year ago

Hey everyone, I am testing your tool to check for contamination on my scRNAseq+CITEseq experiment. I have one issue and some questions:

When I run the pipeline on all features, including the antibody capture features, everything works. Once I add --exclude-antibody-capture, I get the error 'No cells found! Cannot compute expected FPR.' This is the console output:

cellbender:remove-background: Command:
cellbender remove-background --input CellRanger/C120_batch3_5/outs/multi/count/raw_feature_bc_matrix.h5 --output /CellBender/mytest/batch3_5_cellbender_out_v2_wo_adt.h5 --cuda --expected-cells 25000 --total-droplets-included 130000 --fpr 0.01 --exclude-antibody-capture --epochs 200
cellbender:remove-background: 2023-06-01 15:56:47
cellbender:remove-background: Running remove-background
cellbender:remove-background: Loading data from file /CellRanger/C120_batch3_5/outs/multi/count/raw_feature_bc_matrix.h5
cellbender:remove-background: CellRanger v3 format
cellbender:remove-background: Trimming dataset for inference.
cellbender:remove-background: Excluding 143 features that correspond to antibody capture.
cellbender:remove-background: Including 28676 genes that have nonzero counts.
cellbender:remove-background: Prior on counts in empty droplets is 89
cellbender:remove-background: Prior on counts for cells is 3693
cellbender:remove-background: Excluding barcodes with counts below 44
cellbender:remove-background: Using 25000 probable cell barcodes, plus an additional 105000 barcodes, and 22494 empty droplets.
cellbender:remove-background: Largest surely-empty droplet has 50 UMI counts.
cellbender:remove-background: Running inference...
cellbender:remove-background: Inference procedure terminated early due to a NaN value in: mu, lam
The suggested fix is to reduce the learning rate.
cellbender:remove-background: 2023-06-01 15:57:20
cellbender:remove-background: Preparing to write outputs to file...
Traceback (most recent call last):
File "/.conda/envs/CellBender/bin/cellbender", line 33, in <module>
sys.exit(load_entry_point('cellbender', 'console_scripts', 'cellbender')())
File "/CellBender/cellbender/base_cli.py", line 101, in main
cli_dict[args.tool].run(args)
File "/CellBender/cellbender/remove_background/cli.py", line 109, in run
main(args)
File "/CellBender/cellbender/remove_background/cli.py", line 204, in main
run_remove_background(args)
File "/CellBender/cellbender/remove_background/cli.py", line 174, in run_remove_background
save_plots=True)
File "/CellBender/cellbender/remove_background/data/dataset.py", line 534, in save_to_output_file
inferred_count_matrix = self.posterior.mean
File "/CellBender/cellbender/remove_background/infer.py", line 58, in mean
self._get_mean()
File "/.conda/envs/CellBender/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/CellBender/cellbender/remove_background/infer.py", line 357, in _get_mean
raise ValueError('No cells found!  Cannot compute expected FPR.')

Here is the output when I do not omit the ADT, and the pipeline works: batch3_5_cellbender_out_v2.pdf

Could you please take a look at the batch3 outputs and tell me whether the learning curves look acceptable, given the drops in the middle? Should I increase the number of expected cells?

batch3_1_cellbender_out_v2.pdf batch3_2_cellbender_out_v2.pdf batch3_3_cellbender_out_v2.pdf batch3_4_cellbender_out_v2.pdf

I compared the cellranger and cellbender results for batch3_1 to assess how much the counts get corrected. For the gene that differs most between cellranger and cellbender, the difference seems to come from a single cell. Is this because contamination is low, or because the learning was not optimal? See the attached PDFs for IGKV1-33. The RAW counts are from cellranger and the RNA counts are from the cellbender output. bender_vs_ranger_counts_v2.pdf features_bender_vs_ranger_v2.pdf

Cheers!

sjfleming commented 1 year ago

Hi @aodainic7 , I think I have a suggestion that can help you!

So right now, cellbender is identifying what I believe to be cells AND empty droplets as "cells". I think those regions on the UMI curve with ~300 counts (like batch3_2 from droplet 30k to droplet 100k) are the empty droplets. So you have several hundred counts of ambient RNA in empty droplets, and cellbender can probably help out a lot!

But currently cellbender is not identifying the empty droplets correctly. This can probably be fixed by changing two things:

1. Make --total-droplets-included smaller. It should be pretty much the first droplet where you're 100% sure everything past that is empty.
2. Use the --low-count-threshold parameter. This will help cellbender more easily identify the empty droplets. In your case, I would set the parameter to 100, telling cellbender that any droplet with < 100 UMI counts is "past the empty droplet plateau" and should be ignored completely. Those droplets probably represent cell barcode sequencing errors, and they are not the "real" empty droplets.

So try this:

cellbender remove-background \
    --input CellRanger/C120_batch3_5/outs/multi/count/raw_feature_bc_matrix.h5 \
    --output /CellBender/mytest/batch3_5_cellbender_out_v2.h5 \
    --cuda \
    --expected-cells 20000 \
    --total-droplets-included 35000 \
    --fpr 0.01 \
    --low-count-threshold 100
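
To make the two suggestions above concrete, here is a minimal sketch of how one could sanity-check candidate values against a ranked UMI curve before rerunning. It assumes that barcodes below the low-count threshold are discarded entirely and that the top-ranked remaining barcodes up to --total-droplets-included are the ones analyzed. The function name and the synthetic counts are illustrative only and are not part of CellBender.

```python
import random


def summarize_umi_curve(umi_per_barcode, low_count_threshold, total_droplets_included):
    """Report which barcodes a pair of candidate settings would keep.

    Illustrative helper (not CellBender code): ranks barcodes by UMI count,
    drops everything below the threshold, then keeps the top-ranked ones.
    """
    ranked = sorted(umi_per_barcode, reverse=True)  # the "UMI curve"
    kept = [c for c in ranked if c >= low_count_threshold]
    analyzed = kept[:total_droplets_included]
    return {
        "barcodes_above_threshold": len(kept),
        "barcodes_analyzed": len(analyzed),
        "min_umi_analyzed": analyzed[-1] if analyzed else None,
    }


# Synthetic data: 100 cells, 400 empty droplets (~300 UMIs),
# and 500 very low-count barcodes (e.g. barcode sequencing errors).
rng = random.Random(0)
umi = (
    [rng.randint(2000, 6000) for _ in range(100)]
    + [rng.randint(250, 350) for _ in range(400)]
    + [rng.randint(1, 49) for _ in range(500)]
)

summary = summarize_umi_curve(umi, low_count_threshold=100, total_droplets_included=300)
print(summary)
```

If min_umi_analyzed sits on the empty-droplet plateau (around 300 counts in the synthetic data), the analyzed range reaches into droplets that are surely empty, which is the goal of the first suggestion.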

aodainic7 commented 1 year ago

Hey Stephen, thanks for the input. I increased the threshold and got a decent correction. The results look very promising. I subsetted the T cells and compared the expression of the most changed genes, and to my surprise the genes that changed most are the contaminating ones:

The same goes for ADT: the B cell markers are reduced on T cells, but the T cell markers are not (which is amazing):

I also see a reduction in HTO counts. Should I exclude these from the correction? What is your experience?

Here is the mean change in counts per cell, computed as (1 - cellbender filtered counts / cellranger counts) * 100, per assay:

Thanks in advance. Cheers, Alex
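
The percent-change metric described in the comment above can be sketched as follows. This assumes the percentage is computed per cell from that cell's total counts in one assay and then averaged over cells. The thread does not say whether that or a pooled ratio was used, so the function names and the toy numbers are hypothetical.

```python
def pct_diff(raw_total, corrected_total):
    """Percent of counts removed: (1 - corrected / raw) * 100."""
    if raw_total == 0:
        return 0.0  # nothing to remove from an all-zero cell
    return (1.0 - corrected_total / raw_total) * 100.0


def mean_pct_diff_per_cell(raw_totals, corrected_totals):
    """Average the per-cell percent change over matched cells."""
    diffs = [pct_diff(r, c) for r, c in zip(raw_totals, corrected_totals)]
    return sum(diffs) / len(diffs)


# Toy per-cell totals for one assay (same cell order in both lists).
raw = [100, 200, 50]
corrected = [90, 150, 50]
result = mean_pct_diff_per_cell(raw, corrected)
print(round(result, 2))
```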

sjfleming commented 1 year ago

Hi @aodainic7 , are those HTOs that you mention "hashtag oligos" like this kind of thing?

If this is what you're talking about, I'd be interested to hear more about your thoughts on this. I have not used these myself, and unfortunately I don't have any experience. The idea is to be able to pool cells across donors by having an (antibody-labeled) oligo whose barcode encodes donor identity, right? And then you load cells from multiple donors into the same "sample", right?

If the HTOs are subject to the same sort of noise mechanisms as the antibody features (and I would expect this to be the case), then maybe running CellBender on those HTO features does make sense.

What I'd do if it were me would be to compare the raw HTO counts and the CellBender HTO counts. And specifically I'd be really interested to see if the conclusions you draw about demultiplexing cells back to their specific donors end up being the same or different when CellBender is used. For example, is it easier for the demultiplexing algorithm to do its job after CellBender cleanup? Does CellBender go too far? Not make a big difference?

I would think it might be kind of like the human and mouse cell benchmark we use: you might see that donor assignment for singlet cells becomes more obvious, but you'd hope to see that true doublets remain doublets in terms of HTO counts after cellbender.
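
One rough way to run the comparison suggested above is to assign each barcode to a donor from the raw HTO counts and again from the CellBender HTO counts, then measure how often the two assignments agree. The sketch below uses a naive "largest HTO count wins" rule purely as a stand-in for a real demultiplexing algorithm. It does not detect doublets, and all names and numbers are made up for illustration.

```python
def top_hto(hto_counts):
    """Return the HTO with the largest count, or None if all counts are zero."""
    if not hto_counts or max(hto_counts.values()) == 0:
        return None
    return max(hto_counts, key=hto_counts.get)


def assignment_agreement(raw_by_cell, corrected_by_cell):
    """Fraction of shared barcodes whose naive assignment is unchanged."""
    shared = raw_by_cell.keys() & corrected_by_cell.keys()
    if not shared:
        return float("nan")
    same = sum(top_hto(raw_by_cell[b]) == top_hto(corrected_by_cell[b]) for b in shared)
    return same / len(shared)


# Toy HTO counts per barcode, before and after background removal.
raw = {
    "AAAC": {"HTO1": 120, "HTO2": 15},
    "AAAG": {"HTO1": 40, "HTO2": 35},
    "AAAT": {"HTO1": 10, "HTO2": 200},
}
corrected = {
    "AAAC": {"HTO1": 110, "HTO2": 0},
    "AAAG": {"HTO1": 20, "HTO2": 33},
    "AAAT": {"HTO1": 0, "HTO2": 195},
}
agreement = assignment_agreement(raw, corrected)
print(agreement)
```

Barcodes whose assignment flips (like "AAAG" in the toy data) are the ones worth inspecting by hand, since they may be true doublets or cells where too much was removed.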

Okay actually, I had another thought that complicates this, although I'll leave what I've written above:

The HTOs do violate some of the assumptions of CellBender: namely the assumption that measured features are all part of a "cell state". Like RNA and ADT features are all part of the same picture of biological expression of that cell. And we learn a prior for that cell state using an autoencoder. This prior helps us do a better job denoising. The thing is... donor is NOT correlated with cell state. Any prior we learn about HTOs will at best be random, and at worst could be misleading (though probably not).
CellBender might do the right thing... which would be to just try to uniformly subtract the same number of "ambient" HTO counts from each cell, regardless of cell type. But the more I think about it, the more I suspect that it might be better to just not analyze the HTO features with CellBender.
I see for HTOs 1-5 above, you've got a range of CellBender "Pct_Diff". Do these percentages match up with the proportion of cells from each donor? That is to say... did donor 5 have the most cells in the sample? And did donor 1 have the least? (And did donor 1 have like 1/3 the cells of donor 5?)
Another way to look at this would be to look at Pct_Diff breakdown by (donor, cell type) rather than just (donor). If you see that donor HTOs are being "unevenly" removed from different cell types (like if B cells have all the donor 1 tags removed, and T cells have all the donor 2 tags removed... assuming this is not expected based on your experimental setup), then this is a problem, and I'd use the raw data for the HTO features.
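
The (donor, cell type) breakdown suggested above could look something like this sketch. It pools counts within each group before taking the ratio, which is one possible choice and not necessarily how Pct_Diff was computed in the plots. The record layout (keys "donor", "cell_type", "raw", "corrected") is a hypothetical format chosen for the example.

```python
from collections import defaultdict


def pct_diff_by_group(cells):
    """Percent of HTO counts removed, pooled per (donor, cell type) group."""
    raw_sum = defaultdict(float)
    corrected_sum = defaultdict(float)
    for cell in cells:
        key = (cell["donor"], cell["cell_type"])
        raw_sum[key] += cell["raw"]
        corrected_sum[key] += cell["corrected"]
    # Skip groups with no raw counts to avoid dividing by zero.
    return {
        key: (1.0 - corrected_sum[key] / raw_sum[key]) * 100.0
        for key in raw_sum
        if raw_sum[key] > 0
    }


# Toy records: counts of each cell's own donor HTO, raw vs. CellBender.
cells = [
    {"donor": "donor1", "cell_type": "B", "raw": 100, "corrected": 50},
    {"donor": "donor1", "cell_type": "B", "raw": 100, "corrected": 50},
    {"donor": "donor1", "cell_type": "T", "raw": 200, "corrected": 100},
    {"donor": "donor2", "cell_type": "T", "raw": 80, "corrected": 80},
]
breakdown = pct_diff_by_group(cells)
for key, value in sorted(breakdown.items()):
    print(key, round(value, 1))
```

Large differences between cell types within the same donor would be the "uneven" removal pattern described above.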

aodainic7 commented 1 year ago

Hello Stephen, yes, exactly the same as in the publication: hashtag oligos for multiplexing. I looked into the questions you asked. I did not see a strong effect in the smaller cell groups, only in the larger ones. The counts get "decontaminated" for one specific HTO, while the rest remain basically unchanged: Interestingly, the changes stay more or less consistent across cell types in the same sample (which is amazing!). Here is an example of one donor: The results look promising. Are there any other critical points I should check?

I have a suggestion: some users may want to exclude the HTOs from background removal, so it might be worth adding an option for that when running cellbender. Currently there is only --exclude-antibody-capture, so maybe add --exclude-hashtag-oligos. Cheers!

sjfleming commented 1 year ago

Hi @aodainic7 , nothing else comes to mind, I don't think. I do think that excluding the HTOs might make more sense in your case. In v0.3.0 I will be changing --exclude-antibody-capture to --exclude-feature-types where the user can specify any valid feature type. (Currently it has to be one of the types allowed by 10x, which is ['Gene Expression', 'Antibody Capture', 'CRISPR Guide Capture', 'Custom', 'Peaks'].) When you create this dataset, do you run it through 10x CellRanger to get a count matrix? Does the feature_type show up as Custom?
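
To answer the feature_type question above, one could tally the feature types stored in the count matrix. The sketch below only does the tallying and takes the labels as input. Reading them from the file is left as a comment because it assumes the CellRanger v3 HDF5 layout (a `matrix/features/feature_type` dataset) and the h5py package. The function name is hypothetical.

```python
from collections import Counter


def count_feature_types(feature_types):
    """Tally feature-type labels, accepting either bytes or str entries."""
    decoded = [
        ft.decode("utf-8") if isinstance(ft, bytes) else ft
        for ft in feature_types
    ]
    return Counter(decoded)


# Assumption: with h5py installed, the labels could be loaded roughly like
#   with h5py.File("raw_feature_bc_matrix.h5", "r") as f:
#       labels = f["matrix/features/feature_type"][:]
# Here a small hand-written list stands in for the real data.
labels = [b"Gene Expression"] * 3 + [b"Antibody Capture"] * 2
tally = count_feature_types(labels)
print(dict(tally))
```

Whatever label the HTO features carry in that tally is the value that would need to be excluded.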

sjfleming commented 1 year ago

That --exclude-feature-types input argument is now part of the v0.3.0 release.

broadinstitute / CellBender

No cells found when excluding the ADT #223