GLLMU opened this issue 1 year ago
Hi @GLLMU, thanks for your questions. I will try to answer them here and provide a few notes.
> The GPU doesn't work on my computer, so each sample took 12-48 hours per trial, and it is really time consuming.
You might be interested in trying this, just for your information: #242
> Do you think it is normal or abnormal?
It is abnormal to have to try so many different parameter settings, sorry about this!
> ...do I need to use the same learning rate (e.g. 0.0000125, if the results are good) for all 4 samples if I want to integrate them in Seurat for further analysis?
This is a great question! And the answer is "no": you do not need to use the same learning rate for different samples. (You should use the same FPR!) The other parameters, like total droplets or expected cells, or even the learning rate, can definitely be different for different samples, and you can then jointly analyze the datasets together downstream.
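Just to make the bookkeeping concrete, here is a minimal sketch of running the samples with a shared FPR but per-sample settings (the per-sample values below are purely illustrative, not recommendations):

```python
# Minimal sketch: run CellBender per sample with the same --fpr but
# per-sample --expected-cells and --learning-rate.
import subprocess

samples = {
    "Sample1": {"expected_cells": 18487, "learning_rate": "2.5e-5"},
    "Sample2": {"expected_cells": 13261, "learning_rate": "1.25e-5"},
}

for name, params in samples.items():
    cmd = [
        "cellbender", "remove-background",
        "--input", f"CellBender_Input/{name}/raw_feature_bc_matrix.h5",
        "--output", f"CellBender_Output/{name}_CellBender_output.h5",
        "--expected-cells", str(params["expected_cells"]),
        "--total-droplets-included", "50000",
        "--fpr", "0.01",                             # keep the FPR the same across samples
        "--learning-rate", params["learning_rate"],  # this one can differ per sample
    ]
    subprocess.run(cmd, check=True)
```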
Q1 and Q2
Actually, I wonder if you can try the following, just to see if it would work... CellBender v0.3.0 has methods for automatically finding `--expected-cells` and `--total-droplets-included`. I wonder if it would work well in your case. You can try with these commands:
```
cellbender remove-background \
    --input CellBender_Input/Sample1/raw_feature_bc_matrix.h5 \
    --output CellBender_Output/Sample1_CellBender_output.h5 \
    --low-count-threshold 100 \
    --learning-rate 2.5e-5

cellbender remove-background \
    --input CellBender_Input/Sample2/raw_feature_bc_matrix.h5 \
    --output CellBender_Output/Sample2_CellBender_output.h5 \
    --low-count-threshold 100 \
    --learning-rate 2.5e-5
```
The `--low-count-threshold 100` is there to tell CellBender that all droplets with counts < 100 should be ignored (which would be appropriate for sample 1 and sample 2 above... the droplets with < 100 UMI counts are way out past the "empty droplets"). This will help the auto-finding of expected cells and total droplets work better.
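If you want to double-check that 100 is a safe threshold before launching a long run, a quick look at the raw per-droplet counts is enough. A minimal sketch, assuming scanpy and matplotlib are installed and the input is a CellRanger-format raw h5:

```python
# Quick check of the proposed --low-count-threshold 100 against the raw UMI curve.
import numpy as np
import matplotlib.pyplot as plt
import scanpy as sc

adata = sc.read_10x_h5("CellBender_Input/Sample1/raw_feature_bc_matrix.h5")
umi_per_droplet = np.asarray(adata.X.sum(axis=1)).ravel()

n_below = int((umi_per_droplet < 100).sum())
print(f"{n_below:,} of {umi_per_droplet.size:,} droplets have < 100 UMIs and would be ignored")

# Log-log rank plot: the < 100 UMI droplets should sit well past the empty-droplet plateau.
ranked = np.sort(umi_per_droplet)[::-1]
plt.loglog(np.arange(1, ranked.size + 1), ranked)
plt.axhline(100, linestyle="--", color="grey")
plt.xlabel("Droplet rank")
plt.ylabel("UMI counts")
plt.savefig("Sample1_umi_rank_plot.png", dpi=150)
```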
But if that does not work great either, then what you've done for sample 1 seems quite reasonable: you decreased the learning rate until you got a really good-looking learning curve. For sample 2, it seems to be a bit trickier... decreasing the learning rate did not solve the problem. If the auto-finding of parameters does not work well, you might try this:
```
cellbender remove-background \
    --input CellBender_Input/Sample2/raw_feature_bc_matrix.h5 \
    --output CellBender_Output/Sample2_CellBender_output.h5 \
    --expected-cells 10000 \
    --total-droplets-included 30000 \
    --learning-rate 2e-5
```
Q3
The gene warning is definitely something to consider. It looks like AY036118 really does decrease its counts quite a bit: 50%-70% of the counts are being removed from cell-containing droplets (the `fraction_removed_cells` column).
It looks like it's being removed quite a bit from both datasets. (Actually the top three genes removed are the same in the two samples.) This is typically a good sign, because CellBender is independently coming to the conclusion that the same genes are high-noise in two separate datasets (which might make sense if they are similar types of samples or were prepared in similar ways).
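If you want to line the two samples up side by side, something like the sketch below works. It assumes you have exported the per-gene summary table (the one containing `fraction_removed_cells`) from each sample's report to a CSV; the file and column names here are hypothetical placeholders, not files CellBender writes for you.

```python
# Sketch only: compare per-gene removal fractions between two samples,
# given hypothetical CSV exports of the per-gene summary tables.
import pandas as pd

s1 = pd.read_csv("Sample1_gene_removal_table.csv", index_col="gene")
s2 = pd.read_csv("Sample2_gene_removal_table.csv", index_col="gene")

merged = s1[["fraction_removed_cells"]].join(
    s2[["fraction_removed_cells"]], lsuffix="_sample1", rsuffix="_sample2", how="inner"
)
print(merged.sort_values("fraction_removed_cells_sample1", ascending=False).head(10))
```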
If it were me, I would also try to get a feel for whether half of the counts of AY036118 being noise is "believable". One way to get a feel for this is to eyeball the columns `n_raw` and `n_raw_cells`. `n_raw` is the total counts in the input dataset, and `n_raw_cells` is the total counts in the input dataset for droplets that CellBender determined are non-empty (i.e. cells). In this case, for sample 1, I see 250k counts in the whole dataset and only 113k counts in cells. So about half of the counts of AY036118 are in empty droplets if I just look at the raw data. This certainly makes it more plausible that about half of the counts of AY036118 might actually be noise, and that's pretty close to what CellBender is finding. So I think I'd feel like the output is pretty reasonable.
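If you want to reproduce that back-of-the-envelope check directly from the raw data, here is a rough sketch (not CellBender's own code). It assumes scanpy is installed and that the run wrote a cell-barcode CSV next to the output h5; the exact filename may differ between CellBender versions.

```python
# What fraction of a gene's raw counts sit in droplets that CellBender called as cells?
import numpy as np
import pandas as pd
import scanpy as sc

gene = "AY036118"
raw = sc.read_10x_h5("CellBender_Input/Sample1/raw_feature_bc_matrix.h5")
raw.var_names_make_unique()

# Barcodes that CellBender called as cells (filename assumed; check your output directory).
cell_barcodes = pd.read_csv(
    "CellBender_Output/Sample1_CellBender_output_cell_barcodes.csv", header=None
)[0].values

gene_counts = np.asarray(raw[:, gene].X.todense()).ravel()
n_raw = gene_counts.sum()
n_raw_cells = gene_counts[raw.obs_names.isin(cell_barcodes)].sum()

print(f"{gene}: {n_raw:.0f} counts total, {n_raw_cells:.0f} in called cells; "
      f"~{1 - n_raw_cells / n_raw:.0%} of its counts sit in empty droplets in the raw data")
```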
I'd encourage you to do whatever other sanity checks you can think of, but if you're satisfied it's plausible, I would go ahead and use the results without worrying about the warning. The warning is there to encourage people to stop and double-check, but if the double-check looks okay, then the warning should be ignored.
Hi @GLLMU,
How did you generate these?
@sjfleming Thanks for the great tool; I really appreciate the amazing calculations. When I used the tool on my 4 samples, I also ran into some problems.
Q1: 3 different learning rates for sample 1

Cellranger count report for sample 1:

I used the cell number from Cellranger count for `--expected-cells`, and the `--total-droplets-included` number was inferred from the UMI curve according to this protocol: https://www.10xgenomics.com/resources/analysis-guides/background-removal-guidance-for-single-cell-gene-expression-datasets-using-third-party-tools

The script I used for CellBender:

```
cellbender remove-background \
    --input CellBender_Input/Sample1/raw_feature_bc_matrix.h5 \
    --output CellBender_Output/Sample1_CellBender_output.h5 \
    --expected-cells 18487 \
    --total-droplets-included 50000 \
    --fpr 0.01 \
    --epochs 150 \
    --learning-rate 1e-4   # (tried different values)
```

1) Like others have discussed, I used the default learning rate 1e-4 and got the following result:
2) I then tried learning rate 0.00005 to repeat the analysis, but I still got the same problem:
3) I tried learning rate 0.000025 to repeat the analysis again, and this time I got a better result:
So a learning rate of 0.000025 seems to fit this sample.
Q2: learning rate 0.000025 doesn't fit sample 2

Cellranger count report for sample 2:
The script I used for CellBender:

```
cellbender remove-background \
    --input CellBender_Input/Sample2/raw_feature_bc_matrix.h5 \
    --output CellBender_Output/Sample2_CellBender_output.h5 \
    --expected-cells 13261 \
    --total-droplets-included 50000 \
    --fpr 0.01 \
    --epochs 150 \
    --learning-rate 1e-4   # (tried different values)
```
1) For the default learning rate 1e-4, I got the following result with the same warning:
2) Then I tried learning rate 0.000025 directly for sample 2; however, the result still has the same warning:
So I got a worse result for sample 2, and I am now running learning rate 0.0000125 for sample 2 and waiting for the results. However, the GPU doesn't work on my computer, so each sample took 12-48 hours per trial, and it is really time consuming. Do you think it is normal or abnormal? The cell numbers are around 18k for sample 1 and 13k for sample 2. So my question is: do I need to use the same learning rate (e.g. 0.0000125, if the results are good) for all 4 samples if I want to integrate them in Seurat for further analysis? It is really time consuming.
Q3: One gene-decreased warning in all samples

There is only one gene-decreased warning, and it is the same gene in all 4 samples:

> WARNING: The expression of the highly-expressed gene AY036118 decreases quite markedly after CellBender. Check to ensure this makes sense!

1) For sample 1 with learning rate 0.000025:
2) For sample 2 with learning rate 0.000025:

Does this gene warning affect the analysis results? Do I need to make any changes and rerun the samples? Or can I use the output for further analysis and ignore this gene warning?
So these are my three main questions, and I plan to use Scrublet and Seurat in the downstream analysis with the output of CellBender. Could you please take a look at these problems when you have time? Thank you so much in advance.