Closed: dyinboisry4u closed this issue 1 year ago
Is this a plot where each point is
adata = ...  # CellRanger or CellBender [cell, gene] AnnData, for all droplets including empty droplets
mt_genes = ...  # your mitochondrial gene list
y_value = adata[:, mt_genes].X.sum() / adata.X.sum()
?
I guess one of my questions would be: does the CellRanger calculation include the counts in empty droplets?
Apologies for not being clear. The AnnData objects are actually built from xxx_cellbender_filtered.h5 (CellBender) and xxx/outs/filtered_feature_bc_matrix (CellRanger). Both have already gone through cell calling, so the AnnData should not include empty droplets.
Here is how I made this plot:
For the CellBender AnnData, I remove the "cells" with no counts, and then I use scanpy.pp.calculate_qc_metrics to calculate the MT gene ratio:
import scanpy as sc
import seaborn as sns
# flag mitochondrial genes by their 'MT-' prefix, then compute per-cell QC metrics
adataAll.var['mt'] = adataAll.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adataAll, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
Then I calculate the median MT gene ratio for each sample and draw the plot:
mitoMedian = adataAll.obs.groupby('SampleID')['pct_counts_mt'].median().to_frame('PctMtGeneMedian')
sns.scatterplot(data=mitoMedian, x="SampleID", y="PctMtGeneMedian", s=50, hue="SampleID", palette=myPalette)
Thanks for your reply~
Hi @dyinboisry4u , okay thanks for that explanation. I think I do have a hypothesis! But it would need to be tested :)
My hypothesis is that the extra "non-empty" droplets that are being retained in the CellBender filtered output (as compared to CellRanger's cell calling algorithm) are actually droplets with high MT fraction. This would in turn make it look like the median MT fraction per sample had actually gone up after CellBender.
You can see an example here where some of the low-UMI count "non-empty" droplets that CellBender finds are actually high MT-fraction droplets. In that experiment, they might represent dying cells.
If you consider the same set of droplets in both the CellRanger data and the CellBender data, then you will see that the mitochondrial read fraction only goes down, since CellBender only subtracts counts from the count matrix; it never adds counts.
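A minimal sketch of that comparison, assuming dense toy count matrices and barcodes that are formatted identically in both outputs. The names mt_fraction, barcodes_cr, barcodes_cb, counts_cr, counts_cb and is_mt are hypothetical and the numbers are made up for illustration; with real AnnData objects the same idea would be applied to obs_names and X.

```python
import numpy as np

def mt_fraction(counts, is_mt):
    """Per-droplet fraction of counts that come from mitochondrial genes."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum(axis=1)
    mt = counts[:, is_mt].sum(axis=1)
    # Droplets with zero total counts get a fraction of 0 instead of NaN.
    return np.divide(mt, total, out=np.zeros_like(total), where=total > 0)

# Toy data (assumption): 3 genes, the first one is mitochondrial.
is_mt = np.array([True, False, False])
barcodes_cr = np.array(["AAA", "CCC", "GGG"])              # CellRanger cell calls
counts_cr = np.array([[10, 5, 5], [20, 0, 20], [2, 1, 1]])
barcodes_cb = np.array(["CCC", "AAA", "TTT"])              # CellBender cell calls
counts_cb = np.array([[15, 0, 20], [6, 5, 5], [9, 0, 1]])

# Droplets called by both tools, with row indices into each matrix.
shared, idx_cr, idx_cb = np.intersect1d(
    barcodes_cr, barcodes_cb, return_indices=True
)
frac_cr = mt_fraction(counts_cr[idx_cr], is_mt)
frac_cb = mt_fraction(counts_cb[idx_cb], is_mt)

# Droplets that only CellBender kept as "non-empty".
extra = ~np.isin(barcodes_cb, barcodes_cr)
frac_extra = mt_fraction(counts_cb[extra], is_mt)

for bc, a, b in zip(shared, frac_cr, frac_cb):
    print(f"{bc}: CellRanger {a:.3f} -> CellBender {b:.3f}")
print("extra CellBender droplets:", barcodes_cb[extra], frac_extra)
```

Comparing frac_cr with frac_cb isolates the effect of count removal, while frac_extra shows whether the additional droplets are the ones pulling the per-sample median up.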
Hi @sjfleming, your hypothesis is correct. Here is my test: I took the samples with the higher MT ratio, extracted the barcodes of their extra "non-empty" droplets, and calculated the MT gene fraction for those droplets. They do indeed have a high MT fraction:
I also have some questions:
low-UMI counts cell types (e.g. Neutrophils), what should I do?
iii. Additionally, should I try to make CellBender include all of the CellRanger cells? What do you think of the cells that are unique to CellRanger?
Thanks!~
Okay, yes, in that case, I think we have a satisfactory answer to the MT part. As for your other questions:
--total-droplets-included 9000. This would mean the ambient RNA is more similar to sample 1. You might disagree... but you could try --expected-cells 10000 and --total-droplets-included as something like 30k or 40k. I would need to zoom in on that UMI curve, but you'd want to include droplets down to around 100 UMI counts.
@sjfleming, thanks for your detailed explanation!
I also like to look at the entropy of gene expression as a QC metric. This is unpublished, as far as I know, but I am sure somebody else has thought of it.
I just found this: https://hal.science/hal-03378505. By the way, how do you calculate the entropy of gene expression?
but you could try --expected-cells 10000 and --total-droplets-included as something like 30k or 40k.
Yes, for samples ii. and iii., I had tried a larger --expected-cells and --total-droplets-included until my collaborator told me that these samples contain lots of "cell fragments" (seen before CellBender, possibly at the 10x platform QC step). Here is the earlier result:
The rank plot indeed shows an elbow, but if I use a larger --expected-cells and --total-droplets-included, the PCA embedding plot looks like a "tangle of black". And for sample iii., the test set curve looks super wobbly.
Re: entropy, I use the Python package called ndd (https://github.com/simomarsili/ndd) to compute it for each droplet.
I actually think those runs with larger numbers of --total-droplets-included are better. Usually we like to see that CellBender is seeing some empty droplets toward the end of the total droplets included.
Even if some of those droplets end up being "cell fragments" that deserve to be eliminated during cell QC, it is not CellBender that should be eliminating them. From the perspective of CellBender, those are non-empty droplets.
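As a rough starting point for picking --total-droplets-included, the earlier suggestion to include droplets down to around 100 UMI counts can be turned into a count of barcodes in the raw (unfiltered) matrix. This is only a sketch: count_droplets_above and umi_per_barcode are hypothetical names, the numbers are toy values, and the threshold of 100 was suggested for one particular sample, so it should be checked against each sample's own UMI curve.

```python
import numpy as np

def count_droplets_above(umi_per_barcode, min_umis=100):
    """Number of barcodes whose total UMI count is at least min_umis."""
    umi = np.asarray(umi_per_barcode)
    return int(np.count_nonzero(umi >= min_umis))

# Toy per-barcode UMI totals from a raw matrix (assumption, not real data).
umi_per_barcode = np.array([5200, 3100, 1200, 400, 150, 100, 99, 40, 3, 0])

n_droplets = count_droplets_above(umi_per_barcode, min_umis=100)
print(n_droplets)
```

The resulting number is a lower bound to start from; the value actually passed should still reach far enough down the rank plot that some clearly empty droplets are included.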
Hi, thanks again for your reply!
Now I understand, and I'll use a larger --total-droplets-included. Last two quick questions here:
For ndd, should I use a normalized count matrix or just the raw count matrix as input, like this:
# raw counts matrix
np.apply_along_axis(ndd.entropy, axis=1, arr=adata.X.todense(), k=adata.n_vars)
Sorry for not being familiar with Bayesian entropy estimation :(
For sample iii., the test set curve looks super wobbly. Is this acceptable?
adata.X.todense(), which is memory-hungry. As long as that fits in memory, this is a fine way to do it.
Train and test curve wobbliness has been addressed to a large extent in v0.3.0. Hopefully that holds true when people test it out.
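On the memory cost of adata.X.todense() mentioned above: if the dense matrix does not fit in memory, one possible workaround is to compute a per-droplet entropy directly from the CSR arrays. Note that this sketch swaps in the simple plug-in (maximum-likelihood) Shannon entropy, not the Bayesian estimator that ndd implements, so the values will differ from ndd.entropy. The function name plugin_entropy_csr is hypothetical; for a SciPy CSR matrix X, the inputs would be X.data and X.indptr.

```python
import numpy as np

def plugin_entropy_csr(data, indptr):
    """Plug-in Shannon entropy (in nats) for each row of a CSR count matrix.

    Uses H = log(N) - (1/N) * sum_g c_g * log(c_g), where c_g are the
    counts in a row and N is the row total. Empty rows get entropy 0.
    """
    data = np.asarray(data, dtype=float)
    indptr = np.asarray(indptr)
    n_rows = len(indptr) - 1
    # Row index of every stored entry.
    row_ids = np.repeat(np.arange(n_rows), np.diff(indptr))
    totals = np.bincount(row_ids, weights=data, minlength=n_rows)
    # Guard against explicitly stored zeros: 0 * log(1) contributes 0.
    safe = np.where(data > 0, data, 1.0)
    c_log_c = np.bincount(row_ids, weights=data * np.log(safe), minlength=n_rows)
    entropy = np.zeros(n_rows)
    nonzero = totals > 0
    entropy[nonzero] = np.log(totals[nonzero]) - c_log_c[nonzero] / totals[nonzero]
    return entropy

# Toy CSR arrays (assumption): row 0 = four genes with 1 count each,
# row 1 = one gene with 4 counts, row 2 = an empty droplet.
data = np.array([1, 1, 1, 1, 4])
indptr = np.array([0, 4, 5, 5])

entropy = plugin_entropy_csr(data, indptr)
print(entropy)
```

Because the plug-in estimate depends only on the proportions within each droplet, rescaling a droplet's counts by a constant factor leaves it unchanged; a nonlinear transform such as log1p would not.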
Closed by #238
Hi, I ran CellBender (v0.2.2) on 60 samples, setting all of the parameters manually after checking the CellRanger barcode rank plot. For some samples, however, I find that the MT gene ratio is higher than in the CellRanger matrix. What should I do? Thanks!