broadinstitute / CellBender

CellBender is a software package for eliminating technical artifacts from high-throughput single-cell RNA sequencing (scRNA-seq) data.
https://cellbender.rtfd.io
BSD 3-Clause "New" or "Revised" License

High MT genes ratio at some samples #179

Closed dyinboisry4u closed 1 year ago

dyinboisry4u commented 1 year ago

Hi, I ran CellBender (v0.2.2) on 60 samples, setting all parameters manually after checking each CellRanger barcode rank plot. But for some samples, I find the MT gene ratio is higher than in the CellRanger matrix. What should I do? [attached: scatter plot of median MT fraction per sample] Thanks!

sjfleming commented 1 year ago

Is this a plot where each point is

adata = # cellranger or cellbender [cell, gene] AnnData, for all droplets including empty droplets
mt_genes = # your mitochondrial gene list

y_value = adata[:, mt_genes].X.sum() / adata.X.sum()

?

I guess one of my questions would be: does the CellRanger calculation include the counts in empty droplets?

dyinboisry4u commented 1 year ago

Apologies for not being clear. The AnnData objects are actually xxx_cellbender_filtered.h5 (CellBender) and xxx/outs/filtered_feature_bc_matrix (CellRanger); both have already been through cell calling, so neither should include empty droplets.

Here is how I get this plot:

For the CellBender AnnData, I first remove the "cells" with zero counts. Then I use scanpy.pp.calculate_qc_metrics to calculate the MT gene ratio:

adataAll.var['mt'] = adataAll.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adataAll, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
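The zero-count filtering step mentioned above can be sketched in plain NumPy (toy data, illustrative names; on an AnnData, `sc.pp.filter_cells(adata, min_counts=1)` does the same thing):

```python
import numpy as np

# toy counts matrix: 3 droplets x 3 genes, one droplet is all-zero
X = np.array([[5, 0, 3],
              [0, 0, 0],
              [1, 2, 0]], dtype=float)

# keep only droplets (rows) with at least one count
keep = X.sum(axis=1) > 0
X_filtered = X[keep]
```
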

Then I calculate the median MT gene ratio for each sample to draw the plot:

mitoMedian = (adataAll.obs.groupby('SampleID')['pct_counts_mt']
              .median().rename('PctMtGeneMedian').reset_index())
sns.scatterplot(data=mitoMedian, x="SampleID", y="PctMtGeneMedian", s=50, hue="SampleID", palette=myPalette)

Thanks for your reply~

sjfleming commented 1 year ago

Hi @dyinboisry4u , okay thanks for that explanation. I think I do have a hypothesis! But it would need to be tested :)

My hypothesis is that the extra "non-empty" droplets that are being retained in the CellBender filtered output (as compared to CellRanger's cell calling algorithm) are actually droplets with high MT fraction. This would in turn make it look like the median MT fraction per sample had actually gone up after CellBender.

You can see an example here where some of the low-UMI count "non-empty" droplets that CellBender finds are actually high MT-fraction droplets. In that experiment, they might represent dying cells.

If you consider the same set of droplets in both the CellRanger data and the CellBender data, then you will see that the mitochondrial read fraction only goes down: CellBender only subtracts counts from the count matrix; it never adds counts.
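This can be illustrated with a small toy example (all barcodes and counts invented for illustration, not real output): when both matrices are restricted to their shared barcodes, the removed ambient counts (which here include MT reads) lower the overall MT fraction, while the extra high-MT droplet pulls the all-droplet fraction up.

```python
import numpy as np

def mt_fraction(X, mt_mask):
    """Overall MT read fraction across a set of droplets (rows)."""
    return X[:, mt_mask].sum() / X.sum()

# toy data: 4 genes, the last one mitochondrial
mt_mask = np.array([False, False, False, True])

cr_barcodes = ["AAA", "CCC", "GGG"]
cb_barcodes = ["AAA", "CCC", "GGG", "TTT"]     # CellBender keeps one extra droplet
cr_X = np.array([[10, 5, 5, 2],
                 [ 8, 6, 4, 1],
                 [12, 3, 5, 3]], dtype=float)
cb_X = np.array([[ 9, 4, 5, 1],                # same droplets, counts only removed
                 [ 7, 6, 3, 1],
                 [11, 3, 4, 2],
                 [ 1, 0, 1, 6]], dtype=float)  # extra droplet: high MT fraction

# restrict CellBender output to the shared barcodes before comparing
shared = [b for b in cb_barcodes if b in cr_barcodes]
idx = [cb_barcodes.index(b) for b in shared]

frac_cr = mt_fraction(cr_X, mt_mask)            # CellRanger, shared droplets
frac_cb_shared = mt_fraction(cb_X[idx], mt_mask)  # CellBender, same droplets
frac_cb_all = mt_fraction(cb_X, mt_mask)        # CellBender, all droplets
```

On the shared droplets the fraction goes down; only the extra droplet makes the per-sample summary go up.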

dyinboisry4u commented 1 year ago

Hi @sjfleming, your hypothesis is correct. Here is my test: for the samples with a higher MT ratio, I took the barcodes of the extra "non-empty" droplets and calculated their MT gene fraction. These droplets do indeed have a high MT fraction:

[attached: plots showing high MT fractions for the CellBender-only droplets]
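A sketch of that barcode-set check (toy barcodes and counts, purely illustrative):

```python
import numpy as np

# hypothetical barcode lists from the two filtered outputs
cb_barcodes = np.array(["AAA", "CCC", "GGG", "TTT", "ACG"])
cr_barcodes = np.array(["AAA", "CCC", "GGG"])
extra = np.setdiff1d(cb_barcodes, cr_barcodes)  # CellBender-only droplets

# toy counts aligned with cb_barcodes; last gene is mitochondrial
X = np.array([[10, 5, 2],
              [ 8, 6, 1],
              [12, 3, 3],
              [ 1, 1, 7],
              [ 2, 0, 5]], dtype=float)
mt_mask = np.array([False, False, True])

# per-droplet MT percentage for the extra droplets only
rows = np.isin(cb_barcodes, extra)
pct_mt = X[rows][:, mt_mask].sum(axis=1) / X[rows].sum(axis=1) * 100
```
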

I also have some questions:

  1. Cell calling:
     i. For most of my samples CellBender calls more cells than CellRanger, but for some samples CellRanger calls more (maybe caused by my parameters). Should I expect CellBender to always call more cells than CellRanger?
     ii. Although in my case these extra cells are droplets with a high MT fraction, I want to rescue some low-UMI-count cell types (e.g. neutrophils). What should I do?
     iii. Additionally, should I try to make CellBender include all of the CellRanger cells? What do you think of the CellRanger-unique cells?
  2. QC failures and weird-looking samples: I have some QC-failure and weird-looking samples. I tested some parameters; are these correct? All the samples were loaded with about 16,000 cells and should recover about 10,000 cells (10x v3.1 manual).

[attached: barcode rank plots and CellBender outputs for samples 1, 3, 4, and 5]

Thanks!~

sjfleming commented 1 year ago

Okay, yes, in that case, I think we have a satisfactory answer to the MT part. As for your other questions:

  1. Cell calling:
     i. CellBender does typically call more cells than CellRanger, but that is just anecdotal evidence based on looking at hundreds of samples. There could be cases where that's not true. But in general, it's true.
     ii. Yes, although you see a high MT fraction among those "extra" droplets, that does not mean they are all garbage. Indeed, I very often find that there are good cells (and cell types that would otherwise be lost) hiding in there. So, for the "what to do" question: I recommend starting with all the CellBender "non-empty" droplets and then performing your own cell quality control. One thing you'll definitely want to include is a sensible cutoff on MT fraction. Is there interesting biology in high MT counts in your samples? Or are high MT counts more likely dying cells? If dying cells, then it might make sense to use a pretty stringent threshold during cell QC. Don't worry about throwing out thousands of droplets in some samples. Sometimes that's what it takes... but you may see that you still end up with many hundreds or more droplets that pass QC, and they might be low-UMI cell types. I also like to use exonic read fraction as a QC metric when doing snRNA-seq (if this is nuclear data). And I like to look at the entropy of gene expression as a QC metric. This is unpublished, as far as I know, but I am sure somebody else has thought of it. I find that metric quite useful, as droplets with low entropy are often debris or dying cells.
     iii. It is interesting that you see CellRanger-unique cells. I usually do not see any. But it might be worth keeping them... at least to see what they look like...
  2. This is an important topic. Nobody wants to have entire samples fail QC, but it does happen from time to time. I can give you my opinions on the samples you attached:
    1. Well the UMI curve is super unusual, but it looks fine to me. CellBender run looks good. Looks like you have several cell types with very different UMI counts in this sample. CellBender comes up with what I would consider to be the "right" answer.
    2. This is a case where I think CellRanger got it very wrong, but I think your sample looks okay. I think what's truly going on is that the empty droplet plateau is out near 100 UMI counts, well past your current --total-droplets-included 9000. This would mean the ambient RNA is more similar to sample 1. You might disagree... but you could try --expected-cells 10000 and --total-droplets-included as something like 30k or 40k. I would need to zoom in on that UMI curve, but you'd want to include droplets down to around 100 UMI counts.
    3. I suspect this is like 2. Is it possible there are more cells captured than you intended? I think the empty droplets here similarly have ~100 UMI counts.
    4. Now this one is very nearly a QC failure. On the UMI curve itself, we struggle to see any kind of elbow. BUT... the PCA of latent gene expression does show structure. So there are some cells in here. I think your "last run" is probably the best you're going to get. If you are keeping this sample, I would do stringent cell QC, and don't be afraid to throw away a lot of droplets.
    5. This one looks totally fine to me. Parameters look right, and CellBender did what I would expect.
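For reference, a run along the lines suggested for sample 2 might look like this (input and output file names are illustrative; the flags are the standard CellBender v0.2 options discussed above):

```shell
cellbender remove-background \
    --input raw_feature_bc_matrix.h5 \
    --output sample2_cellbender.h5 \
    --expected-cells 10000 \
    --total-droplets-included 30000 \
    --cuda
```
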
dyinboisry4u commented 1 year ago

@sjfleming, thanks for your detailed explanation!

I also like to look at the entropy of gene expression as a QC metric. This is unpublished, as far as I know, but I am sure somebody else has thought of it.

I just found this: https://hal.science/hal-03378505 , by the way, how do you calculate the entropy of gene expression?


but you could try --expected-cells 10000 and --total-droplets-included as something like 30k or 40k.

Yes, for samples ii. and iii., I had tried a larger --expected-cells and --total-droplets-included, until my collaborator told me these samples have lots of "cell fragments" (introduced before CellBender, maybe at the 10x platform QC step). Here are the former results:

[attached: rank plots and PCA embeddings from the earlier runs]

The rank plot indeed shows an elbow, but if I use a larger --expected-cells and --total-droplets-included, the PCA embedding plot looks like a "tangle of black". And for sample iii., the test set curve looks super wobbly.

sjfleming commented 1 year ago

Re: entropy, I use the python package called ndd (https://github.com/simomarsili/ndd) to compute it for each droplet.

I actually think those runs with larger numbers of --total-droplets-included are better. Usually we like to see that CellBender is seeing some empty droplets toward the end of the total droplets included.

Even if some of those droplets end up being "cell fragments" that deserve to be eliminated during cell QC, it is not CellBender that should be eliminating them. From the perspective of CellBender, those are non-empty droplets.

dyinboisry4u commented 1 year ago

Hi, thanks again for your reply! Now I understand, and I'll use a larger --total-droplets-included. Two last quick questions:

  1. Computing entropy using ndd: should I use a normalized counts matrix, or just the raw counts matrix as input, like this?
    # raw counts matrix
    np.apply_along_axis(ndd.entropy, axis=1, arr=adata.X.todense(), k=adata.n_vars)

    Sorry for not being familiar with Bayesian entropy estimation :(

  2. For sample iii., the test set curve looks super wobbly; is this acceptable?
sjfleming commented 1 year ago
  1. The raw counts are what you want. I do think what you propose would work fine. The only downside to that code is that you have to do adata.X.todense(), which is memory-hungry. As long as that fits in memory, this is a fine way to do it.
  2. Ideally the test curve would not be wobbly like that, but it doesn't really worry me, for the following reason: it returns to a nice stable value once the model has converged. If it were the train curve wobbling like that, then I would be worried. But I think what is happening is that the test set here has been picked (randomly) in just such a way that, during training, some of the neural networks ended up overfitting to the training set, but this overfitting (i.e. inability to generalize to the test set) disappears once the model has converged. So I think in your case it's fine.
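As a rough illustration of why entropy separates debris from real cells, here is a plug-in (maximum-likelihood) entropy sketch in plain NumPy. It is a simpler, biased stand-in for ndd's Bayesian estimator (which corrects the downward bias on sparse droplets), and the two droplets are toy data:

```python
import numpy as np

def droplet_entropy(counts):
    """Plug-in entropy (in nats) of one droplet's gene count vector.

    Biased low for sparse droplets; ndd's Bayesian estimator is
    designed to correct that, but the qualitative picture is the same.
    """
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

# counts spread over many genes -> high entropy (cell-like);
# counts dominated by a couple of genes -> low entropy (debris-like)
diverse = np.array([5, 5, 5, 5, 5, 5])
skewed = np.array([28, 1, 1, 0, 0, 0])
```
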
sjfleming commented 1 year ago

Train and test curve wobbly-ness has been addressed to a large extent in v0.3.0. Hopefully that holds true when people test it out.

Closed by #238