How to check whether cellbender might have overcorrected the count matrix?

Jiayi-Zheng commented 1 year ago

Hello, there is this sample that I re-did analysis after learning about cellbender. I was checking for some gene family expressions (I'm working on developmental bio so it's mainly growth factors family) on other scRNA-seq datasets when I thought difference might come from cellbender processing and came to check. This is the result (I'm using Seurat to do heatmaps, it's a human fetal sample) when I tried to grep all FOX family genes from dataset, the graph above is unprocessed, the graph below is after cellbender processing, I did not do other correction to count matrix aside from general normalization and cell cycle regression by seurat tutorial.

NOTCH family (above pre-process, below processed): NKX family (NKX2-1 is a key marker and something we expected for expression):

The report from cellbender looked pretty normal. The training progress had a downward curve but went back up smoothly at the later stage. GW15_CellBender.pdf

My question arises from that I was doing analysis on some human stem cell culture (cellbender processed), and was comparing the sample expression profile to some human fetal dataset I analyzed previously (without cellbender), and saw this pattern (as above). Initially I thought it might be problem with in vivo and in vitro growth environment, but then realized I've actually got some analyzed dataset for comparison and realized the issue).

Wonder if other people have seen similar things?

The command line I used for cellbender is the one as example, the cell estimation number is modified according to estimation from cellranger report. The empty droplet range was also set by the curve and instruction accordingly.

Going back to my topic of discussion. One way of checking for ambient RNA is, of course, with the knowledge of certain RNAs should or should not be expressed in certain cells. However, when dealing with cells that we are unsure about its content, is there some way to check for the possibility of overcorrection?

Thank you so much!

sjfleming commented 1 year ago

Hi @Jiayi-Zheng , this is a great question.

Two things:

We are in the process of building a few more guarantees into version 0.3.0 (not released yet), which will explicitly guarantee that each gene has no more counts removed than what the noise model can really account for. In version 0.2.0, the solution is certainly guided in that direction as much as we can, but there is no explicit mathematical guarantee, which there will be in version 0.3.0
That being said, version 0.2.0 or 0.2.1 should still be able to do a good job here. Validation is the hard part! But I might have a few suggestions...

It looks like the FOX family genes are a good example. Seurat's heatmap plot looks like it shows a lot more variability in the raw plot than in the cellbender output plot. But I have a few questions:

Have you plotted the z-scored gene expression there? Is that why the scale goes from -2 to +2 ? It might be more useful for the purposes of a cellbender comparison to plot the raw counts themselves. Because you're interested in transcription factors, and transcription factors are typically super lowly expressed, we can get a better picture of what's going on if we plot raw counts (integers).
Once you make a plot of the integer counts before and after, you can get an idea of how many counts are being removed per cell. You can also try to compute the average counts per cell (maybe broken down by cell type) before and after cellbender. Then it would be useful to also compute the average counts in empty droplets (making your best approximation). If the difference in raw versus cellbender is roughly similar to the average counts in empty droplets, then you can have a lot more confidence that the cellbender result is justified.

I suspect that a lot of the reason that the plots for cellbender outputs for the NOTCH family and the NKX family look strange is that the counts are so low to begin with in the raw data, that you're in the regime of 0, 1, 2, or 3 counts per droplet. That's what would account for the discrete nature of the colors in the plot for the raw data.

Removal of background noise from entries in the count matrix that start out as 0, 1, 2, or 3 in the raw data is a very hard problem... if you do find that cellbender seems to be overcorrecting on these low-count genes you're interested in, then there is one more possible alternative you could try: you could exclude low-count genes (below some kind of cutoff you impose) from the cellbender analysis, and only use the cellbender output data for genes that are above your cutoff. There is nothing wrong with this in principle, in my mind. It is easier for cellbender to denoise highly-expressed genes. And highly expressed genes also happen to contribute much more to background noise. Lowly expressed genes contribute very little to background noise, and cellbender should not change their counts drastically. Again, versions before 0.3.0 lack an explicit mathematical guarantee of this (though it should still do the best it can)... but we are trying to make it a real guarantee in v0.3.0.

sjfleming commented 1 year ago

Overcorrection is something we thought very hard about in v0.3.0, and so there more guarantees that we are not overcorrecting (at least, not more than the specified false positive rate --fpr).

Closed by #238

broadinstitute / CellBender

How to check whether cellbender might have overcorrected the count matrix? #173