Closed Jiayi-Zheng closed 1 year ago
Hi @Jiayi-Zheng , this is a great question.
Two things:
It looks like the FOX family genes are a good example. The Seurat heatmap seems to show a lot more variability for the raw data than for the cellbender output. But I have a few questions:
I suspect that a lot of the reason that the plots for cellbender outputs for the NOTCH family and the NKX family look strange is that the counts are so low to begin with in the raw data, that you're in the regime of 0, 1, 2, or 3 counts per droplet. That's what would account for the discrete nature of the colors in the plot for the raw data.
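A quick way to check whether a gene family really sits in that 0, 1, 2, or 3 counts-per-droplet regime is to summarize the raw counts per gene. This is only a minimal sketch, not part of cellbender: the function name `low_count_summary`, the `max_low` parameter, and the cells-by-genes orientation of the matrix are all assumptions made for illustration.

```python
import numpy as np

def low_count_summary(raw, max_low=3):
    """Per-gene maximum count and fraction of nonzero entries that are <= max_low.

    raw: array of shape (n_cells, n_genes) holding raw integer counts
         (orientation is an assumption; transpose if yours is genes x cells).
    max_low: hypothetical cutoff for what counts as a "low" entry.
    """
    raw = np.asarray(raw)
    nonzero = raw > 0
    low = nonzero & (raw <= max_low)
    n_nonzero = nonzero.sum(axis=0)
    # Avoid dividing by zero for genes that are never detected.
    frac_low = np.divide(
        low.sum(axis=0),
        n_nonzero,
        out=np.zeros(raw.shape[1]),
        where=n_nonzero > 0,
    )
    return raw.max(axis=0), frac_low

# Toy data: gene 0 is highly expressed, genes 1 and 2 are low-count genes.
toy_raw = np.array([[50, 1, 0],
                    [60, 0, 2],
                    [40, 3, 0]])
max_per_gene, frac_low = low_count_summary(toy_raw)
```

A gene whose nonzero entries are almost all at or below `max_low` is in the regime where the heatmap colors will look discrete and where denoising is hardest.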
Removal of background noise from entries in the count matrix that start out as 0, 1, 2, or 3 in the raw data is a very hard problem... if you do find that cellbender seems to be overcorrecting on these low-count genes you're interested in, then there is one more possible alternative you could try: you could exclude low-count genes (below some kind of cutoff you impose) from the cellbender analysis, and only use the cellbender output data for genes that are above your cutoff. There is nothing wrong with this in principle, in my mind. It is easier for cellbender to denoise highly-expressed genes. And highly expressed genes also happen to contribute much more to background noise. Lowly expressed genes contribute very little to background noise, and cellbender should not change their counts drastically. Again, versions before 0.3.0 lack an explicit mathematical guarantee of this (though it should still do the best it can)... but we are trying to make it a real guarantee in v0.3.0.
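The cutoff idea above could be sketched roughly as follows. This version merges the two matrices after the fact (raw counts for low-count genes, cellbender counts for the rest) rather than re-running cellbender on a filtered input, which is one possible reading of the suggestion. The function name `hybrid_counts`, the `min_total_count` cutoff and its default value, and the cells-by-genes orientation are assumptions for illustration, not cellbender options.

```python
import numpy as np

def hybrid_counts(raw, denoised, min_total_count=100):
    """Use denoised counts for well-expressed genes and raw counts otherwise.

    raw, denoised: arrays of shape (n_cells, n_genes) with identical
                   cell and gene ordering (an assumption of this sketch).
    min_total_count: hypothetical cutoff on a gene's total raw count;
                     choose it for your own data.
    """
    raw = np.asarray(raw)
    denoised = np.asarray(denoised)
    if raw.shape != denoised.shape:
        raise ValueError("raw and denoised must have the same shape")
    gene_totals = raw.sum(axis=0)
    use_denoised = gene_totals >= min_total_count
    # Broadcast the per-gene mask across all cells.
    hybrid = np.where(use_denoised[np.newaxis, :], denoised, raw)
    return hybrid, use_denoised

# Toy data: only gene 0 passes the cutoff, so only its counts are replaced.
toy_raw = np.array([[50, 1, 0],
                    [60, 0, 2],
                    [40, 2, 1]])
toy_denoised = np.array([[45, 0, 0],
                         [55, 0, 1],
                         [38, 1, 0]])
hybrid, used_denoised = hybrid_counts(toy_raw, toy_denoised, min_total_count=100)
```

The returned mask records which genes took their values from the denoised matrix, which makes it easy to report the cutoff's effect alongside any downstream plots.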
Overcorrection is something we thought very hard about in v0.3.0, and so there are more guarantees that we are not overcorrecting (at least, not more than the specified false positive rate --fpr).
Closed by #238
Hello, I re-analyzed one of my samples after learning about cellbender. I was checking the expression of some gene families (I work in developmental biology, so mainly growth factor families) in other scRNA-seq datasets when I suspected that a difference might come from cellbender processing, so I came here to check. These are the results for a human fetal sample (heatmaps made with Seurat) when I grep all FOX family genes from the dataset: the upper plot is unprocessed and the lower plot is after cellbender processing. I did not apply any other correction to the count matrix apart from the general normalization and cell cycle regression from the Seurat tutorial.
NOTCH family (above: unprocessed, below: processed):
NKX family (NKX2-1 is a key marker that we expected to be expressed):
The report from cellbender looked pretty normal. The training progress had a downward curve but went back up smoothly at the later stage. GW15_CellBender.pdf
My question arose while I was analyzing some human stem cell cultures (cellbender processed) and comparing their expression profiles to a human fetal dataset I had analyzed previously (without cellbender), where I saw this pattern (as above). Initially I thought it might be due to differences between in vivo and in vitro growth environments, but then I realized I already had an analyzed dataset for comparison, and that is how I noticed the issue.
Wonder if other people have seen similar things?
The command line I used for cellbender is the one given as the example, with the expected number of cells modified according to the estimate from the cellranger report. The empty droplet range was also set according to the curve and the instructions.
Going back to my topic of discussion: one way of checking for ambient RNA is, of course, to use prior knowledge of which RNAs should or should not be expressed in certain cells. However, when dealing with cells whose content we are unsure about, is there some way to check for possible overcorrection?
Thank you so much!