Hi @Munfred, yes, I would suspect that this is an artifact of overfitting. Although we're using an unsupervised approach with priors on latent variables that should not be overly susceptible to overfitting, it is still not advisable to train for too many epochs. I have never tested with nearly that many epochs. I typically use something in the range of 150-300. I have never surpassed 300, and I would suggest that you don't either.
I see that on some of the datasets, it looks like 5000 expected cells might be a little bit high. Ideally, you wouldn't need to fine-tune this parameter... however, it is used to obtain a prior on the expected number of counts in real cells. If there are only a few hundred real cells, but you say that expected cells is 5000, then the algorithm will think that the expected number of UMIs per cell is quite a bit lower than it is in reality. This can cause difficulties in learning cell probabilities accurately. I wonder if, in line with the CellRanger results, a value for --expected-cells of 1000 might be more appropriate for most datasets. Especially the dataset that dips down near 1000 on the x-axis (blue-ish line)... that one might even be in the hundreds.
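Just to make that concrete, here is a rough toy sketch in Python (not the actual remove-background code) of why overstating expected cells drags the prior on counts per cell downward:

```python
import numpy as np

# Hypothetical illustration only, not CellBender's implementation:
# the prior on UMIs per real cell is anchored to the counts of the
# top `expected_cells` barcodes on the barcode-rank curve.
def prior_counts_per_cell(umi_per_barcode: np.ndarray, expected_cells: int) -> float:
    """Median UMI count among the top `expected_cells` barcodes."""
    sorted_counts = np.sort(umi_per_barcode)[::-1]             # descending rank order
    top = sorted_counts[: min(expected_cells, len(sorted_counts))]
    return float(np.median(top))

# Toy dataset: ~800 real cells (~5000 UMIs each) plus many empty droplets (~100 UMIs).
rng = np.random.default_rng(0)
counts = np.concatenate([
    rng.poisson(5000, size=800),     # real cells
    rng.poisson(100, size=14200),    # ambient / empty droplets
])
print(prior_counts_per_cell(counts, expected_cells=1000))    # close to the true ~5000
print(prior_counts_per_cell(counts, expected_cells=5000))    # dragged down toward ~100
```

If the expected-cells guess is several times larger than the true number of cells, the prior lands near the ambient counts rather than the cell counts, which is exactly the failure mode described above.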
Also, if you're using the current v0.1.0 version, then you can hold out some test data by reaching into the code and changing the line that sets the fraction of data used for training. That might be a way to double-check whether overfitting is an issue. A value of 0.9 would mean that 10% of the dataset will be held out as test data. The learning curve will be plotted in the PDF output automatically.
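For anyone reading along, that 0.9 is just an ordinary random train/test split over droplets; a minimal sketch of the idea (with made-up names, not the actual v0.1.0 code) would be:

```python
import numpy as np

def train_test_split_barcodes(num_barcodes: int, training_fraction: float = 0.9, seed: int = 0):
    """Randomly assign barcode indices to a training set and a held-out test set.

    With training_fraction = 0.9, 10% of the droplets are held out so the
    test loss can be tracked alongside the training loss every epoch.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_barcodes)
    n_train = int(round(training_fraction * num_barcodes))
    return order[:n_train], order[n_train:]    # (train indices, test indices)

train_idx, test_idx = train_test_split_barcodes(15000, training_fraction=0.9)
print(len(train_idx), len(test_idx))           # 13500 1500
```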
Awesome, thanks so much! I'll try it out with the held-out data and see how it goes.
I'd suggest making 10% held-out data the default. Thanks!
Hello, I tried again with expected-cells = 1000, total-droplets-included = 15000, and epochs = 300, plus an 80% train / 20% test split, but it still predicts 15k cells for several datasets. From the train/test plots (I'm attaching some below), it is clear that it is overfitting. Perhaps some kind of automatic early stopping would be the best way to deal with this, since otherwise I guess most people will end up tuning parameters by hand (a rough sketch of the idea is below, after the counts). I'm giving it one last shot with 75 epochs of training, since that seems to be about as good as it gets for most datasets.
Barcodes retained per dataset (300 epochs): 15000, 15000, 15000, 15000, 5691, 5246, 8849, 10143, 10705, 4903, 6656, 15000, 6948, 7686, 6015, 5055, 5665, 15000, 5436, 8442, 9271
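For concreteness, the kind of early-stopping loop I have in mind is something like this (a generic sketch with placeholder training/evaluation hooks, not actual CellBender code):

```python
def train_with_early_stopping(train_one_epoch, evaluate_test_loss,
                              max_epochs: int = 300, patience: int = 10) -> float:
    """Stop training once the held-out loss stops improving.

    `train_one_epoch` and `evaluate_test_loss` are placeholders for whatever
    the training framework provides; `patience` is how many epochs without
    improvement to tolerate before stopping.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        test_loss = evaluate_test_loss()
        if test_loss < best_loss:
            best_loss = test_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Stopping at epoch {epoch}: no test-loss improvement for {patience} epochs")
                break
    return best_loss
```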
Hmm, after using only 75 training epochs the results are similar. I'm attaching a sample of the output PDFs in case they are of interest. At this point I don't feel very confident using the CellBender barcodes; I will have to try out some more manual filtering options.
Barcodes retained per dataset (75 epochs): 15000, 15000, 15000, 15000, 5610, 4357, 8574, 9522, 9787, 4759, 5972, 15000, 6899, 5819, 5951, 5035, 5691, 15000, 5414, 8425, 9248
Hi @Munfred, yes, it seems like remove-background is really struggling with your datasets. As you can see by eye, for example in the last dataset you showed, there is really no easy way to guess what is a cell versus what is empty... there is no obvious elbow or knee in that UMI plot whatsoever.
An easy dataset might look like this:
While a "hard" dataset would look like this:
What you have there is very difficult indeed. Are you sure there hasn't been some sort of quality control failure for these samples? It really does not look like you have many cells at all, at least to my eyes.
By the way, the latent space for that "hard" example above looks like this:
To me, your latent space being a sort of formless blob seems to be another possible hint that there may be some kind of a quality control failure. It's not definitive of course... for some experiments without much cellular diversity, the latent space will tend to just look like a blob.
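If it helps, this is the kind of UMI-rank plot I mean — a minimal, generic matplotlib sketch for drawing the curve from an array of per-barcode total counts (not code from CellBender itself):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_barcode_rank(umi_per_barcode: np.ndarray, title: str = "Barcode rank plot") -> None:
    """Log-log plot of total UMI count versus barcode rank.

    An 'easy' dataset shows a sharp elbow separating cells from empty
    droplets; a 'hard' dataset decays smoothly with no obvious knee.
    """
    sorted_counts = np.sort(umi_per_barcode)[::-1]
    ranks = np.arange(1, len(sorted_counts) + 1)
    plt.loglog(ranks, sorted_counts)
    plt.xlabel("Barcode rank")
    plt.ylabel("UMI counts per barcode")
    plt.title(title)
    plt.show()
```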
Yes, I do believe there is something funky with the data I'm working with. It is a single-nucleus dataset of C. elegans, generated with a new dissociation protocol. It seems to have a lot of background (which is odd, since the nuclei are washed).
Thanks for the help. I'll give CellBender another shot next time I have a less problematic dataset.
Hello, I just tried out CellBender on 21 datasets of 10x v2 single-nucleus data that should be generally comparable (Cell Ranger retains ~1500-3500 barcodes in each), but I got wildly variable results with CellBender. I used the following parameters: expected-cells = 5000, total-droplets-included = 15000, epochs = 1500. The numbers of barcodes retained by CellBender (from the output.csv file) are below. I'm surprised to see so many datasets with the maximum number of 15k, as well as 2 datasets with 0. Could this be a result of overfitting because I did 1500 epochs? In the output.pdf I noticed that only a train score is provided, so it's hard to diagnose overfitting... Here are the knee plots in case they help diagnose things. All datasets are pretty similar.
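A quick way to tally those retained-barcode numbers, assuming each dataset's output.csv lists one barcode per line (the directory layout and glob pattern here are just placeholders):

```python
import glob

# Count retained barcodes for each dataset's remove-background output CSV.
# Assumes one barcode per line; adjust the glob to match your folder layout.
for path in sorted(glob.glob("*/output.csv")):
    with open(path) as f:
        n_barcodes = sum(1 for line in f if line.strip())
    print(path, n_barcodes)
```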