broadinstitute / CellBender

CellBender is a software package for eliminating technical artifacts from high-throughput single-cell RNA sequencing (scRNA-seq) data.
https://cellbender.rtfd.io
BSD 3-Clause "New" or "Revised" License

Wildly variable retained barcode list #73

Closed · Munfred closed this issue 3 years ago

Munfred commented 4 years ago

Hello, I just tried out CellBender on 21 datasets of 10x v2 single nuclei data that should be generally comparable (Cell Ranger retains ~1500-3500 barcodes in each) but I got wildly variable results with CellBender.

I used the following parameters: expected-cells = 5000, total-droplets-included = 15000, epochs = 1500.

The number of barcodes retained by CellBender (from the output.csv file) is listed below. I'm surprised to see so many datasets at the maximum of 15k, as well as two datasets with 0. Could this be a result of overfitting because I trained for 1500 epochs? In the output.pdf I noticed that only a train score is provided, so it's hard to diagnose overfitting...

Here are the knee plots in case they help diagnose things. All datasets are pretty similar.

(image: knee plots for the 21 datasets)

15000
15000
15000
15000
6001
5430
11420
0
11426
5448
6628
15000
8731
7871
7454
7139
0
15000
6024
8564
9622
sjfleming commented 4 years ago

Hi @Munfred , yes, I would suspect that this is an artifact of overfitting. Although we're using an unsupervised approach with priors on latent variables that should not be overly susceptible to overfitting, it is still not advisable to train for too many epochs. I have never tested with nearly that many epochs. I typically use something in the range of 150 - 300; I have never surpassed 300, and I would suggest that you don't either.

I see that on some of the datasets, it looks like 5000 expected cells might be a little bit high. Ideally, you wouldn't need to fine-tune this parameter... however, it is used to obtain a prior on the expected number of counts in real cells. If there are only a few hundred real cells, but you say that expected cells is 5000, then the algorithm will think that the expected number of UMIs per cell is quite a bit lower than it is in reality. This can cause difficulties in learning cell probabilities accurately. I wonder if, in line with the CellRanger results, a value for --expected-cells of 1000 might be more appropriate for most datasets. Especially the dataset that dips down near 1000 on the x-axis (blue-ish line)... that one might even be in the hundreds.
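To make the point above concrete, here is a toy sketch of how an expected-cells value can feed a prior on UMI counts per cell. This is illustrative only: the function name and the "median of the top-N barcodes" rule are made up for illustration, not CellBender's actual implementation.

```python
import numpy as np

def umi_prior_from_expected_cells(umi_counts, expected_cells):
    """Hypothetical sketch: estimate a prior on UMI counts per real cell
    as the median count among the top `expected_cells` barcodes."""
    top = np.sort(umi_counts)[::-1][:expected_cells]
    return float(np.median(top))

# Toy data: 500 real cells (~5000 UMIs each) among 14500 empty droplets (~100).
rng = np.random.default_rng(0)
counts = np.concatenate([
    rng.poisson(5000, size=500),   # real cells
    rng.poisson(100, size=14500),  # empty droplets
])
print(umi_prior_from_expected_cells(counts, 500))   # close to 5000
print(umi_prior_from_expected_cells(counts, 5000))  # dragged down toward ~100
```

Overstating --expected-cells by an order of magnitude drags the prior down toward the empty-droplet level, which is the failure mode described above.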

sjfleming commented 4 years ago

Also, if you're using the current v0.1.0 version, then you can hold out some test data by reaching into the code and changing this line:

https://github.com/broadinstitute/CellBender/blob/20bab467408d6822874dd6dadbd9c368b580721f/cellbender/remove_background/consts.py#L24

That might be a way to double-check and see if overfitting is an issue. A value of 0.9 would mean that 10% of the dataset will be held out as test data. The learning curve will be plotted in the PDF output automatically.
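For readers following along, the held-out fraction amounts to a simple random split of the barcodes. A minimal sketch of the idea (the constant name below is a hypothetical stand-in for the linked line in consts.py, and the split logic is generic, not CellBender's actual code):

```python
import numpy as np

# Hypothetical stand-in for the constant at consts.py#L24 (v0.1.0 has no
# command-line flag for this, so the file itself must be edited).
TRAINING_FRACTION = 0.9  # 0.9 train => 10% of the data held out as test

def split_barcodes(n_barcodes, training_fraction, seed=0):
    """Randomly partition barcode indices into train and test sets."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_barcodes)
    n_train = int(round(training_fraction * n_barcodes))
    return perm[:n_train], perm[n_train:]

train_idx, test_idx = split_barcodes(15000, TRAINING_FRACTION)
print(len(train_idx), len(test_idx))  # 13500 1500
```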

Munfred commented 4 years ago

Awesome thanks so much, I'll try it out with the held out data and see how it goes!

I'd suggest making 10% held-out data the default. Thanks!

Munfred commented 4 years ago

Hello, I tried again with expected-cells = 1000, total-droplets-included = 15000, epochs = 300, and an 80%/20% train/test split, but it still predicts 15k cells for several datasets. From the train/test plots (some attached below), it is clear that it is overfitting. Perhaps some kind of automatic early stopping would be the best way to deal with this, since I guess most people will be tuning parameters. I'm giving it one last shot with 75 training epochs, since that seems to be about as good as it gets for most datasets.

15000
15000
15000
15000
5691
5246
8849
10143
10705
4903
6656
15000
6948
7686
6015
5055
5665
15000
5436
8442
9271

(images: train/test plots for five of the datasets)
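The automatic early stopping suggested above was not a CellBender feature at the time of this thread; a generic patience-based sketch of the idea (all names hypothetical):

```python
def early_stop_epoch(test_losses, patience=10):
    """Return the epoch with the best held-out loss, stopping the scan once
    the loss has failed to improve for `patience` consecutive epochs."""
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(test_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# Toy learning curve: improves for 75 epochs, then the held-out loss rises.
losses = [100.0 - e for e in range(75)] + [26.0 + 0.5 * e for e in range(225)]
print(early_stop_epoch(losses))  # 74
```

On curves like the ones attached, this would have stopped training around the point where the held-out loss starts climbing, rather than continuing for hundreds of epochs.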

Munfred commented 4 years ago

Hmm, after using only 75 training epochs the results are similar. I'm attaching a sampling of the output PDFs in case they're of interest. At this point I don't feel very confident using the CellBender barcodes; I will have to try out some more manual filtering options.

15000
15000
15000
15000
5610
4357
8574
9522
9787
4759
5972
15000
6899
5819
5951
5035
5691
15000
5414
8425
9248

(images: sample output PDFs for four of the datasets)

sjfleming commented 4 years ago

Hi @Munfred , yes, it seems like remove-background is really struggling with your datasets. As you can see by eye in, for example, the last dataset you showed, there is really no easy way to guess what is a cell versus what is an empty droplet... there is no obvious elbow or knee in that UMI plot whatsoever.

An easy dataset might look like this: (image: barcode rank plot with a clear knee)

While a "hard" dataset would look like this: (image: barcode rank plot without a clear knee)

What you have there is very difficult indeed. Are you sure there hasn't been some sort of quality control failure for these samples? It really does not look like you have many cells at all, at least to my eyes.
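As an aside for readers, whether a rank plot has an "obvious knee" can be quantified with a common heuristic: pick the point on the log-log barcode rank curve farthest from the chord joining its endpoints. This is a generic knee-finding trick, not CellBender's (or Cell Ranger's) actual cell-calling method:

```python
import numpy as np

def knee_rank(umi_counts):
    """Generic knee heuristic (not CellBender's cell-calling method): return
    the 0-based rank on the log-log barcode rank curve that lies farthest
    from the straight chord joining the curve's two endpoints."""
    y = np.log10(np.sort(umi_counts)[::-1] + 1.0)
    x = np.log10(np.arange(1, len(y) + 1, dtype=float))
    p0 = np.array([x[0], y[0]])
    d = np.array([x[-1], y[-1]]) - p0
    d /= np.linalg.norm(d)
    # Perpendicular distance of every point from the chord (2-D cross product).
    dist = np.abs((x - p0[0]) * d[1] - (y - p0[1]) * d[0])
    return int(np.argmax(dist))

# Toy "easy" dataset: a sharp drop right after 2000 cell-containing barcodes.
counts = np.concatenate([np.full(2000, 5000.0), np.full(13000, 50.0)])
print(knee_rank(counts))  # 1999, i.e. the knee sits at the 2000-cell boundary
```

On a curve like the "hard" example, the maximum distance is small and poorly localized, which is exactly the diagnosis above.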

sjfleming commented 4 years ago

By the way, the latent space for that "hard" example above looks like this: (image: latent space plot)

To me, your latent space being a sort of formless blob seems to be another possible hint that there may be some kind of a quality control failure. It's not definitive of course... for some experiments without much cellular diversity, the latent space will tend to just look like a blob.

Munfred commented 4 years ago

Yes, I do believe there is something funky with the data I'm working with. It is a single-nuclei dataset of C. elegans, generated with a new dissociation protocol. It seems to have a lot of background (which is odd, since nuclei are washed).

Thanks for the help. I'll give CellBender another shot next time I have a less problematic dataset.