Hi @Munfred, yes, I would suspect that this is an artifact of overfitting. Although we're using an unsupervised approach with priors on latent variables that should not be overly susceptible to overfitting, it is still not advisable to train for too many epochs. I have never tested with nearly that many epochs. I typically use something in the range of 150-300. I have never surpassed 300, and I would suggest that you don't either.
I see that on some of the datasets, it looks like 5000 expected cells might be a little bit high. Ideally, you wouldn't need to fine-tune this parameter... however, it is used to obtain a prior on the expected number of counts in real cells. If there are only a few hundred real cells, but you say that expected cells is 5000, then the algorithm will think that the expected number of UMIs per cell is quite a bit lower than it is in reality. This can cause difficulties in learning cell probabilities accurately. I wonder if, in line with the CellRanger results, a value for --expected-cells of 1000 might be more appropriate for most datasets. Especially the dataset that dips down near 1000 on the x-axis (blue-ish line)... that one might even be in the hundreds.
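Just to make that concrete, here is a rough toy sketch in Python (not the actual remove-background code) of why overstating expected cells drags the prior on counts per cell downward:

```python
import numpy as np

# Hypothetical illustration only, not CellBender's implementation:
# the prior on UMIs per real cell is anchored to the counts of the
# top `expected_cells` barcodes on the barcode-rank curve.
def prior_counts_per_cell(umi_per_barcode: np.ndarray, expected_cells: int) -> float:
    """Median UMI count among the top `expected_cells` barcodes."""
    sorted_counts = np.sort(umi_per_barcode)[::-1]             # descending rank order
    top = sorted_counts[: min(expected_cells, len(sorted_counts))]
    return float(np.median(top))

# Toy dataset: ~800 real cells (~5000 UMIs each) plus many empty droplets (~100 UMIs).
rng = np.random.default_rng(0)
counts = np.concatenate([
    rng.poisson(5000, size=800),     # real cells
    rng.poisson(100, size=14200),    # ambient / empty droplets
])
print(prior_counts_per_cell(counts, expected_cells=1000))    # close to the true ~5000
print(prior_counts_per_cell(counts, expected_cells=5000))    # dragged down toward ~100
```

If the expected-cells guess is several times larger than the true number of cells, the prior lands near the ambient counts rather than the cell counts, which is exactly the failure mode described above.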
Also, if you're using the current v0.1.0 version, then you can hold out some test data by reaching into the code and changing the line that sets the fraction of data used for training. That might be a way to double-check whether overfitting is an issue. A value of 0.9 would mean that 10% of the dataset will be held out as test data. The learning curve will be plotted in the PDF output automatically.
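For anyone reading along, that 0.9 is just an ordinary random train/test split over droplets; a minimal sketch of the idea (with made-up names, not the actual v0.1.0 code) would be:

```python
import numpy as np

def train_test_split_barcodes(num_barcodes: int, training_fraction: float = 0.9, seed: int = 0):
    """Randomly assign barcode indices to a training set and a held-out test set.

    With training_fraction = 0.9, 10% of the droplets are held out so the
    test loss can be tracked alongside the training loss every epoch.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_barcodes)
    n_train = int(round(training_fraction * num_barcodes))
    return order[:n_train], order[n_train:]    # (train indices, test indices)

train_idx, test_idx = train_test_split_barcodes(15000, training_fraction=0.9)
print(len(train_idx), len(test_idx))           # 13500 1500
```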
Awesome, thanks so much! I'll try it out with the held-out data and see how it goes.
I'd suggest making 10% held-out data the default. Thanks!
Hello, I tried again with expected-cells = 1000, total-droplets-included = 15000, and epochs = 300, plus an 80% train / 20% test split, but it still predicts 15k cells for several datasets. From the train/test plots (I'm attaching some below), it is clear that it is overfitting. Perhaps some kind of automatic early stopping would be the best way to deal with this, since otherwise I guess most people will end up tuning parameters by hand (a rough sketch of the idea is below, after the counts). I'm giving it one last shot with 75 epochs of training, since that seems to be about as good as it gets for most datasets.
Barcodes retained per dataset (300 epochs): 15000, 15000, 15000, 15000, 5691, 5246, 8849, 10143, 10705, 4903, 6656, 15000, 6948, 7686, 6015, 5055, 5665, 15000, 5436, 8442, 9271
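For concreteness, the kind of early-stopping loop I have in mind is something like this (a generic sketch with placeholder training/evaluation hooks, not actual CellBender code):

```python
def train_with_early_stopping(train_one_epoch, evaluate_test_loss,
                              max_epochs: int = 300, patience: int = 10) -> float:
    """Stop training once the held-out loss stops improving.

    `train_one_epoch` and `evaluate_test_loss` are placeholders for whatever
    the training framework provides; `patience` is how many epochs without
    improvement to tolerate before stopping.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        test_loss = evaluate_test_loss()
        if test_loss < best_loss:
            best_loss = test_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Stopping at epoch {epoch}: no test-loss improvement for {patience} epochs")
                break
    return best_loss
```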
Hmm, after using only 75 training epochs the results are similar. I'm attaching a sample of the output PDFs in case they are of interest. At this point I don't feel very confident using the CellBender barcodes; I will have to try out some more manual filtering options.
Barcodes retained per dataset (75 epochs): 15000, 15000, 15000, 15000, 5610, 4357, 8574, 9522, 9787, 4759, 5972, 15000, 6899, 5819, 5951, 5035, 5691, 15000, 5414, 8425, 9248
Hi @Munfred, yes, it seems like remove-background is really struggling with your datasets. As you can see by eye, for example in the last dataset you showed, there is really no easy way to guess what is a cell versus what is empty... there is no obvious elbow or knee in that UMI plot whatsoever.
An easy dataset might look like this:
While a "hard" dataset would look like this:
What you have there is very difficult indeed. Are you sure there hasn't been some sort of quality control failure for these samples? It really does not look like you have many cells at all, at least to my eyes.
By the way, the latent space for that "hard" example above looks like this:
To me, your latent space being a sort of formless blob seems to be another possible hint that there may be some kind of a quality control failure. It's not definitive of course... for some experiments without much cellular diversity, the latent space will tend to just look like a blob.
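If it helps, this is the kind of UMI-rank plot I mean — a minimal, generic matplotlib sketch for drawing the curve from an array of per-barcode total counts (not code from CellBender itself):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_barcode_rank(umi_per_barcode: np.ndarray, title: str = "Barcode rank plot") -> None:
    """Log-log plot of total UMI count versus barcode rank.

    An 'easy' dataset shows a sharp elbow separating cells from empty
    droplets; a 'hard' dataset decays smoothly with no obvious knee.
    """
    sorted_counts = np.sort(umi_per_barcode)[::-1]
    ranks = np.arange(1, len(sorted_counts) + 1)
    plt.loglog(ranks, sorted_counts)
    plt.xlabel("Barcode rank")
    plt.ylabel("UMI counts per barcode")
    plt.title(title)
    plt.show()
```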
Yes, I do believe there is something funky with the data I'm working with. It is a single-nucleus dataset of C. elegans, generated with a new dissociation protocol. It seems to have a lot of background (which is odd, since the nuclei are washed).
Thanks for the help. I'll give CellBender another shot next time I have a less problematic dataset.
Hello, I just tried out CellBender on 21 datasets of 10x v2 single-nucleus data that should be generally comparable (Cell Ranger retains ~1500-3500 barcodes in each), but I got wildly variable results with CellBender. I used the following parameters: expected-cells = 5000, total-droplets-included = 15000, epochs = 1500. The numbers of barcodes retained by CellBender (from the output.csv file) are below. I'm surprised to see so many datasets with the maximum number of 15k, as well as 2 datasets with 0. Could this be a result of overfitting because I did 1500 epochs? In the output.pdf I noticed that only a train score is provided, so it's hard to diagnose overfitting... Here are the knee plots in case they help diagnose things. All datasets are pretty similar.
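A quick way to tally those retained-barcode numbers, assuming each dataset's output.csv lists one barcode per line (the directory layout and glob pattern here are just placeholders):

```python
import glob

# Count retained barcodes for each dataset's remove-background output CSV.
# Assumes one barcode per line; adjust the glob to match your folder layout.
for path in sorted(glob.glob("*/output.csv")):
    with open(path) as f:
        n_barcodes = sum(1 for line in f if line.strip())
    print(path, n_barcodes)
```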