broadinstitute / CellBender

CellBender is a software package for eliminating technical artifacts from high-throughput single-cell RNA sequencing (scRNA-seq) data.
https://cellbender.rtfd.io
BSD 3-Clause "New" or "Revised" License

Training doesn't converge with 300 epochs #42

Closed · cnk113 closed this 4 years ago

cnk113 commented 5 years ago

Hello,

I've been using CellBender on my datasets, and I noticed that the training loss for one of them exhibits weird behavior. Should I train for more epochs? The cell calls look clean... I've attached the plots. The params are 300 epochs, layer dim 1000, and latent dim 300, with 10,000 expected cells and 40,000 total droplets.
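For reference, the command was roughly the following (input/output paths are placeholders; the output name is guessed from the attached PDF):

```bash
# Roughly the settings described above; paths are placeholders.
cellbender remove-background \
    --input raw_feature_bc_matrix.h5 \
    --output avm049_2_out.h5 \
    --expected-cells 10000 \
    --total-droplets-included 40000 \
    --epochs 300 \
    --z-dim 300 \
    --z-layers 1000 \
    --cuda
```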

Thanks

avm049_2_out.pdf

cnk113 commented 5 years ago

Hmm, so I ran 1000 epochs and the cell calls have definitely shifted... avm049_2_out.pdf

sjfleming commented 5 years ago

Hi @cnk113,

This is a perfect illustration of what can happen if the auto-encoder neural network in CellBender is over-parameterized. That is to say, when the "layer dim" and "latent dim" are too large. The data are unable to fully determine the (approximately 100 million) parameters of such a large network, and so the loss function exhibits this weird (and bad!) behavior.

If you ever see that your loss function is not (almost always) monotonically decreasing, the likely culprit is too large a "layer dim" or "latent dim".

Long story short, I would try --z-dim 20 --z-layers 500 and see what that looks like (example command at the end of this comment). In my experience testing on over 100 datasets, 300 epochs of training has always been enough to be considered converged. Convergence usually happens between 200 and 300 epochs, and the output should not be very sensitive to the exact number of epochs chosen.

The reason you see the cell calls shift when you go out to 1000 epochs (I had never tried so many!) is that such a powerful auto-encoder can re-create (memorize) even the tiny differences between some of the empty droplets. Because it memorizes those details so well, the model decides they are not ambient background but real "cells" (that are in fact empty). This is another consequence of an over-parameterized auto-encoder.

You can probably also use fewer total droplets than 40,000 if you want the algorithm to run faster. It depends on how many droplets you need in order to include about 10,000 "surely empty" droplets (maybe you could do 30,000 in your case...?).
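Putting those suggestions together, something like this (your own input/output paths, of course; the output name here is just illustrative):

```bash
# Suggested retry: smaller network, fewer total droplets.
cellbender remove-background \
    --input raw_feature_bc_matrix.h5 \
    --output avm049_2_retry.h5 \
    --expected-cells 10000 \
    --total-droplets-included 30000 \
    --epochs 300 \
    --z-dim 20 \
    --z-layers 500 \
    --cuda
```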

cnk113 commented 5 years ago

Ah, I see. I believe the paper and the docs state that z-layers of 1000 and a z-dim of 200 should be used for minimal/no imputation. Should I still run it with z-dim 20 and z-layers 500 if I want no imputation?

Thanks

sjfleming commented 5 years ago

Yes, I would definitely try --z-dim 20 and --z-layers 500. That guideline of 200 and 1000 is a ballpark estimate that typically works, but it is on the high end; for some datasets we have found it to be a bit too high. (We may reduce the recommended numbers in the future...) The idea, for the least imputation, is to push z-dim and z-layers as high as you can before the behavior you saw starts to happen.

sjfleming commented 5 years ago

If --z-dim 20 and --z-layers 500 give a learning curve that looks fine, you can definitely try pushing the numbers higher, but stop before you see the pathological learning-curve behavior that indicates a problem.
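For example, one hypothetical way to script that incremental search (the step sizes and output names are just illustrative; inspect each run's output PDF before going bigger):

```bash
# Sketch: re-run remove-background with progressively larger networks,
# checking each learning curve for the pathological non-monotonic loss.
for ZDIM in 20 50 100; do
  cellbender remove-background \
      --input raw_feature_bc_matrix.h5 \
      --output "out_zdim${ZDIM}.h5" \
      --expected-cells 10000 \
      --total-droplets-included 30000 \
      --epochs 300 \
      --z-dim "$ZDIM" \
      --z-layers 500 \
      --cuda
done
```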

sjfleming commented 4 years ago

Just to add: once version 2 of remove-background comes out in a few months, these sorts of problems will no longer exist.

cnk113 commented 4 years ago

Would there be preset parameters for different levels of imputation?

sjfleming commented 4 years ago

In version 2, the model will change so that there is no imputation at all. Choices of z-dim will influence only the prior on cell counts, but will not have as large an effect on the posterior.