broadinstitute / CellBender

CellBender is a software package for eliminating technical artifacts from high-throughput single-cell RNA sequencing (scRNA-seq) data.
https://cellbender.rtfd.io
BSD 3-Clause "New" or "Revised" License

CellBender Remove-Background: Exploring Hyperparameter Effects #283

Open unikill066 opened 11 months ago

unikill066 commented 11 months ago

We've run CellBender's remove-background on 8 samples multiple times, adjusting the hyperparameters each round. Interestingly, the outcomes can look similar across quite different parameter values, yet be quite distinct between runs with similar hyperparameters.

For instance, the default parameters and the "_4's" parameters are similar, yet the resulting outputs are strikingly different. Conversely, the remaining hyperparameter configurations, despite their differences, produce remarkably consistent results.

Similar-params example: cellbender_hyperparameters_testing\0_default_run\4 and cellbender_hyperparameters_testing\4_4; for reference, please see the metadata file for each run.

We are interested in understanding the reason behind this behavior, particularly whether there is a specific mechanism or feature that contributes to reproducibility (such as a random seed)?

I have uploaded all the files to the box.

sjfleming commented 11 months ago

Interesting to see these kinds of tests. I'm glad to hear the

> similar when using different parameter values

part, but less glad to hear

> they can be quite distinct when employing similar hyperparameters

First, how do you make the assessment that "the results appear remarkably consistent"? What are you looking at to determine consistency of the output?


For the specific case you highlighted:

I do think I might know the answer. The original run 0_default_run\4 seemed to have a problem converging on a good solution. I say that because of the learning curve:

[image: learning curve for 0_default_run\4]

The dip at the end is indicative of the learning process being "thrown off" into an undesirable local minimum. This can happen if the learning rate is too large, and stochastic gradient descent sort of throws the solution too far off into a different part of the parameter space that is suboptimal. (As a side note, I think the default learning rate might be a bit too high, and I might reduce the default in the future, see https://github.com/broadinstitute/CellBender/issues/276#issuecomment-1712009700)
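The overshoot effect can be seen even in a toy example. This is not CellBender's actual training loop (which uses a Pyro-based stochastic optimizer on a much harder objective); it is just plain gradient descent on f(x) = x², showing how a step size past the stability threshold throws the iterate away from the minimum instead of toward it:

```python
def gradient_descent(lr, steps=50):
    """Minimize f(x) = x^2 (optimum at x = 0) with a fixed step size."""
    x = 5.0
    for _ in range(steps):
        grad = 2.0 * x       # derivative of x^2
        x = x - lr * grad    # each update multiplies x by (1 - 2*lr)
    return x

# A modest learning rate contracts toward the optimum each step...
x_small = gradient_descent(lr=0.1)
# ...while a too-large one (|1 - 2*lr| > 1) amplifies the error and diverges.
x_large = gradient_descent(lr=1.1)
print(abs(x_small), abs(x_large))
```

In the real model the loss surface is non-convex, so instead of diverging outright, an overly large step can land the solution in a worse local minimum, which is what the late dip in the learning curve suggests.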

On the other hand, the parameters for 4_4 look like they had a reduced --learning-rate (a good idea), and so the solution seems to have converged to a good solution. I say that because it looks like the learning curve is nearly monotonically increasing to a stable value:

[image: learning curve for 4_4]
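For reference, reducing the learning rate is just a CLI flag. This is a sketch, not a recommendation of a specific value: the input/output filenames are placeholders, and the value shown (half-ish of the usual default, which is on the order of 1e-4) is an assumption to tune per dataset:

```shell
# Sketch: re-run remove-background with a reduced learning rate.
# Filenames and the exact rate are placeholders/assumptions.
cellbender remove-background \
    --input raw_feature_bc_matrix.h5 \
    --output output.h5 \
    --learning-rate 5e-5
```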
sjfleming commented 11 months ago

There is a random seed that cellbender uses to make sure that, if you re-run the tool on the same data with the same input parameters, you'll always get the same output: https://github.com/broadinstitute/CellBender/blob/4990df713f296256577c92cab3314daeeca0f3d7/cellbender/remove_background/run.py#L62-L65
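The pattern is the standard one: seed the global RNG before any stochastic work, and identical inputs then yield identical outputs. The linked code seeds torch (the actual mechanism in CellBender); the sketch below illustrates the same idea with Python's stdlib `random` so it runs anywhere:

```python
import random

def seeded_draws(seed, n=5):
    """With a fixed seed, 'random' draws are fully reproducible."""
    rng = random.Random(seed)           # analogous to torch.manual_seed(seed)
    return [rng.random() for _ in range(n)]

# Same seed -> identical sequence; different seed -> different sequence.
print(seeded_draws(0) == seeded_draws(0))  # True
print(seeded_draws(0) == seeded_draws(1))  # False
```

So re-running the tool on the same data with the same parameters is deterministic, but any change to the inputs or hyperparameters changes the trajectory of the (stochastic) optimization.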