Meaning of count removal per gene warning (v0.3.0)

npklein commented 1 year ago

Thanks for developing this tool!

I am looking through the reports for a couple of my samples, and all of them show the following warning:

R2 value for the fit of y=x for removal is 0.xxx
WARNING: This deviates from expectations, and may indicate that the run did not go well

However, I'm not sure what this might indicate (e.g., should I change parameters?), and couldn't find it in your troubleshooting section.

For some samples the other QC plots seem to look good, for example:

R2 value for the fit of y=x for removal is 0.3040
WARNING: This deviates from expectations, and may indicate that the run did not go well

While for others the learning curve is also not looking good (but here the report gives an indication of what to try, i.e. lowering the learning rate)

R2 value for the fit of y=x for removal is 0.5666
WARNING: This deviates from expectations, and may indicate that the run did not go well

Do you have some additional info on this warning?

Thanks!

sjfleming commented 1 year ago

Hi @npklein, yes, I am still working on the automated warnings in the report... I think I am being a bit too aggressive in saying "WARNING: This deviates from expectations ..." Right now I am computing the Pearson R correlation coefficient for that scatter plot, but I might change it to a more robust fit that weights the highly expressed genes more heavily. You can see (and I've seen the same thing) that while those scatterplots are not a perfect y = x line, they are still very correlated. And the idea here is not to get a perfect y = x line. It's just to see whether there is some kind of ballpark correlation between the naive expectation of "removing what's in the empty droplets" and what the tool actually did.
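
To make the distinction above concrete, here is a minimal sketch (not CellBender's actual code) that compares a Pearson correlation with an R2 computed against the fixed line y = x, optionally weighting genes by expression. The function names, the weighting scheme, and the per-gene counts are all assumptions made up for illustration. It shows how points can be perfectly correlated and still score poorly against y = x.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def r2_identity(x, y, weights=None):
    """R2 of y against the fixed line y = x (nothing is fitted).

    `weights` is a hypothetical per-gene weight, e.g. expression level,
    so that highly expressed genes count more. Defaults to equal weights.
    """
    if weights is None:
        weights = [1.0] * len(x)
    wsum = sum(weights)
    ybar = sum(w * b for w, b in zip(weights, y)) / wsum
    ss_res = sum(w * (b - a) ** 2 for w, a, b in zip(weights, x, y))
    ss_tot = sum(w * (b - ybar) ** 2 for w, b in zip(weights, y))
    return 1.0 - ss_res / ss_tot

# Made-up per-gene counts: the naive expectation from empty droplets (x)
# versus what was actually removed (y). Here y is exactly half of x.
expected = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0]
removed = [0.5 * v for v in expected]

r = pearson_r(expected, removed)                       # perfectly correlated
r2 = r2_identity(expected, removed)                    # but far from y = x
r2_w = r2_identity(expected, removed, weights=expected)

print(f"Pearson R = {r:.4f}")
print(f"R2 vs y=x = {r2:.4f}")
print(f"weighted R2 vs y=x = {r2_w:.4f}")
```

In this toy case the Pearson R is 1 while the R2 against y = x is negative, so a low value in the report does not by itself mean the removed counts are uncorrelated with the naive expectation.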

Basically the only corrective action for that plot would be to see if the empty droplets seem to have been identified correctly. There are times when CellBender's automated heuristics can be fooled, and maybe CellBender thinks the wrong part of the UMI curve is empties. In that case, this scatterplot might not look very correlated at all, and it might be a sign that you need to supply --expected-cells or --total-droplets-included input arguments.

As for the learning curves, the first one looks awesome, and the second one looks not awesome. :) I am actively working on trying to come up with ways to prevent that from happening, but yes, right now the best bet is to reduce the learning rate.

sjfleming commented 1 year ago

You can try --learning-rate 1e-5 --epochs 300 if you want. I know that produces good results for some people, though it does take longer to train!
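
As a rough sketch of how the suggestions in this thread could be combined into one invocation, the helper below assembles a `cellbender remove-background` command line. The helper name `build_command` and the input and output file names are placeholders, not anything from this thread; only the flags discussed above are used. It prints the command rather than running it.

```python
import shlex

def build_command(input_path, output_path, learning_rate=None, epochs=None,
                  expected_cells=None, total_droplets_included=None):
    """Assemble a remove-background command, adding only the flags that are set."""
    cmd = ["cellbender", "remove-background",
           "--input", input_path, "--output", output_path]
    optional = [
        ("--learning-rate", learning_rate),
        ("--epochs", epochs),
        ("--expected-cells", expected_cells),
        ("--total-droplets-included", total_droplets_included),
    ]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd

# Placeholder file names; substitute your own paths.
cmd = build_command("raw_feature_bc_matrix.h5", "cellbender_output.h5",
                    learning_rate=1e-5, epochs=300)
print(shlex.join(cmd))
```

If the empty droplets look misidentified, the same helper can be called with `expected_cells` and `total_droplets_included` set to values appropriate for the sample.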

sjfleming commented 1 year ago

Several tweaks have been made very recently that should hopefully clear this up for you in v0.3.0

Potentially closed by #238

broadinstitute / CellBender

Meaning of count removal per gene warning (v0.3.0) #210