Closed lgeistlinger closed 5 months ago
Thanks, this looks promising. From a quick glance at the intro, it sounds like whatever classifier we want to use must produce probabilities for each class, not just the predicted class. That makes sense. Do all the classification methods we are interested in satisfy this requirement?
Great question. Some do (e.g. CellTypist), some don't (e.g. SingleR, which instead produces correlation scores for each class). @andrewGhazi has put some thought into transforming scores to probabilities as part of his calculateCategorizationEntropy function here in the package.
Having spent a few days thinking about it, I think that generally handling the transformation of arbitrary scores into representative probability distributions isn't doable. The entropy function optionally applies a global inverse normal transformation (as of yesterday), then applies a softmax if appropriate, which is better than nothing, but the result isn't necessarily the same distribution as what an expert user of the scores would produce.
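To make the two steps concrete, here is a rough Python sketch of a global rank-based inverse normal transformation followed by a per-cell softmax. It is not the code of calculateCategorizationEntropy: the function names, the (rank - 0.5) / n offset, and the tie handling are all assumptions made for illustration.

```python
import math
from statistics import NormalDist


def inverse_normal_transform(scores):
    """Rank-based inverse normal transform over ALL scores (cells x classes).

    Assumption: ranks are mapped to quantiles with (rank - 0.5) / n.
    Ties are not averaged here; they get consecutive ranks.
    """
    flat = [s for row in scores for s in row]
    n = len(flat)
    order = sorted(range(n), key=lambda i: flat[i])
    z = [0.0] * n
    std_normal = NormalDist()
    for rank, idx in enumerate(order, start=1):
        z[idx] = std_normal.inv_cdf((rank - 0.5) / n)
    k = len(scores[0])
    # Reshape the flat list back into one row per cell.
    return [z[i * k:(i + 1) * k] for i in range(len(scores))]


def softmax(row):
    """Numerically stable softmax for one cell's scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    """Shannon entropy (natural log) of one probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0)


def scores_to_probabilities(scores, use_int=True):
    """Optionally apply the global transform, then softmax each cell."""
    if use_int:
        scores = inverse_normal_transform(scores)
    return [softmax(row) for row in scores]


# Hypothetical scores: cell 0 is clear-cut, cell 1 is ambiguous.
scores = [[0.9, 0.1, 0.2], [0.4, 0.5, 0.45]]
probs = scores_to_probabilities(scores)
entropies = [entropy(p) for p in probs]
```

The transform is monotone, so the top-scoring class per cell is unchanged, but the resulting probabilities depend on the rank of each score among all scores, which is one reason they need not match what an expert would assign.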
Also, I recall seeing a talk on this at posit::conf 2023: Max Kuhn presenting on the probably package: https://probably.tidymodels.org/ Edit: article on regression here: https://www.tidymodels.org/learn/models/conformal-regression/index.html
Has been moved over to https://github.com/ccb-hms/scConform
Both @drisso and @rgentlem have independently pointed me to conformal prediction in the context of cell type annotation as a way to derive uncertainty estimates for the cell type labels coming out of the different classification methods (SingleR, CellTypist, etc). I think we should put some thought into how to provide some functionality here in the package for that as well.
Here is a general introduction to conformal prediction: https://arxiv.org/pdf/2107.07511.pdf
Here is an application of conformal prediction to cell type annotation in single-cell data: https://proceedings.mlr.press/v179/khatri22a/khatri22a.pdf
Here are some more considerations for conformal prediction in an ambiguous ground truth setting, as we also often have for cell type annotation: https://arxiv.org/pdf/2307.09302.pdf
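For orientation, here is a minimal Python sketch of split conformal prediction for classification, using 1 minus the predicted probability of the true class as the conformity score, as I understand it from the general introduction. The function names and the toy calibration data are hypothetical, and this is not taken from any of the packages or notebooks mentioned in this thread.

```python
import math


def conformal_quantile(cal_probs, cal_labels, alpha=0.1):
    """Threshold from held-out calibration cells with known labels.

    Score per cell: 1 - predicted probability of the true class.
    Returns the ceil((n + 1) * (1 - alpha))-th smallest score.
    """
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration cells for this alpha: include every class.
        return float("inf")
    return scores[k - 1]


def prediction_set(probs, qhat):
    """All class indices whose score 1 - p is within the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= qhat]


# Hypothetical two-class calibration set; the true label is always class 0.
true_class_probs = [0.9, 0.8, 0.85, 0.7, 0.6, 0.95, 0.75, 0.5, 0.3]
cal_probs = [[p, 1.0 - p] for p in true_class_probs]
cal_labels = [0] * len(cal_probs)

qhat = conformal_quantile(cal_probs, cal_labels, alpha=0.2)
confident_cell = prediction_set([0.6, 0.4], qhat)
ambiguous_cell = prediction_set([0.5, 0.5], qhat)
```

A set with a single label means the classifier is confident at the chosen level; a set with several labels flags an ambiguous cell. This is also why per-class probabilities, and not just the predicted class, are needed.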
The first paper above has some jupyter notebooks that could be used as a starting point. There are also R packages available in that space that might serve as a starting point:
Maybe also @deepayan and @andrewGhazi have thoughts.