ccb-hms / scDiagnostics

Diagnostic functions to assess the quality of cell type annotations in single-cell RNA-seq data
https://ccb-hms.github.io/scDiagnostics/

Conformal prediction #16

Closed lgeistlinger closed 5 months ago

lgeistlinger commented 1 year ago

Both @drisso and @rgentlem have independently pointed me to conformal prediction in the context of cell type annotation, as a way to derive uncertainty estimates for the cell type labels produced by the different classification methods (SingleR, CellTypist, etc.). I think we should put some thought into how to provide functionality for that here in the package as well.

Some background reading:

  1. A general introduction to conformal prediction: https://arxiv.org/pdf/2107.07511.pdf
  2. An application of conformal prediction to cell type annotation in single-cell data: https://proceedings.mlr.press/v179/khatri22a/khatri22a.pdf
  3. Further considerations for conformal prediction in an ambiguous ground truth setting, which we also often have in cell type annotation: https://arxiv.org/pdf/2307.09302.pdf

The first paper above comes with some Jupyter notebooks that could be used as a starting point. There are also R packages available in this space that we might be able to build on:

  1. conformalClassification: https://cran.r-project.org/web/packages/conformalClassification/index.html
  2. conformal: https://cran.r-project.org/web/packages/conformal/index.html (retired)

Maybe also @deepayan and @andrewGhazi have thoughts.

deepayan commented 1 year ago

Thanks, this looks promising. From a quick glance at the intro, it sounds like whatever classifier we want to use must produce probabilities for each class, not just the predicted class. That makes sense. Do all the classification methods we are interested in satisfy this requirement?
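
To make the probability requirement concrete, here is a minimal sketch of split conformal prediction as I understand it from the intro paper. It is only an illustration under stated assumptions: the function names `conformal_quantile` and `prediction_set` are hypothetical, the classifier output is assumed to be a dict of per-class probabilities for each cell, and the score used (1 minus the probability of the true label) is the simplest choice, not necessarily what we would end up using.

```python
import math


def conformal_quantile(cal_probs, cal_labels, alpha=0.1):
    """Calibrate a score threshold on held-out cells with known labels.

    cal_probs: one dict per cell mapping cell type -> predicted probability.
    cal_labels: the true cell type of each calibration cell.
    (Both names and the input layout are assumptions for this sketch.)
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must be strictly between 0 and 1")
    # Nonconformity score: 1 minus the probability given to the true label.
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank, ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    # With too few calibration cells the threshold is infinite,
    # so every label ends up in the prediction set.
    return scores[k - 1] if k <= n else math.inf


def prediction_set(probs, qhat):
    """Return every label whose nonconformity score is within the threshold."""
    return [label for label, p in probs.items() if 1.0 - p <= qhat]


# Toy calibration data (made up purely for illustration).
cal_probs = [
    {"T cell": 0.9, "B cell": 0.1},
    {"T cell": 0.6, "B cell": 0.4},
    {"T cell": 0.2, "B cell": 0.8},
    {"T cell": 0.5, "B cell": 0.5},
]
cal_labels = ["T cell", "T cell", "B cell", "B cell"]
qhat = conformal_quantile(cal_probs, cal_labels, alpha=0.5)
print(prediction_set({"T cell": 0.7, "B cell": 0.3}, qhat))
```

The point of the sketch is that both the calibration step and the set construction read off a probability per class, which is why a classifier that only returns the predicted label would not be enough.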

lgeistlinger commented 1 year ago

Great question. Some do (e.g., CellTypist), some don't (e.g., SingleR, which instead produces correlation scores for each class). @andrewGhazi has put some thought into transforming scores to probabilities as part of his calculateCategorizationEntropy function here in the package.

andrewGhazi commented 1 year ago

Having spent a few days thinking about it, I don't think that transforming arbitrary scores into representative probability distributions is doable in general. The entropy function optionally applies a global inverse normal transformation (as of yesterday), then applies a softmax if appropriate. That is better than nothing, but the result isn't necessarily the same distribution that an expert user of the scores would produce.
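
For anyone following along, here is a rough sketch of the idea (rank-based inverse normal transformation, then a per-cell softmax, then entropy). It is not the package code: the function names are hypothetical, "global" is assumed to mean that all scores in the cell-by-class matrix are pooled before ranking, ties are not averaged, and the score matrix is assumed to be rectangular.

```python
import math
from statistics import NormalDist


def inverse_normal_transform(scores):
    """Rank-based inverse normal transform over all scores pooled together."""
    flat = [s for row in scores for s in row]
    n = len(flat)
    order = sorted(range(n), key=lambda i: flat[i])
    z = [0.0] * n
    std_normal = NormalDist()
    for rank, i in enumerate(order, start=1):
        # Map rank r to the normal quantile at (r - 0.5) / n.
        z[i] = std_normal.inv_cdf((rank - 0.5) / n)
    # Restore the original cells-by-classes layout.
    width = len(scores[0])
    return [z[j:j + width] for j in range(0, n, width)]


def softmax(row):
    """Turn one cell's transformed scores into values that sum to 1."""
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (natural log) of one cell's class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Two cells, two classes; correlation-like scores made up for illustration.
scores = [[0.1, 0.9], [0.5, 0.3]]
probs = [softmax(row) for row in inverse_normal_transform(scores)]
print([entropy(p) for p in probs])
```

The softmax output sums to 1 and preserves the ranking of classes within a cell, but nothing here guarantees that the values are calibrated probabilities, which is the limitation described above.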

andrewGhazi commented 1 year ago

Also, I recall seeing a talk on this at posit::conf 2023: Max Kuhn presenting on the probably package: https://probably.tidymodels.org/

Edit: article on conformal regression here: https://www.tidymodels.org/learn/models/conformal-regression/index.html

lgeistlinger commented 5 months ago

This has been moved over to https://github.com/ccb-hms/scConform