Teichlab / celltypist

A tool for semi-automatic cell type classification
https://www.celltypist.org/
MIT License

Feature Request: Store freqs in annotation when majority_vote set to True #99

Closed eroell closed 7 months ago

eroell commented 7 months ago

Hey,

thanks for the work on this nice tool!

Regarding the annotation procedure:

Current state: The probability_matrix of celltypist.classifier.AnnotationResult contains the per-cell class probabilities, which I think come from the logistic regression's softmax output.

This is an interesting quantity also with respect to e.g. calibration.

Enhancement request: When annotating with majority_vote=True, I believe the probability_matrix of celltypist.classifier.AnnotationResult still reports these same probabilities. Here, however, another interesting uncertainty estimate would be the freq of each label within each cluster, especially as this is also the uncertainty estimate used for the 'Heterogeneous' assignment.

Again, for e.g. calibration, this uncertainty would be an interesting quantity.

Would it be an option to return the freq of each cell type when over_clustering=True? E.g. as additional columns in the probability_matrix of celltypist.classifier.AnnotationResult, with a prefix such as "majority_voting"?

Best,

ChuanXu1 commented 7 months ago

@eroell, I believe the feature you requested is cell type- or cluster-level metadata, rather than cell-level information. You can use the two columns from celltypist.classifier.AnnotationResult.predicted_labels (over_clustering and predicted_labels) to get the freq of each cluster.
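For readers who want to try this, here is a minimal sketch of the suggestion above. It does not call celltypist itself: the DataFrame below is a toy stand-in (made-up cell names, labels and clusters) for AnnotationResult.predicted_labels, and only the two column names mentioned in the comment are assumed to be present.

```python
import pandas as pd

# Toy stand-in for AnnotationResult.predicted_labels (values are made up).
predicted_labels = pd.DataFrame(
    {
        "predicted_labels": ["T cell", "T cell", "B cell", "B cell", "B cell", "NK cell"],
        "over_clustering": ["0", "0", "0", "1", "1", "1"],
    },
    index=[f"cell{i}" for i in range(6)],
)

# Rows are clusters, columns are cell types; each value is the fraction
# of cells in that cluster that received that predicted label.
cluster_freqs = pd.crosstab(
    predicted_labels["over_clustering"],
    predicted_labels["predicted_labels"],
    normalize="index",
)

print(cluster_freqs)
```

Each row of cluster_freqs sums to 1, and the largest entry in a row is the frequency of that cluster's dominant label.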

eroell commented 7 months ago

Hey, thanks for getting back to this so quickly!
Yes, this is indeed cluster-level information (I don't think it is cell type information?).

Absolutely, these two columns from celltypist.classifier.AnnotationResult.predicted_labels can be used for that.

My line of thought here was that, just as the default annotation scheme gives labels plus probabilities, the majority vote scheme could give labels plus “probabilities”. Yes, these “probabilities” (being frequencies) are cluster-level information, but so are the majority vote labels, which are also reported at cell level.

As you said, this can be done in a few lines, so the enhancement would be pure convenience.
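A rough sketch of what those few lines could look like, under stated assumptions: add_majority_voting_freqs is a hypothetical helper (not part of celltypist), the "majority_voting_" column prefix follows the suggestion in the original request, and both DataFrames are toy stand-ins for AnnotationResult.predicted_labels and AnnotationResult.probability_matrix.

```python
import pandas as pd


def add_majority_voting_freqs(predicted_labels, probability_matrix,
                              prefix="majority_voting_"):
    """Hypothetical helper: append per-cluster label frequencies,
    repeated for every cell of the cluster, as prefixed columns."""
    # Cluster-by-cell-type table of label fractions (rows sum to 1).
    freqs = pd.crosstab(
        predicted_labels["over_clustering"],
        predicted_labels["predicted_labels"],
        normalize="index",
    )
    # Align with the cell types of the probability matrix; labels never
    # predicted in any cell get a frequency of 0.
    freqs = freqs.reindex(columns=probability_matrix.columns, fill_value=0.0)
    # Look up each cell's cluster row, then re-index by cell name.
    per_cell = freqs.loc[predicted_labels["over_clustering"].to_numpy()]
    per_cell.index = predicted_labels.index
    return probability_matrix.join(per_cell.add_prefix(prefix))


# Toy data (made up) mimicking the two AnnotationResult tables.
cells = [f"cell{i}" for i in range(6)]
predicted_labels = pd.DataFrame(
    {
        "predicted_labels": ["T cell", "T cell", "B cell", "B cell", "B cell", "NK cell"],
        "over_clustering": ["0", "0", "0", "1", "1", "1"],
    },
    index=cells,
)
probability_matrix = pd.DataFrame(
    [[0.1, 0.1, 0.8], [0.2, 0.1, 0.7], [0.6, 0.1, 0.3],
     [0.9, 0.05, 0.05], [0.7, 0.2, 0.1], [0.3, 0.6, 0.1]],
    index=cells,
    columns=["B cell", "NK cell", "T cell"],
)

augmented = add_majority_voting_freqs(predicted_labels, probability_matrix)
print(augmented.columns.tolist())
```

The original probability columns are left untouched, so the per-cell probabilities and the cluster-level frequencies can be compared side by side, e.g. for calibration.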

So in case it seems other people are not interested in this, feel free to close the issue! :)