Teichlab / celltypist

A tool for semi-automatic cell type classification
https://www.celltypist.org/
MIT License

Very low confidence score even though labels are correct #118

Closed · mousepixels closed this issue 5 months ago

mousepixels commented 5 months ago

I am about to potentially recommend this tool to thousands of people (@sanbomics) but I need clarification on one issue I have been having.

I made a custom model:

```python
ref_model = celltypist.train(rdata, labels='CellType', n_jobs=22, use_SGD=True,
                             max_iter=100, feature_selection=True, top_genes=300)
```

And I annotated like this:

```python
predictions = celltypist.annotate(adata, model=ref_model, majority_voting=False)
predictions_adata = predictions.to_adata()
adata.obs["ref_label"] = predictions_adata.obs.loc[adata.obs.index, "predicted_labels"]
adata.obs["ref_score"] = predictions_adata.obs.loc[adata.obs.index, "conf_score"]
```

I am getting crazy low conf_score values for most cells. Like 3e-249 or just 0.

However, I know for sure that many of the transferred labels are correct. And the cells project into similar space (if the probabilities really were close to 0, what are the chances the labels wouldn't just be distributed randomly?). See the attached UMAPs for example.

I KNOW the projected labels are accurate for many of these, e.g., T cells, NK, B, etc. But the confidence score is 0?

I really like your tool, but I need to find an explanation for this before I feel comfortable recommending it.

Thanks for any help!!

[UMAP images attached]

ChuanXu1 commented 5 months ago

@mousepixels, I noticed you used SGD for training. SGD is faster than other solvers, but it sometimes needs more parameter tuning (e.g., train/test-split-based cross validation to find the best parameter combination); otherwise the probability/confidence score will be biased in some cases, ending up at either ~0 or ~1.
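For what it's worth, the exact magnitudes reported above (3e-249, exact 0) are consistent with plain floating-point underflow of a logistic probability. This is a sketch only, and it assumes the confidence score is the logistic transform of a (mis-calibrated, large-magnitude) decision value — I haven't verified that against the celltypist source:

```python
import math

def sigmoid(z):
    """Numerically stable logistic function; for very negative z it underflows toward 0."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# A poorly calibrated model can emit decision values in the hundreds.
print(sigmoid(-570))   # on the order of 1e-248 — same ballpark as the reported 3e-249
print(sigmoid(-760))   # below the smallest subnormal double: underflows to exactly 0.0
```

So a score of exactly 0 doesn't mean the label is random; it means the decision value was pushed so far out that the probability can no longer be represented in double precision.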

If you need a quick model without spending time tuning the parameters, you can turn off SGD with use_SGD = False to get a model that is slower to train but more interpretable. Note that the first round of training for selecting top features (when feature_selection = True) is always SGD-based.

mousepixels commented 5 months ago

Amazing. That solved the issue completely and really didn't add more than like 20 more seconds to training. Thanks!