Teichlab / celltypist

A tool for semi-automatic cell type classification
https://www.celltypist.org/
MIT License

celltypist before/after batch correction #119

Open malonzm1 opened 6 months ago

malonzm1 commented 6 months ago

Hi,

I perform batch correction using scVI, but I run the CellTypist prediction before batch correction. Is it better to run CellTypist after batch correction, or does it not matter?

Good day.

ChuanXu1 commented 6 months ago

@malonzm1, predicted_labels depends only on the gene expression matrix, but majority_voting will be influenced by the neighborhood graph if that graph is constructed from the scVI latent space.

malonzm1 commented 6 months ago

Thanks!

malonzm1 commented 6 months ago

Is majority_voting more reliable if celltypist is run after batch correction?

ChuanXu1 commented 6 months ago

@malonzm1, it depends, but majority_voting is usually more readable.

smallsmalltown commented 5 months ago

@ChuanXu1 Based on what you've described, it seems that batch effects do not affect predicted_labels but can influence the majority_voting results. Is that right? After applying Harmony to remove batch effects, my data also triggered the message "Invalid expression matrix in .X, expect log1p normalized expression to 10000 counts per cell; will use .raw.X instead."

ChuanXu1 commented 5 months ago

@smallsmalltown, as I remember, Harmony does not change the expression values; it only produces a corrected latent space. To predict your data using CellTypist, you need to provide normalized gene expression (log1p-normalized to 10000 counts per cell) in either .X or .raw.X.
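
For anyone hitting the same message, here is a minimal sketch of what "log1p normalized expression to 10000 counts per cell" means, written in plain Python on a toy matrix. The function names (`normalize_log1p`, `looks_log1p_normalized`) are hypothetical and used only for illustration. This is not CellTypist's own validation code, just a rough check under the assumption that undoing log1p should recover about 10000 counts per cell. In scanpy the same transformation is usually done with `sc.pp.normalize_total(adata, target_sum=1e4)` followed by `sc.pp.log1p(adata)`.

```python
import math

def normalize_log1p(counts, target_sum=10_000):
    """Scale each cell (row) to target_sum total counts, then apply log1p."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            # Empty cells cannot be rescaled; keep them as all zeros.
            normalized.append([0.0 for _ in cell])
            continue
        scale = target_sum / total
        normalized.append([math.log1p(value * scale) for value in cell])
    return normalized

def looks_log1p_normalized(matrix, target_sum=10_000, rel_tol=1e-3):
    """Return True if undoing log1p gives about target_sum counts in every cell."""
    return all(
        math.isclose(sum(math.expm1(v) for v in cell), target_sum, rel_tol=rel_tol)
        for cell in matrix
    )

raw_counts = [[3, 0, 7, 10], [1, 1, 0, 2]]  # toy cells x genes matrix
log_norm = normalize_log1p(raw_counts)

print(looks_log1p_normalized(log_norm))    # check the normalized matrix
print(looks_log1p_normalized(raw_counts))  # check the raw counts
```

The normalized matrix passes the check and the raw counts do not, which is the kind of mismatch the message above is reporting for .X.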

Flu09 commented 3 months ago

@ChuanXu1 Can you explain more about the latent space idea and Harmony? If I integrate using Harmony in R, convert my object to h5ad, and then provide CellTypist with its normalized .X, which would be better: predicted_labels or majority_voting? Will CellTypist use the latent space of the samples at all?

ChuanXu1 commented 3 months ago

@Flu09, CellTypist does not use the latent space to predict cell types; that is, predicted_labels is independent of the latent space. The majority_voting result, however, may be affected by the latent space, because majority voting relies on the clustering, which is influenced by the latent space.
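
To make the distinction concrete, here is a toy sketch of majority voting within clusters in plain Python. It is a simplified illustration, not CellTypist's actual implementation: the function `majority_vote` and the two cluster assignments are hypothetical, and the assumption is simply that each cell receives the most common per-cell label in its cluster.

```python
from collections import Counter

def majority_vote(predicted_labels, clusters):
    """Give every cell the most common predicted label within its cluster."""
    labels_by_cluster = {}
    for label, cluster in zip(predicted_labels, clusters):
        labels_by_cluster.setdefault(cluster, []).append(label)
    winner = {
        cluster: Counter(labels).most_common(1)[0][0]
        for cluster, labels in labels_by_cluster.items()
    }
    return [winner[cluster] for cluster in clusters]

# Per-cell predictions come from the expression matrix only,
# so they are identical in both scenarios below.
predicted = ["T cell", "T cell", "B cell", "B cell", "B cell", "T cell"]

# Hypothetical clusterings from two different neighborhood graphs.
clusters_uncorrected = [0, 0, 0, 1, 1, 1]
clusters_corrected = [0, 0, 1, 1, 1, 0]

print(majority_vote(predicted, clusters_uncorrected))
print(majority_vote(predicted, clusters_corrected))
```

With the first clustering, the third and sixth cells are outvoted by their cluster and their labels are overwritten. With the second, every cluster is homogeneous and the voted labels match the per-cell predictions. The per-cell predictions never change; only the clustering does.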

Flu09 commented 3 months ago

I see, thank you. But if I combine two studies and the overall counts in one study are lower than in the other, should the CellTypist annotation be done on each study separately?

ChuanXu1 commented 3 months ago

@Flu09, it's safer to do this for each dataset separately to ensure sufficient gene overlap between your data and the model used.
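
A quick way to check gene overlap before deciding is sketched below in plain Python. The gene lists and the helper `gene_overlap` are made up for illustration, and the merged case assumes the two studies were combined by keeping only the genes present in both, which may not match how your object was built.

```python
def gene_overlap(dataset_genes, model_genes):
    """Return (number of shared genes, fraction of model genes covered)."""
    dataset = set(dataset_genes)
    model = set(model_genes)
    if not model:
        return 0, 0.0
    shared = dataset & model
    return len(shared), len(shared) / len(model)

# Hypothetical gene lists, for illustration only.
model_genes = ["CD3E", "CD19", "MS4A1", "NKG7"]
study_a = ["CD3E", "CD19", "MS4A1", "GAPDH"]
study_b = ["CD3E", "GAPDH"]

# Assumption: merging keeps only genes found in both studies.
merged = sorted(set(study_a) & set(study_b))

print(gene_overlap(study_a, model_genes))
print(gene_overlap(study_b, model_genes))
print(gene_overlap(merged, model_genes))
```

In this toy case the merged object covers no more of the model's genes than the sparser study does, so the richer study loses coverage it would have had when annotated separately.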