Teichlab / celltypist

A tool for semi-automatic cell type classification
https://www.celltypist.org/
MIT License

Can I use log2(TPM) normalized data as input instead of log1p w/ 10K scaling factor? #46

Closed antonioggsousa closed 1 year ago

antonioggsousa commented 1 year ago

Hi,

Thank you for developing and maintaining CellTypist. It is a great tool and makes my life much easier.

I'm analyzing log2(TPM) normalized data from a publicly available data set, and I was wondering whether I could provide this to CellTypist or whether doing so would violate any of its assumptions.

I know that the software requires log1p (w/ 10K scaling factor) normalized data as input, but for this particular data set I don't have access to the count data to normalize it myself, and I would still like to run CellTypist.

I know that CellTypist gives an error and exits when such data is not provided, but if I comment out that line of code, will the CellTypist assumptions still be valid?

Thanks in advance for any help or advice. Best regards, António

ChuanXu1 commented 1 year ago

@antonioggsousa, you can NOT provide log2(TPM) to CellTypist for prediction, because the model is trained with log1p(TPM/100) data. Assuming you have a sparse matrix in .X that holds log2(TPM + 1) values (TPM being per million), you can convert it like this:

adata.X = (np.exp2(adata.X.toarray()) - 1) / 1000000 followed by sc.pp.normalize_total(adata, target_sum = 1e4) and sc.pp.log1p(adata)
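The conversion above can be sketched with plain NumPy to make the arithmetic explicit. This is only an illustration under stated assumptions: the helper name `log2_tpm_to_celltypist_input` and the toy matrix are hypothetical, the input is assumed to be a dense cells x genes array of log2(TPM + 1) values (a sparse `.X` would first need `.toarray()`, as in the line above), and the per-cell rescaling is written out by hand to mirror what `sc.pp.normalize_total(adata, target_sum = 1e4)` followed by `sc.pp.log1p(adata)` is expected to do.

```python
import numpy as np

def log2_tpm_to_celltypist_input(log2_tpm_plus_1, target_sum=1e4):
    """Hypothetical helper: log2(TPM + 1) -> log1p(counts per target_sum)."""
    # Undo the log2(x + 1) transform to recover TPM-like values.
    tpm = np.exp2(log2_tpm_plus_1) - 1.0
    # Rescale each cell (row) so that its values sum to target_sum.
    totals = tpm.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # guard against division by zero for empty cells
    scaled = tpm / totals * target_sum
    # Natural log of (x + 1).
    return np.log1p(scaled)

# Toy data (assumption): two cells, three genes, each cell summing to 1e6.
tpm = np.array([[500000.0, 300000.0, 200000.0],
                [0.0, 1000000.0, 0.0]])
log2_tpm = np.log2(tpm + 1.0)

converted = log2_tpm_to_celltypist_input(log2_tpm)
```

Because each cell is rescaled to a fixed total, the constant division by 1000000 in the original line does not change the final result. For true TPM values the output equals log1p(TPM/100), the scale the model is said to be trained on.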

antonioggsousa commented 1 year ago

Dear @ChuanXu1,

Thank you for your quick reply.

Your suggestion sounds just great (I completely forgot that I could reverse the normalized counts).

Just a final question/doubt about what you mention:

the model is trained with log1p(TPM/100) data

I read the paper and the methods section about CellTypist a long time ago and I don't remember all the details, but I remember that the coefficients (mean and SD) from the reference are used to scale the features/genes of the query before prediction. If so, doesn't this require that the model used for prediction was trained with scaled data?

Sorry if my question sounds silly. Thanks once again for your help and for developing/maintaining CellTypist. Best regards, António

ChuanXu1 commented 1 year ago

@antonioggsousa, the model was trained with log1p(TPM/100) data - scaling was performed internally and mean and SD were recorded at that time.
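The idea of recording the mean and SD at training time and reusing them on the query can be illustrated with a short NumPy sketch. It is not CellTypist's actual implementation; the matrices and variable names below are made up for illustration, and both are assumed to be on the same log1p-normalized scale.

```python
import numpy as np

# Hypothetical training matrix (cells x genes), already log1p-normalized.
train = np.array([[0.0, 1.0, 2.0],
                  [2.0, 3.0, 2.5],
                  [4.0, 5.0, 3.5]])

# At training time: record the per-gene mean and SD, then scale the data.
gene_mean = train.mean(axis=0)
gene_sd = train.std(axis=0)
train_scaled = (train - gene_mean) / gene_sd

# At prediction time: reuse the recorded statistics on the query cells.
query = np.array([[2.0, 3.0, 2.0]])
query_scaled = (query - gene_mean) / gene_sd
```

The recorded statistics only make sense if the query is on the same scale as the training data, which is why log2(TPM) input would be shifted and stretched incorrectly.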

antonioggsousa commented 1 year ago

Thank you @ChuanXu1