CSOgroup / cellcharter

A Python package for the identification, characterization and comparison of spatial clusters from spatial -omics data.
https://cellcharter.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Information on dimensionality reduction and scaling #54

Closed LukasHats closed 1 week ago

LukasHats commented 2 weeks ago

Description of feature

Dear @marcovarrone,

Thanks for providing CellCharter! Concerning the CODEX tutorial, could you suggest which dimensionality reduction method should be used: is a standard sc.pp.pca enough, or is trVAE recommended (which seems to be a bit deprecated; at least the repository points towards scArches)? Further, why do you scale per image? Is this necessary here, or generally necessary for marker-based neighborhood detection? What would happen if raw values or e.g. z-score normalized values were used?

Thanks!

marcovarrone commented 1 week ago

Hi @LukasHats, good question! Based on some quick tests done on scRNA-seq, PCA was working but more complex and informative embeddings were improving the results quite a bit.

The original trVAE repository is deprecated, but not its implementation inside scArches, and we are using that one as you can see here.

Scaling is important when you use PCA or trVAE to make sure that every feature is weighted equally. I would not suggest scaling per sample anymore, so if you scale, do it across all samples.

Remember that if you don't have batch effects and you don't have 100s and 100s of markers, you could also try running CellCharter without dimensionality reduction!
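To make the "scale across all samples" advice concrete, here is a minimal NumPy sketch of the transform (toy data, purely illustrative); `sc.pp.scale` on the complete AnnData performs the equivalent zero-mean/unit-variance scaling per marker:

```python
import numpy as np

# Toy expression matrix: 6 cells x 3 markers on very different scales.
# (Hypothetical values; in practice this would be adata.X for the full dataset.)
rng = np.random.default_rng(0)
X = rng.normal(loc=[1.0, 10.0, 100.0], scale=[0.5, 2.0, 20.0], size=(6, 3))

# Scale ACROSS ALL SAMPLES at once (what sc.pp.scale(adata) does),
# rather than per image: zero mean, unit variance for each marker,
# so no single marker dominates PCA or trVAE just because of its units.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # ~0 per marker
print(X_scaled.std(axis=0))   # ~1 per marker
```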

LukasHats commented 1 week ago

Hey @marcovarrone,

thanks a lot for the quick answer and the insights, much appreciated! Four follow-up questions:

1. Would scaling still be used if no dimensionality reduction is performed?
2. If batch effects are present, we address them only via the dimensionality reduction, right?
3. Would you share your opinion on this approach: for IMC images, people often use z-score normalized marker expressions to remove batch effects etc. Would it be an idea to put z-score values into CellCharter without dimensionality reduction?
4. If I understood you correctly, "scale across all samples" just means running sc.pp.scale on the complete adata, as classically performed in scanpy?

Thanks a lot for help, excited to run it soon!

marcovarrone commented 1 week ago

Hi @LukasHats,

  1. In theory it shouldn't be necessary. The clustering is done using a Gaussian Mixture Model, which shouldn't be sensitive to different scales of the features because it fits a mean and a variance value for each feature. However, sometimes I obtained slightly different results with and without scaling, but I never had the time to understand if this was because of the randomness that all spatial domain identification methods have (including CellCharter).
  2. If you are interested in protein-based experiments, there is no well-established method for batch correction. For now, the best approach I found is to use trVAE with/without batch correction. As long as a batch correction method returns embeddings, it should be usable as a batch correction method for CellCharter.
  3. I am not an expert in IMC. I would say intuitively that z-score normalization across the whole dataset doesn't correct any batch effect, because it doesn't change any relationship between features from different samples/batches. If you do z-score normalization for each sample/batch, you will for sure remove batch effects, but it also risks removing a lot of biological variation if the cell populations are very different between batches. That's why I discouraged using it. In our CODEX samples, the populations were not that different between images, so it wasn't a problem (I would still not do it in hindsight).
  4. Yes, correct!
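To illustrate point 3, a toy NumPy example (synthetic numbers, purely hypothetical) of why global z-scoring preserves differences between batches while per-batch z-scoring removes them, biology included:

```python
import numpy as np

# One marker measured in two batches; batch B is shifted by +10
# (which could be a batch effect OR a real biological difference).
batch_a = np.array([1.0, 2.0, 3.0])
batch_b = np.array([11.0, 12.0, 13.0])
x = np.concatenate([batch_a, batch_b])

# Global z-score: relationships between batches are preserved,
# so the shift is still clearly visible after scaling.
z_global = (x - x.mean()) / x.std()
print(z_global[3:].mean() - z_global[:3].mean())  # large gap remains

# Per-batch z-score: each batch is centered and scaled separately,
# so the shift vanishes entirely, real biology along with it.
z_a = (batch_a - batch_a.mean()) / batch_a.std()
z_b = (batch_b - batch_b.mean()) / batch_b.std()
print(z_b.mean() - z_a.mean())  # gap is now 0
```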

Let me know what you think about it :)

LukasHats commented 1 week ago

Thanks so much, also for the excursus on batch effect removal. Will close for now and reopen if I encounter problems.