HelenaLC / CATALYST

Cytometry dATa anALYsis Tools

Heatmaps normalisation #245

Closed algebio closed 2 years ago

algebio commented 2 years ago

Hello CATALYST team

This is not an issue but just a question to try to understand heatmaps. I think your input will greatly help me (and the community using CATALYST), but please delete this question if you consider that this is a space for package issues and not general questions.

I'm trying to understand why the first heatmap uses the scaled median expression, the second the median scaled expression, the third the normalized frequency (the same for the DA heatmap), and the DS heatmap the z-normalized expression.

I have a special interest in the z-normalized expression because I have been asked to produce every heatmap using the z-score. First, how could I do it? Second, is there any reason why I shouldn't do it in every heatmap? I guess there is one, since you don't do it, but I need to know it if I'm not going to z-score every heatmap.

I know I say this very often but, thanks for this amazing tool, I am a big fan of CATALYST.

Regards Juan

SamGG commented 2 years ago

Hi,

Just my point of view. I think cytoforum would be a better place, as once the issue is closed, the answer is usually forgotten (not searched). Feel free to copy your question to cytoforum, and you will get an interesting range of answers showing the diversity of views.

I think your question is of general interest. Even among "bioinformaticians", the shortcut is to apply a z-score. And I would provocatively ask: row or column z-score? In fact, z-score is a no-brainer solution for differential expression. That's why you were asked to do it, IMHO. But z-score is also inaccurate depending on the objective.
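To make the row or column question concrete, here is a minimal sketch in base R. The matrix, its values and its names are made up for illustration; nothing here comes from CATALYST.

```r
# Toy matrix: 3 clusters (rows) x 2 markers (columns); values are made up.
mat <- matrix(c(1, 2, 3,
                10, 20, 60),
              nrow = 3,
              dimnames = list(paste0("cluster", 1:3), c("markerA", "markerB")))

# Column z-score: each marker is centred and scaled across clusters.
z_col <- scale(mat)

# Row z-score: each cluster is centred and scaled across markers.
z_row <- t(scale(t(mat)))

print(round(z_col, 2))
print(round(z_row, 2))
```

The two results are not interchangeable. With only two markers, every row z-score collapses to about -0.71 and +0.71 whatever the original intensities were, which shows how much the choice of direction changes what the heatmap can tell you.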

The "Differential discovery" vignette gives precise information about plotExprHeatmap and plotDiffHeatmap, as you probably know. Because the input values of those functions are not the same, we cannot simply use z-score, especially for intensity values. If we use a no-brainer z-score, we should be cautious with the interpretation. In fact, I use the z-score if I can associate a meaning to the standard deviation of the values or if I have no better transformation to apply to the values. The z-score is my last choice.

The main point to address is how the heatmap will be read/interpreted. If we have defined how to read it, we know how to build it. Below I propose my way (and some alternatives) of building/reading heatmaps which is close to but not exactly what is done in CATALYST.

A heatmap is a color representation of a matrix of values and the color scale is its key. The color scale maps colors and values. The color scale is the global key to reading the color and guessing quickly the underlying trend. By global, I emphasize that the color scale is the same for each marker. In a scale, we typically need the zero and the one unit (or whatever value that makes sense) to calibrate the scale. Both make reading easier and should be defined and used during reading.
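As a rough sketch of such a global color scale, the base R snippet below builds one set of breaks that is symmetric around zero and shared by all markers. The values, the 0.5 step and the blue/white/red palette are arbitrary choices for illustration.

```r
# Toy values standing in for a matrix to be plotted (made-up numbers).
mat <- matrix(c(-1.5, 0.2, 0.8, 2.4, -0.3, 1.1), nrow = 2)

# One symmetric range for the whole matrix, so zero sits in the middle of the scale.
limit <- ceiling(max(abs(mat)))
breaks <- seq(-limit, limit, by = 0.5)   # bin edges shared by all markers

# One colour per bin: blue below zero, near white around zero, red above.
cols <- grDevices::colorRampPalette(c("blue", "white", "red"))(length(breaks) - 1)

# Map every value to its bin with the same breaks, whatever its column.
bins <- cut(mat, breaks = breaks, include.lowest = TRUE)
```

The same `breaks` and `cols` could then be passed to a plotting function that accepts them, for example the `breaks` and `color` arguments of pheatmap, so that a given color means the same value in every column.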

What are the values? Values are typically of two types in cytometry: fluo/mass intensity and log fold change. This differs from the other omics, in which the values are usually log fold changes because the heatmap is used during the differential analysis (cf 2nd point below).

Once clustering is achieved, we summarize the information per cluster.

1. We compute median intensity of markers used in the clustering (called "type" during the prepData stage, e.g. CD3, which corresponds to lineage markers). This information is shared across all samples and not computed per sample.
2. We count the cells in each cluster for each sample, which leads to the abundance per cluster and per sample.
3. We compute median intensity of markers typically not used in the clustering (called "state", e.g. NFkB, which corresponds to signaling/activation level) per cluster and per sample.
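The three summaries described above can be sketched in base R on a toy per-cell table. The data frame, its column names and its values are assumptions made for illustration; this is not how CATALYST stores or computes them internally.

```r
# Toy per-cell table (made-up values); column names are illustrative only.
cells <- data.frame(
  sample_id  = c("s1", "s1", "s1", "s2", "s2", "s2"),
  cluster_id = c("c1", "c1", "c2", "c1", "c2", "c2"),
  CD3        = c(2.0, 4.0, 0.5, 3.0, 0.1, 0.3),   # a "type" marker
  NFkB       = c(1.0, 3.0, 0.2, 5.0, 0.4, 0.8)    # a "state" marker
)

# Type marker: median per cluster, pooled over all samples.
type_median <- tapply(cells$CD3, cells$cluster_id, median)

# Abundance: number of cells per cluster and per sample.
abundance <- table(cells$cluster_id, cells$sample_id)

# State marker: median per cluster and per sample.
state_median <- tapply(cells$NFkB,
                       list(cells$cluster_id, cells$sample_id),
                       median)
```

Each result has a different shape (a vector per cluster, a clusters x samples table of counts, a clusters x samples table of medians), which is one reason a single transformation does not suit every heatmap.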

The computations may seem less clear after my presentation. You asked if the z-score could be the unique tool for transforming the data before plotting it in a heatmap. I offered many options, with the z-score being the last one to be used. Everyone should perform the computations and the heatmaps using MS Excel, and everything should be clearer.

"What I cannot create, I do not understand." Richard Feynman

Whatever your transformation choices, say what you do, do what you say.

Hope this helps, Samuel

algebio commented 2 years ago

Dear Samuel

Thank you for your detailed answer, I have read it so many times! I understood that the z-score should be the last option, and I agree with you that any transformation should be done in Excel to have a clear idea of the calculations. Thanks also for the Feynman quote (what an inspiration!). I followed your advice and asked this question in cytoforum as well.

I will explain your arguments in my presentation, but I still have to find a way to produce the z-score normalised heatmap using my single cell experiment. Ideally with CATALYST, but any other R package would be fine. Any suggestions?

Regards Juan

SamGG commented 2 years ago

Dear Juan,

Thanks for your feedback. I hope it was clear enough. There are other points related to building heatmaps, such as the color gradient and the thresholding.

The first step is to extract the data you need, i.e. aggregated data. Helena might give you easy access to it.

Meanwhile, you might try to extract the matrix displayed by the 2 functions. If I am not wrong, the untested code below should help.

res = plotExprHeatmap( your_parameters, scale = "never" )
mat = res@matrix
# not sure this is needed
mat = as.matrix( mat )
# you could export it as CSV for Excel if you want
write.csv( mat, file = "mat_exprs.csv" )
# clusters should be in row, markers in column
# scale() z-scores each column, i.e. each marker across clusters
mat_zscore = scale( mat )
# new heatmap using the simple pheatmap package and function
# pheatmap could also scale by row or column (scale = "row" or "column"), but by default it is off
# plot the z-scored matrix, not the raw one
pheatmap::pheatmap( mat_zscore )
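Regarding thresholding, one of the other points mentioned above, here is a minimal sketch on a toy matrix that stands in for `mat`. The values and the cut-off of 1 are arbitrary assumptions chosen so that the effect is visible.

```r
# Toy matrix standing in for `mat` (clusters in rows, markers in columns); values are made up.
toy <- matrix(c(0.1, 0.2, 0.3, 5.0,
                1.0, 1.5, 2.0, 2.5),
              nrow = 4,
              dimnames = list(paste0("cluster", 1:4), c("markerA", "markerB")))

# Column z-score, as scale() does by default.
toy_z <- scale(toy)

# Thresholding: cap z-scores at +/- cap so one extreme cluster does not use up the colour range.
# The cut-off of 1 is an arbitrary choice for this illustration.
cap <- 1
toy_capped <- pmin(pmax(toy_z, -cap), cap)
```

If you threshold, say so in the legend, because capped cells no longer show their true value.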

Try it and use the help of these functions.

Best, Samuel

algebio commented 2 years ago

Dear Sam

Thank you so much for your help. The code worked well (no error messages) and produced the scaled heatmap.

Regards Juan