algebio closed this issue 2 years ago
Hi,
Just my point of view. I think cytoforum would be a better place, as once the issue is closed, the answer is usually forgotten (not searched). Feel free to copy your question to cytoforum, and you will get an interesting range of answers showing the diversity of views.
I think your question is of general interest. Even among "bioinformaticians", the shortcut is to apply a z-score. And I would provocatively ask: row or column z-score? In fact, the z-score is the no-brainer solution for differential expression. That's why you were asked to use it, IMHO. But the z-score can also be misleading depending on the objective.
The "Differential discovery" vignette gives precise information about plotExprHeatmap and plotDiffHeatmap, as you probably know. Because the input values of those functions are not the same, we cannot simply apply a z-score everywhere, especially to intensity values. If we use a no-brainer z-score anyway, we should be cautious with the interpretation. In practice, I use the z-score only if I can attach a meaning to the standard deviation of the values, or if I have no better transformation to apply. The z-score is my last choice.
The main point to address is how the heatmap will be read/interpreted. If we have defined how to read it, we know how to build it. Below I propose my way (and some alternatives) of building/reading heatmaps which is close to but not exactly what is done in CATALYST.
A heatmap is a color representation of a matrix of values, and the color scale is its key. The color scale maps values to colors; it is the global key for reading the colors and quickly guessing the underlying trend. By global, I emphasize that the color scale is the same for every marker. In a scale, we typically need a zero and a one unit (or whatever values make sense) to calibrate it. Both make reading easier and should be defined before reading.
What are the values? In cytometry, values are typically of two types: fluo/mass intensity and log fold change. This differs from the other omics, in which the values are usually log fold changes because the heatmap is used during the differential analysis (cf. the 2nd point below).
Once the clustering is achieved, we summarize the information per cluster. We compute the median intensity of the markers used in the clustering (called "type" at the prepData stage, e.g. CD3, corresponding to lineage markers); this information is pooled across all samples, not computed per sample. We count the cells in each cluster for each sample, which leads to the abundance per cluster and per sample. Finally, we compute the median intensity of the markers typically not used in the clustering (called "state", e.g. NFkB, corresponding to signaling/activation levels) per cluster and per sample.
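To make these three summaries concrete, here is a minimal base R sketch on a hypothetical per-cell table (the cluster labels, sample ids and toy values are invented for illustration, not CATALYST output):

```r
# Hypothetical data: one row per cell, with a cluster label, a sample id,
# a lineage ("type") marker and a signaling ("state") marker
set.seed(1)
cells <- data.frame(
  cluster = sample(c("c1", "c2"), 100, replace = TRUE),
  sample  = sample(c("s1", "s2"), 100, replace = TRUE),
  CD3     = rnorm(100, mean = 2),   # type marker: summarized per cluster only
  NFkB    = rnorm(100, mean = 1)    # state marker: per cluster AND per sample
)

# 1) Median type-marker intensity per cluster, pooled across samples
type_med <- aggregate(CD3 ~ cluster, data = cells, FUN = median)

# 2) Cell counts per cluster and per sample -> abundances
counts <- table(cells$cluster, cells$sample)

# 3) Median state-marker intensity per cluster and per sample
state_med <- aggregate(NFkB ~ cluster + sample, data = cells, FUN = median)
```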
The intensity of lineage markers is used to annotate the clusters, which requires concluding, for each marker, whether the median intensity of a cluster is high or low. The expert can then assign a '-', 'mid', '+' or '++' tag to the marker. Low values sit at zero, and zero means the same thing for all markers (no molecule, zero intensity). High values depend on the marker, because labeling, titration, etc. differ for each marker. In order to use the same color range for every marker, the median marker intensity of the clusters is divided by the highest median intensity. With such a marker-per-marker transformation, the lowest value is zero and the highest is one. There are alternatives, such as using the minimum and maximum, or quantiles. Or we could directly represent the asinh-transformed intensity on an open scale. By reading the numerical values attached to the color scale and looking at the color range of each marker, it is possible to guess whether some kind of min-to-max transformation has been applied.
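A sketch of this marker-per-marker transformation, with invented median values (clusters in rows, markers in columns):

```r
# Hypothetical matrix of cluster medians: clusters in rows, markers in columns
med <- rbind(c1 = c(CD3 = 0.5, CD19 = 4.0),
             c2 = c(CD3 = 6.0, CD19 = 1.0))

# Divide each marker (column) by its highest cluster median:
# zero stays zero, and the brightest cluster maps to 1 for every marker
med_scaled <- sweep(med, 2, apply(med, 2, max), "/")

# Alternative: min-max scaling per marker, so each column spans exactly 0..1
rng <- apply(med, 2, range)
med_minmax <- sweep(sweep(med, 2, rng[1, ], "-"), 2, rng[2, ] - rng[1, ], "/")
```

Note that the two variants disagree on where "low" lands: division by the maximum keeps the true zero, while min-max scaling forces the dimmest cluster to zero even if its median is well above background.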
The log fold change is used to highlight differences. Differences of what? Mainly abundance, but it could also be the intensity of "state" markers. Let's talk about abundance without loss of generality. The abundance per cluster and per sample should first be log transformed, which reduces the dynamic range, stabilizes the variance, and makes differences symmetric. Because the primary interest within a cluster is the variation of the abundance across samples (more precisely, between groups of samples), the abundance is centered: the average abundance across all groups is typically subtracted. Alternatively, the average of the control group is used instead. This computation leads to the log fold change stricto sensu. If we stop here, which is my usual option, the heatmap shows which cluster has the highest variation between groups of samples. The color scale is absolute in the sense that no scaling per cluster has been done.

But scaling is usually carried out, using the standard deviation of the abundance across all samples, which leads to the z-score. I dislike this computation, because this "standard deviation" merges the variation between groups with the variation within groups: it is the total variation, whereas the within-group variation is the "real" dispersion of the abundance. Thus, when the between-group variation is high, the abundances will be heavily penalized (i.e. divided by a high value). That does not make sense to me. The statistical test we could apply (ANOVA or Student's t) really does consider the between-group variation (aka signal) divided by the within-group variation (noise). Hopefully :-).

The term abundance could be replaced by the intensity of a "state" marker; of course, the log should not be applied to asinh-transformed intensities, as they are already transformed. And of course, there is a problem with zero abundance when computing the log, but it is secondary compared to the main computation described above.
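A minimal sketch of this computation, assuming hypothetical counts for one cluster across six samples in two groups (a pseudo-count sidesteps the zero-abundance problem mentioned above):

```r
# Hypothetical abundances (cell counts) of one cluster across 6 samples
counts <- c(ctrl1 = 100, ctrl2 = 120, ctrl3 = 80,
            stim1 = 400, stim2 = 350, stim3 = 500)
group  <- c("ctrl", "ctrl", "ctrl", "stim", "stim", "stim")

# Log transform (pseudo-count of 1 avoids log(0))
log_ab <- log2(counts + 1)

# Center on the control-group mean: the log fold change stricto sensu
lfc <- log_ab - mean(log_ab[group == "ctrl"])

# Optional (and questionable, as discussed): divide by the SD across ALL
# samples, which mixes between-group and within-group variation -> z-score
z <- lfc / sd(log_ab)
```

Stopping at `lfc` keeps the color scale absolute; the extra division producing `z` is exactly the step where a large between-group difference inflates the denominator and shrinks its own signal.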
My presentation may have made the computations seem less simple than they are. You asked whether the z-score could be the single tool for transforming the data before plotting it in a heatmap. I offered many options, with the z-score being the last resort. Try performing the computations and the heatmaps in MS Excel; everything should become clearer.
"What I cannot create, I do not understand." Richard Feynman
Whatever your transformation choices, say what you do, do what you say.
Hope this helps, Samuel
Dear Samuel
Thank you for your detailed answer, I have read it so many times! I understood that the z-score should be the last option, and I agree with you that any transformation should be done in Excel first to get a clear idea of the calculations. Thanks also for the Feynman quote (what an inspiration!). I followed your advice and asked this question on cytoforum as well.
I will explain your arguments in my presentation, but I still have to find a way to produce the z-score normalised heatmap from my single-cell experiment. Ideally with CATALYST, but any other R package would be fine. Any suggestion?
Regards Juan
Dear Juan,
Thanks for your feedback. I hope it was clear enough. There are other points related to building heatmaps, such as the color gradient and the thresholding.
The first step is to extract the data you need, i.e. the aggregated data. Helena might give you easy access to it.
Meanwhile, you might try to extract the matrix displayed by the 2 functions. If I am not wrong, the untested code below should help.
res = plotExprHeatmap( your_parameters, scale = "never" )
mat = res@matrix
# not sure this is needed
mat = as.matrix( mat )
# you could export it as CSV for Excel if you want
write.csv( mat, file = "mat_exprs.csv" )
# clusters should be in rows, markers in columns;
# scale() computes the z-score column-wise, i.e. per marker
mat_zscore = scale( mat )
# new heatmap using the simple pheatmap package and function
# pheatmap could also scale by row or column via its scale argument, but by default it is off
pheatmap::pheatmap( mat_zscore )
Try it and use the help of these functions.
Best, Samuel
Dear Sam
Thank you so much for your help. The code worked well (no error messages) and produced the scaled heatmap.
Regards Juan
Hello CATALYST team
This is not an issue but just a question to try to understand heatmaps. I think your input will greatly help me (and the community using CATALYST), but please delete this question if you consider that this is a space for package issues and not general questions.
I'm trying to understand why the first heatmap uses the scaled median expression, the second the median scaled expression, the third one the normalized frequency (the same for the DA heatmap), and why the DS heatmap uses the z-normalized expression.
I have a special interest in the z-normalized expression because I have been asked to produce every heatmap using the z-score. First, how could I do it? And second, is there any reason why I shouldn't do it for every heatmap? I guess there is, since you don't, but I need this information if I'm not going to z-score every heatmap.
I know I say this very often but, thanks for this amazing tool, I am a big fan of CATALYST.
Regards Juan