ahmohamed / lipidr

Data Mining and Analysis of Lipidomics datasets in R
https://www.lipidr.org/

Normalization - Log transformation #54

Open semer94 opened 1 year ago

semer94 commented 1 year ago

I am dealing with a lipidomics dataset exported from MS-DIAL that consists of peak area data normalized with the LOESS algorithm. While several lipids showed significant results in the univariate analysis in MS-DIAL, I cannot reproduce these results. I would like to ask:

  1. Which variant of the t-test is performed, and which method is used to adjust p-values, in the function de_analysis()?
  2. Which group is considered the reference in de_analysis(lpd, vitE - vitE_SPL, measure = "Area", group_col = "Group")?
  3. What is the base of the logFC reported in the results (I assumed e)?
  4. Do you have any suggestions on modifying the data, e.g. log transformation or some other type of normalization?
  5. How do the functions set_logged and set_normalized work, i.e. what values does the argument "val" need?

With respect

ahmohamed commented 1 year ago

Hi @semer94,

Thanks for submitting your questions as an issue.

  1. lipidr uses the limma moderated t-test, which is very popular in gene expression analysis. The data should be a) approximately normally distributed and b) normalised.

Raw peak areas from MS need to be log-transformed to make them approximately normally distributed. Normalisation can be done with whichever method you prefer, and each has its own requirements / assumptions.

Depending on your input data, you can skip some of these steps. Log-transformation is not needed if the data are already scaled, pre-logged, or otherwise follow a normal distribution. Similarly, you don't need to re-normalise your data if that has already been done.
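To illustrate why the log transformation matters, here is a minimal base-R sketch. The peak areas are simulated (a hypothetical log-normal sample, not real MS-DIAL output), and the `skewness` helper is defined here only for illustration. It compares how asymmetric the values are before and after `log2()`:

```r
# Simulated, hypothetical peak areas: log-normal, as MS intensities often are.
set.seed(1)
area <- rlnorm(500, meanlog = 12, sdlog = 1)

# Sample skewness (third standardised moment); close to 0 for symmetric data.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

raw_skew <- skewness(area)        # strongly right-skewed
log_skew <- skewness(log2(area))  # roughly symmetric

cat("skewness, raw areas: ", round(raw_skew, 2), "\n")
cat("skewness, log2 areas:", round(log_skew, 2), "\n")
```

If the skewness of your exported matrix is already close to zero, the data may have been logged or scaled upstream, and the transformation can be skipped.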

So in your case, assuming you exported a numerical matrix from MS-DIAL:

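library(SummarizedExperiment)  # provides assay() and `assay<-`
library(lipidr)

# Log-transform the data if it is not already logged.
# Skip if already logged!
assay(d, "Area") <- log2(assay(d, "Area"))
# set_logged() returns the updated object, so assign the result back to d
d <- set_logged(d, "Area", TRUE)

# Normalise; skip if already normalised (e.g. LOESS in MS-DIAL).
assay(d, "Area") <- limma::normalizeCyclicLoess(assay(d, "Area"))
d <- set_normalized(d, "Area", TRUE)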

Note the use of set_logged and set_normalized to indicate that the "Area" is now logged and normalised. Also, LOESS-based normalisation generally requires normal distribution (so needs to be pre-logged).

  2. The general convention is de_analysis(treatment - control) (treatment minus control), since you're usually interested in changes in the treated group compared to control. Subtracting the control accomplishes this. In your call, vitE - vitE_SPL, the reference group is therefore vitE_SPL.
  3. The logFC is (roughly) the difference between group means (mean abundance in treatment - mean abundance in control). Since the data are in log space, it's called the log-fold change.
  4. Answered above. In general I trust the moderated t-test, since it has been shown to be more robust. Obviously, nothing supersedes validated results.
  5. Answered above.
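The contrast direction and the meaning of logFC can be sketched in base R. The numbers below are hypothetical peak areas for a single lipid, and the sketch assumes the data were log2-transformed as in the snippet above. It shows only the plain difference of group means, not limma's moderated statistics:

```r
# Hypothetical raw peak areas for one lipid, log2-transformed per group.
vitE     <- log2(c(4000, 4400, 3600))  # left of the minus sign ("treatment")
vitE_SPL <- log2(c(1000, 1100, 900))   # right of the minus sign (reference)

# Contrast vitE - vitE_SPL: difference of group means in log2 space.
logFC <- mean(vitE) - mean(vitE_SPL)

# Back-transform with the same base used for the log transformation (2 here).
fold_change <- 2^logFC

cat("logFC:", logFC, " fold change:", fold_change, "\n")
```

A positive logFC means higher abundance in vitE than in vitE_SPL. The base of logFC is whatever base was used to log the data, so `exp(logFC)` would only be appropriate if the areas had been transformed with the natural logarithm.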

Hope this helps. Let me know if you have other questions. Otherwise feel free to close the issue.

semer94 commented 1 year ago

Thank you for your immediate response. Another question that occurred to me is: why a log2 transformation rather than a natural log transformation, given that the results are reported as logFC and not log2FC? And if I want to calculate the fold change, is this done with exp(logFC)? Finally, a question regarding lipid names: how should SM 16:1;O2/24:1 and SM 18:2;O2/22:0 be renamed in order to be parsed by lipidr?