ahmohamed / lipidr

Data Mining and Analysis of Lipidomics datasets in R
https://www.lipidr.org/

What kind of numeric measurements does as_lipidomics_experiment expect/assume? #38

Closed marcora closed 1 year ago

marcora commented 1 year ago

It is not clear from the documentation what kind of numeric measurements as_lipidomics_experiment expects/assumes. Is it peak area, [Mol], or molar fraction [%Mol]?

Surely the distributional properties of the statistical tests may change depending on the type of measurement (e.g., raw vs %).

ahmohamed commented 1 year ago

@marcora You are absolutely correct. lipidr uses limma for statistical testing, which assumes that the log-transformed data are approximately normally distributed. In general, this assumption fits peak areas measured by MS.

You can always examine the distribution of your measurements using ggplot2::geom_density before passing them to lipidr, or use plot_samples(type="boxplot"). If your original data are already normally distributed, you can tell lipidr to skip the log transformation using as_lipidomics_experiment(logged=TRUE).
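As a rough, language-agnostic illustration of that check (written in Python rather than R, and not calling lipidr), the sketch below compares the skewness of simulated values before and after a log2 transform. The simulated "peak areas" and the `skewness` helper are assumptions made for this example only; with real data you would substitute your own measurements.

```python
import math
import random
import statistics


def skewness(values):
    """Third standardized moment; close to 0 for symmetric data."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Simulated, hypothetical peak areas drawn from a log-normal distribution.
rng = random.Random(0)
areas = [rng.lognormvariate(10, 1) for _ in range(500)]

# log2-transform the values, as one would before fitting a linear model.
logged = [math.log2(a) for a in areas]

raw_skew = skewness(areas)
log_skew = skewness(logged)
print(f"skewness raw: {raw_skew:.2f}, log2: {log_skew:.2f}")
```

If the raw values are strongly right-skewed and the logged values are roughly symmetric, log transformation is probably appropriate; if the raw values are already symmetric, the data may not need it.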

marcora commented 1 year ago

So what if the lipid values are given by the lipidomics facility as Mol or molar fraction %Mol? I understand I can inspect the distribution, but without a large sample size, and for newbies like me, it would be important to provide guidance on how to use lipidr with different kinds of inputs and what transformation/normalization to use with each of them.

Just asking users to put "lipid" values in the input matrix is the perfect recipe for people inputting whatever values they are given and applying default parameters, at the risk of getting wrong inferences and inflated p-values out of it.

That's how statistical tools are used by the majority of experimental biologists with poor training in statistical modeling, and it is one of the major causes of irreproducible research.

A bit of documentation in the package regarding the types of lipid measurements that are most often used, alongside guidance on how to analyze them in lipidr, would be very helpful in preventing wrong usage of the tool.

ahmohamed commented 1 year ago

Thanks @marcora. Without seeing the data or knowing the generative model / how it was preprocessed, there's no way I can make assumptions about the data distribution. Unfortunately, I've only dealt with peak areas so far, which usually follow a normal(-like) distribution when log-transformed. If you think the documentation needs improvement regarding types of lipid measurements, I'm happy to review a PR if you can contribute, ideally a vignette with public data (similar to the examples on the lipidr website https://www.lipidr.org/).

marcora commented 1 year ago

That would be great! It's my first time using lipidr and lipidomics data, so it may take me some time, but I will get back to you here.

With only a small number of samples it's hard to estimate the distribution, though, and I was hoping to tap into prior experience.

marcora commented 1 year ago

The lipid numeric values I have are estimates (based on spiked-in standards) of lipid molar concentration (moles of lipid/L of final extract that was analyzed). They also provide me with molar fraction values (Mol%) normalized within each sample.

Which of the two numeric values (Mol or Mol%) should I use? If Mol is the answer, should I log and normalize the values within lipidr?

Thanks

ahmohamed commented 1 year ago

I'd go with Mol, since it's just a scaled peak area, and most likely will follow the same distribution. In this case, I'd log transform it and skip normalization (since it's already normalized?).

As discussed above, if you want to use Mol%, you'd need to somehow determine whether they are normally distributed with/without log-transformation.
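The reasoning that a scaled peak area keeps the same distribution after log transformation can be checked with a small sketch: multiplying by a constant only shifts the log values by a fixed amount. The areas and the scaling factor below are made-up numbers for illustration, and the example assumes one constant factor, which real standard-based quantification may not satisfy exactly.

```python
import math
import statistics

# Hypothetical peak areas for one lipid across six samples.
areas = [1200.0, 1500.0, 900.0, 2100.0, 1750.0, 1300.0]

# Assumed constant area-to-concentration factor (illustrative only).
scale = 0.004
mol = [a * scale for a in areas]

log_areas = [math.log2(a) for a in areas]
log_mol = [math.log2(m) for m in mol]

# Each log value moves by the same constant, log2(scale).
shifts = [lm - la for lm, la in zip(log_mol, log_areas)]

# The spread of the log values is therefore unchanged by the scaling.
sd_areas = statistics.stdev(log_areas)
sd_mol = statistics.stdev(log_mol)
print(shifts[0], sd_areas, sd_mol)
```

Mol% is different: each sample is divided by its own total, so the divisor is not constant across samples, which is why its distribution has to be checked separately.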

marcora commented 1 year ago

Understood! In the end, I was able to obtain peak areas from the lipidomics facility I used for my study.