Incongruent results - Githubissues

svalvaro commented 3 years ago

Hi @anupshah14, I have realized that you can obtain different results (without changing any parameter, obviously) just by pressing Start Analysis again. These are a few screenshots taken just after pressing Start Analysis with the demo data. I believe this is an issue related to the DEP library.

Screenshot from 2021-04-12 16-31-52

anupshah14 commented 3 years ago

Hi @BioAlvaro,

This observation is due to the general workflow used for LFQ-based proteomics data analysis and underlying variance in individual biological replicates.

In LFQ-Analyst and DEP, as in other workflows such as Perseus, a number of pre-processing and data filtering steps are performed before the statistical analysis that generates the list of DE proteins. The details are here: https://tinyurl.com/LfqAnalystDocs

DDA-based LFQ data has a known missing-values problem. First, we remove the observations with a high proportion of missing values. The second step is to impute the observations with a small proportion of missing values before the statistical analysis. In this step, each time you run the analysis, random values are drawn from a left-shifted sample distribution.
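To illustrate why repeated runs differ, here is a minimal sketch of this kind of imputation in Python. It is not the DEP or LFQ-Analyst code: the function name `impute_left_shifted` is hypothetical, and the `shift` and `scale` defaults are assumptions chosen for illustration. Missing values are replaced by random draws from a normal distribution moved towards low intensities, so two runs agree only if they use the same random seed.

```python
import random
import statistics


def impute_left_shifted(values, shift=1.8, scale=0.3, rng=None):
    """Replace None entries with draws from a down-shifted normal distribution.

    shift and scale are expressed in units of the observed standard deviation.
    Both defaults are illustrative assumptions, not values read from DEP.
    """
    rng = rng or random.Random()
    observed = [v for v in values if v is not None]
    mu = statistics.mean(observed)
    sd = statistics.stdev(observed)
    return [
        v if v is not None else rng.gauss(mu - shift * sd, scale * sd)
        for v in values
    ]


# Toy log2 intensities for one sample; None marks a missing value.
sample = [25.1, 26.3, None, 24.8, 27.0, None, 25.6]

run_a = impute_left_shifted(sample, rng=random.Random(1))
run_b = impute_left_shifted(sample, rng=random.Random(2))
run_a_again = impute_left_shifted(sample, rng=random.Random(1))

print(run_a == run_b)        # different seeds give different imputed values
print(run_a == run_a_again)  # the same seed reproduces the run
```

In this sketch, fixing the seed is what makes a run repeatable. Whether the tool exposes such a setting is not stated in this thread.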

Because of that, the fold changes and p-values shift slightly and can land on either side of the cutoffs (fold change and adjusted p-value) used to obtain the numerator, i.e. the number of differentially expressed proteins (738 vs. 772 vs. 774 vs. 781). In addition, the multiple hypothesis correction step sometimes results in slight variation in the p-values. So although the total number of quantified proteins remains constant (2389 here), the number of altered proteins changes slightly.

Therefore, to further assist with data interpretation, additional "imputed" and "num_NAs" columns are provided in the result table. They indicate whether the intensity values of a protein were imputed or not.
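As a rough illustration of how these two columns could be used, the sketch below splits result rows into proteins based only on measured intensities and proteins with imputed values. The "imputed" and "num_NAs" column names come from the description above. Everything else is an assumption: the "name" column, the boolean encoding of "imputed", and the toy rows.

```python
def split_by_imputation(rows):
    """Return (measured, imputed) protein names based on the 'imputed' flag."""
    measured, imputed = [], []
    for row in rows:
        # Assumes 'imputed' has already been parsed into a Python bool.
        (imputed if row["imputed"] else measured).append(row["name"])
    return measured, imputed


# Toy rows, invented for illustration only.
toy_rows = [
    {"name": "P1", "imputed": False, "num_NAs": 0},
    {"name": "P2", "imputed": True, "num_NAs": 2},
    {"name": "P3", "imputed": True, "num_NAs": 1},
]

measured, imputed = split_by_imputation(toy_rows)
print(measured)  # proteins with no imputed intensities
print(imputed)   # proteins whose result may change between runs
```

Proteins in the second list are the ones whose fold changes and p-values depend on the random draws.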

Hope that helps.

svalvaro commented 3 years ago

Hi @anupshah14 ,

That is what I thought, glad that you confirmed it.

However, your answer raises even more questions: I was wondering whether someone who runs the analysis just once might miss certain significant proteins because the software chooses random values when imputing.

So would it be a good idea that the software, after pressing start analysis, does the statistical analysis x number of times, and aggregates all the differentially expressed proteins?

Thanks for your detailed answer,

Best.

anupshah14 commented 3 years ago

Hi @svalvaro ,

Imputation is still debated in the proteomics field. Typically, the assumption in DDA-based LFQ datasets is that missing values are Missing Not At Random (MNAR type), so imputing random values from low intensities will add enough variability to the dataset to assist the downstream differential expression test (t-test or others). But sometimes missing values can be found in highly expressed proteins as well. I think the results with any imputation should be taken with a grain of salt, and I would look into the raw intensities and peptide-level data to make an informed estimate. That is an excellent suggestion to use an ensemble method by running the statistical analysis x times! I have not tried it yet, but I think it will be complex to implement because it is not clear to me which fold change and p-values to report in that case. Any ideas are welcome.
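One possible way to summarise repeated runs, offered only as a sketch and not as something implemented in LFQ-Analyst or DEP: keep each run's fold change and adjusted p-value per protein, then report the median of each together with the fraction of runs in which the protein passed both cutoffs. The function name `aggregate_runs`, the cutoff defaults and the toy numbers are all hypothetical. Taking the median of adjusted p-values is a heuristic and does not replace formal pooling of multiply imputed analyses.

```python
import statistics
from collections import defaultdict


def aggregate_runs(runs, fc_cutoff=1.0, p_cutoff=0.05):
    """Summarise per-protein results over repeated imputation runs.

    runs: list of dicts mapping protein name -> (log2 fold change, adjusted p).
    The cutoff defaults are illustrative assumptions.
    """
    per_protein = defaultdict(list)
    for run in runs:
        for protein, result in run.items():
            per_protein[protein].append(result)

    summary = {}
    for protein, results in per_protein.items():
        hits = sum(
            1 for fc, p in results if abs(fc) >= fc_cutoff and p <= p_cutoff
        )
        summary[protein] = {
            "median_log2_fc": statistics.median(fc for fc, _ in results),
            "median_adj_p": statistics.median(p for _, p in results),
            "fraction_significant": hits / len(results),
        }
    return summary


# Toy results from three runs, invented for illustration only.
toy_runs = [
    {"P1": (2.1, 0.001), "P2": (1.05, 0.048)},
    {"P1": (2.0, 0.002), "P2": (0.95, 0.055)},
    {"P1": (2.2, 0.001), "P2": (1.10, 0.041)},
]

summary = aggregate_runs(toy_runs)
for protein, stats in summary.items():
    print(protein, stats)
```

With this kind of summary, a protein that is significant in every run stands apart from one that crosses the cutoffs only in some runs, which makes borderline calls visible instead of hiding them behind a single run.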

Regards

svalvaro commented 3 years ago

Hi @anupshah14 , Very interesting points. Thanks for the great answer.

Regards

MonashBioinformaticsPlatform / LFQ-Analyst

Incongruent results #8