ftwkoopmans / msdap

MS-DAP: downstream analysis pipeline for quantitative proteomics
GNU General Public License v3.0

Way to handle missing value? #21

Closed ht-lau closed 10 months ago

ht-lau commented 11 months ago

Thank you for this tool.

I want to know if you have any recommendations for how to handle missing values. I understand that imputation is not ideal. However, it is hard to accept that condition-specific proteins are not flagged as significant.

Would you be able to update MSqRob to msqrob2? I think their count model could be a good alternative to imputation.

HT

ftwkoopmans commented 11 months ago

> I want to know if you have any recommendations for how to handle missing values. I understand that imputation is not ideal.

Indeed. We found that imputation solves some issues but also introduces its own bias; consequently, we saw no benefit from imputation in the 2-proteome benchmark datasets that we generated (i.e. a more challenging, real-world test of small spike-in differences).

> However, it is hard to accept that condition-specific proteins are not flagged as significant.

This was also our observation: stringent filtering that retains only peptides found in N samples per condition results in robust differential expression analyses, but the downside is that condition-specific proteins (e.g. in a classic wildtype-knockout setting) are not among the significant proteins.
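To make the trade-off concrete, here is a minimal Python sketch of that kind of filter. It is not MS-DAP code: the function name, the data layout, and the reading of "N samples" as "at least `min_detect` samples in every condition" are assumptions made for illustration.

```python
def filter_peptides(intensities, conditions, min_detect):
    """Keep peptides detected in >= min_detect samples of EVERY condition.

    intensities: {peptide: {sample: value or None}}  (None = missing value)
    conditions:  {condition: [sample, ...]}
    Hypothetical helper for illustration, not part of MS-DAP.
    """
    kept = []
    for peptide, by_sample in intensities.items():
        # number of samples with a detection, per condition
        counts = [
            sum(by_sample.get(s) is not None for s in samples)
            for samples in conditions.values()
        ]
        if all(c >= min_detect for c in counts):
            kept.append(peptide)
    return kept


conditions = {"WT": ["wt1", "wt2", "wt3"], "KO": ["ko1", "ko2", "ko3"]}
intensities = {
    "pep_shared": {"wt1": 20.1, "wt2": 19.8, "wt3": 20.3,
                   "ko1": 19.9, "ko2": 20.0, "ko3": None},
    # detected in all wildtype samples, never in the knockout
    "pep_wt_only": {"wt1": 22.4, "wt2": 22.1, "wt3": 22.6,
                    "ko1": None, "ko2": None, "ko3": None},
}

kept = filter_peptides(intensities, conditions, min_detect=2)
print(kept)  # ['pep_shared']
```

The wildtype-only peptide never reaches the statistical model, which is why a knockout target cannot show up as significant after this filter.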

Our solution is to perform differential expression analysis as usual (e.g. MSqRob or DEqMS), then separately test for "differential detection" to find proteins with many more peptide detections in one experimental condition than in the other, and finally merge the top hits from this "differential detection" with the results we obtained from MSqRob/DEqMS/etc. Please take a look at this documentation to see if this'll help you. Please make sure you use the latest version of MS-DAP. Also, note that the analysis_quickstart() function has some new parameters in the latest version; check the main readme for instructions on updating and for a reference of our recommended default settings.

This is obviously not a perfect solution, and in our lab we tend to err on the safe side and only take the really strong "differential detect" proteins into account (typically a z-score cutoff of at least 5; check the plot_differential_detect() figures as shown in the online vignette). But it works quite well in our hands.
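The general idea of scoring detection counts and applying a strict z-score cutoff can be sketched as follows. This is a simplified stand-in, not MS-DAP's actual scoring (see the vignette for that): the log2 ratio of pseudocounted detection counts, the function name, and the example data are all assumptions.

```python
import math
import statistics


def differential_detection_z(counts):
    """counts: {protein: (detections_in_A, detections_in_B)}, where a
    detection is one peptide observed in one sample.
    Returns a z-score per protein. Hypothetical scoring for illustration only.
    """
    # log2 ratio with a pseudocount so that zero detections are allowed
    scores = {p: math.log2((a + 1) / (b + 1)) for p, (a, b) in counts.items()}
    values = list(scores.values())
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return {p: (s - mu) / sd for p, s in scores.items()}


# 40 background proteins with similar detection counts in both conditions
background = [(12, 12), (11, 12), (12, 11), (10, 10)]
counts = {f"bg_{i}": background[i % 4] for i in range(40)}
# one protein detected only in condition A (e.g. the knockout target)
counts["KO_target"] = (18, 0)

z = differential_detection_z(counts)
flagged = [p for p, score in z.items() if abs(score) >= 5]
print(flagged)  # ['KO_target']
```

Only proteins that pass the cutoff would then be merged with the differential expression results; everything else is left to MSqRob/DEqMS.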

> Would you be able to update MSqRob to msqrob2? I think their count model could be a good alternative to imputation.

I agree, the hurdle model is an interesting angle on the mixture of MCAR and MNAR missing values in proteomics. In our hands it didn't outperform the original MSqRob, though, when I tested it last year on spike-in benchmark datasets. However, I did not spend a lot of time on this and it's been almost a year, so if there's a paper that performed proper analyses I'd be happy to take a closer look at that and investigate how to integrate it into MS-DAP.

ht-lau commented 11 months ago

Thank you for your reply.

I generally do impute. However, I have to downshift a lot (when drawing from a shifted normal distribution, as in the Perseus workflow) to get those condition-specific proteins, and I don't like it.

I will give the differential detection analysis a try.
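For readers unfamiliar with that approach, a minimal sketch of Perseus-style imputation: missing values are drawn from a normal distribution whose mean is shifted down from the observed mean and whose width is shrunk. The defaults of 1.8 and 0.3 standard deviations mirror Perseus's usual settings; the function itself is a hypothetical illustration, not Perseus or MS-DAP code.

```python
import random
import statistics


def impute_downshifted(values, downshift=1.8, width=0.3, seed=0):
    """Replace None entries with draws from N(mu - downshift*sd, (width*sd)^2),
    where mu and sd are computed from the observed log-intensities.
    Hypothetical helper for illustration only.
    """
    observed = [v for v in values if v is not None]
    mu = statistics.mean(observed)
    sd = statistics.stdev(observed)
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    return [
        v if v is not None else rng.gauss(mu - downshift * sd, width * sd)
        for v in values
    ]


log2_intensities = [20.1, 21.4, None, 19.7, 22.0, None, 20.8]
imputed = impute_downshifted(log2_intensities)
print(imputed)
```

Raising `downshift` pushes imputed values further below the observed range, which is the knob being turned when condition-specific proteins only become significant after a large shift.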

I also want to hijack my own thread. For remove_ambiguous_proteingroups, if I set it to FALSE, will the following situation

peptide A, GRIA1; GRIA2
peptide B, GRIA1

result in 1 protein or 2 proteins?

Thanks again

ftwkoopmans commented 11 months ago

remove_ambiguous_proteingroups = FALSE will yield both proteingroups in your example.

remove_ambiguous_proteingroups = TRUE will remove all proteingroups that contain mappings to multiple distinct genes. In your example, only the "GRIA1" proteingroup would remain.

Note that this is a post-hoc filter of the statistics result table (e.g. MSqRob output); there is no re-analysis with those specific peptides removed or anything like that. It was added due to popular demand: a few users asked for this because they always remove ambiguous proteingroups from their results and retain only 1 proteingroup per unique gene. This function also takes care of multiple proteingroups that map to the same gene (i.e. isoforms) by retaining the 1 proteingroup per unique gene that has the strongest (lowest) p-value.
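The behaviour described above can be sketched in a few lines of Python. This is an illustration of the logic only, not MS-DAP's implementation: the row layout, the column names, and the `filter_ambiguous` helper are hypothetical.

```python
def filter_ambiguous(rows, remove_ambiguous=True):
    """Post-hoc filter on a result table given as a list of dicts with keys
    'proteingroup', 'genes' (semicolon-separated) and 'pvalue'.
    Hypothetical helper for illustration, not MS-DAP code.
    """
    if not remove_ambiguous:
        return list(rows)  # table is returned unchanged
    best = {}
    for row in rows:
        genes = {g.strip() for g in row["genes"].split(";") if g.strip()}
        if len(genes) != 1:
            continue  # drop proteingroups not mapped to exactly one gene
        gene = next(iter(genes))
        # among proteingroups of the same gene, keep the lowest p-value
        if gene not in best or row["pvalue"] < best[gene]["pvalue"]:
            best[gene] = row
    return list(best.values())


results = [
    {"proteingroup": "PG1", "genes": "GRIA1;GRIA2", "pvalue": 0.001},
    {"proteingroup": "PG2", "genes": "GRIA1", "pvalue": 0.02},
    {"proteingroup": "PG3", "genes": "GRIA1", "pvalue": 0.004},  # isoform
]

kept_true = [r["proteingroup"] for r in filter_ambiguous(results, True)]
kept_false = [r["proteingroup"] for r in filter_ambiguous(results, False)]
print(kept_true)   # ['PG3']
print(kept_false)  # ['PG1', 'PG2', 'PG3']
```

Because the filter only drops rows from an existing table, the p-values of the remaining proteingroups are exactly those computed before filtering.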