bigbio / nf-workflows

Repository of Nextflow+BioContainers workflows
GNU General Public License v2.0

Reanalysis pipeline with X!Tandem and MSGF+ runs out of memory #6

Open ypriverol opened 5 years ago

ypriverol commented 5 years ago

The current pipeline performs the following analysis:

This part of the pipeline works perfectly well. However, it would be great if we could obtain a single global mzTab with all the PSMs at 1% PSM-level FDR, rather than one file per MGF.

I have tried to implement a first iteration of this last step here:

https://github.com/bigbio/nf-workflows/blob/master/xt-msgf-nf/xt-msgf-nf.nf#L243

But I have found performance issues in PIA when combining millions of PSMs, @julianu. In the last run, PIA was only able to add 6M PSMs to the compiler before failing, with 20 GB of memory allocated.

What do you think, @jgriss?

ypriverol commented 5 years ago

We can improve the pipeline by doing:

1- First, I will try to export mzIdentML instead of mzTab: one mzIdentML per combination, filtered at 10% FDR at the PSM level. We can't generate mzTab in that step because mzTab does not contain all the information of the experiment, and then things get messy. I have tried to modify PIA but it doesn't work. Probably @julianu can help here.

2- Second, we merge only the mzIdentML files from that step, so the pipeline merges about half as many files and probably fewer than 50% of the original PSMs. In the last step we generate the mzTab files.
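The two steps above can be sketched generically. This is a toy, self-contained illustration of the strategy, not PIA's actual API: each PSM is modeled as a `(score, is_decoy)` pair, stage 1 filters each run alone at the loose 10% FDR (keeping decoys so the final FDR can still be estimated), and stage 2 merges only the survivors before applying the strict 1% cut. The simple running decoy/target ratio here stands in for PIA's real FDR computation.

```python
import heapq

def fdr_cut(psms_sorted, threshold, drop_decoys=False):
    """Keep best-first PSMs up to the last position where the running
    decoy/target ratio stays at or below `threshold` (simplified FDR)."""
    targets = decoys = last_ok = 0
    for i, (score, is_decoy) in enumerate(psms_sorted):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if decoys / max(targets, 1) <= threshold:
            last_ok = i + 1
    kept = psms_sorted[:last_ok]
    return [p for p in kept if not p[1]] if drop_decoys else kept

def two_step_merge(runs, loose=0.10, strict=0.01):
    """Step 1: filter each run alone at a loose FDR, keeping decoys.
    Step 2: merge only the survivors (already sorted, so heapq.merge can
    stream them) and apply the strict cut, dropping decoys at the end."""
    survivors = [fdr_cut(sorted(run, key=lambda p: -p[0]), loose)
                 for run in runs]
    merged = list(heapq.merge(*survivors, key=lambda p: -p[0]))
    return fdr_cut(merged, strict, drop_decoys=True)
```

The point of the sketch is the memory profile: the merge step only ever sees the pre-filtered survivors, and `heapq.merge` consumes the per-run lists lazily instead of concatenating and re-sorting everything.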

What do you think, @jgriss?

julianu commented 5 years ago

PIA is a bit memory-greedy, but here is something you could test (it worked on my side to merge a lot of mzIdentML files): first, load each result file alone and only perform a PSM-level FDR filtering at about 3%-5%. When writing these results back out to mzIdentML, you keep the experiment info and get much smaller files. These pre-filtered files can then be used to merge all the relevant PSMs, which consumes much less memory and should generally also be much faster.
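As a reference point for the per-file pre-filtering step, a minimal target-decoy q-value filter could look like the following. This is a generic sketch, not how PIA computes FDR internally; the `(score, is_decoy)` tuple layout is an assumption for illustration.

```python
def qvalue_filter(psms, threshold):
    """Target-decoy q-value filter: rank PSMs best-first, compute the
    running FDR (decoys/targets), make it monotone from the bottom
    (q-values), and return the target PSMs at or below `threshold`.

    `psms` is a list of (score, is_decoy) tuples; higher score = better.
    """
    ranked = sorted(psms, key=lambda p: -p[0])
    fdrs, targets, decoys = [], 0, 0
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))
    qvalues = fdrs[:]
    for i in range(len(qvalues) - 2, -1, -1):  # enforce monotonicity
        qvalues[i] = min(qvalues[i], qvalues[i + 1])
    return [p for p, q in zip(ranked, qvalues) if not p[1] and q <= threshold]
```

Running this per file at a loose threshold (e.g. 0.03-0.05) before merging is exactly what shrinks the intermediate files: everything below the loose cut never reaches the merge step.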