Nesvilab / philosopher

PeptideProphet, PTMProphet, ProteinProphet, iProphet, Abacus, and FDR filtering
https://philosopher.nesvilab.org
GNU General Public License v3.0

fatal error: out of memory #18

Closed · pisistrato closed this issue 6 years ago

pisistrato commented 6 years ago

Hi, I am having an issue processing a huge data set (~1600 files). For convenience, I have all the files in the same folder. I was able to successfully process the data with MSFragger and PeptideProphet/ProteinProphet (via the MSFragger GUI), but I hit an out-of-memory error at the execution of

philosopher_windows_amd64.exe filter --sequential --mapmods --tag rev_ --pepxml X:\ --protxml X:\interact.prot.xml

I am going to try running the command on a machine with ~300 GB of RAM soon to see if that is enough, but in case it isn't, what would be the best way to get the processing completed?

I was thinking of splitting the data into multiple batches by creating several folders, moving ~10 pep.xml files into each, and then running the philosopher command multiple times, something like

philosopher_windows_amd64.exe filter --sequential --mapmods --tag rev_ --pepxml X:\batch1 --protxml X:\interact.prot.xml

philosopher_windows_amd64.exe filter --sequential --mapmods --tag rev_ --pepxml X:\batch2 --protxml X:\interact.prot.xml

Would this work? My concern is that there is just one interact.prot.xml for all the data; is this a problem? I guess I would then need to run

philosopher_windows_amd64.exe report

for each batch, which will probably return multiple .tsv files. Is there a way to get everything in a single output?

Thanks
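(The thread's eventual answer to the single-output question is the Abacus command, but the per-batch psms.tsv tables can also be concatenated by hand. A minimal sketch with standard shell tools; the merge_psms name and the file layout are assumptions, not part of Philosopher, and it keeps only the first file's header row:)

```shell
# merge_psms OUT FILE1 FILE2 ...
# Concatenate TSV reports into OUT, keeping the header line only from
# the first file and appending the data rows of every file.
merge_psms() {
    out=$1; shift
    head -n 1 "$1" > "$out"        # header from the first report
    for f in "$@"; do
        tail -n +2 "$f" >> "$out"  # data rows, header skipped
    done
}

# Usage (hypothetical layout): merge_psms combined_psms.tsv batch*/psms.tsv
```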

prvst commented 6 years ago

@pisistrato

Based on what you are describing, I think you should split your files into different folders. When doing that, you don't necessarily need to run the commands several times: if you have the latest Philosopher version, you can run the Pipeline command and pass it each folder you want to process. Since we are still updating the website with more tutorials, there isn't one for the command right now, but there will be in the future.

Basically, what you need to do is organize your data files into different folders and then use the configuration file to determine how to process your data (select the commands); the program will run everything for you in an automated way, demanding much less memory from your computer.

Regarding your question about having a single protXML file: for scenarios like this one, having a single protXML will actually benefit you, because you will have a single protein inference for the entire experiment, making it easier to do any type of comparison later. If you check the Abacus command, you will also get a combined protein report for all the files/folders.

pisistrato commented 6 years ago

@prvst

Thanks for the reply. I am going to try out the Pipeline command and see if I can get it done. In the meantime, this morning we tried performing the analysis on a much more powerful machine (Linux) with a serious amount of RAM, and it worked until we got this error message:

INFO[11:17:57] Executing Workspace 20180323                 
INFO[11:17:58] Creating workspace                           
INFO[11:17:58] Done                                         
INFO[11:17:58] Executing Database 20180323                  
INFO[11:17:58] Processing database                          
INFO[11:18:53] Done                                         
INFO[11:18:53] Executing Filter 20180323                    
INFO[11:18:53] Processing peptide identification files      
FATA[12:44:12] Cannot save results, Bad formatgob: encoder: message too big 
INFO[12:44:13] Executing Report 20180323                    
FATA[12:44:13] cannot restore serialized data structures: invalid argument 

Is this because we are attempting to perform the analysis on a Linux machine using data generated under Windows? Or something else? I just want to be sure this error is not also going to happen with the Pipeline approach you suggested.

On a side note, we also tried modifying the paths in the protXML file from Windows-style to Unix-style, but that did not work either.

prvst commented 6 years ago

Yes, the problem is that you moved files from one system to another. You can start a fresh analysis on your GNU/Linux machine, or you can copy your Prophet results there; there is no need to update the paths.

pisistrato commented 6 years ago

@prvst

A few pieces of feedback from my side:

It seems that the error FATA[12:44:12] Cannot save results, Bad formatgob: encoder: message too big is due to the number/size of files Philosopher is trying to process. We first tried splitting the data set (~1300 files) into 4 folders, but this did not help: same error. After splitting the data set into 40 folders, Philosopher was able to complete the filter task.

Unfortunately, we did not manage to use the Pipeline option, as Philosopher could not find the interact*.pep.xml files. It looked like it was looking for a single interact.pep.xml file instead of the multiple pep.xml files generated in the step(s) before.

Bottom line: when using the MSFragger GUI to process a huge data set and a single output result is wanted, one can encounter two problems:

- out of memory, which can be 'fixed' by moving to a more powerful machine
- Bad formatgob, which can be 'fixed' by splitting the files into multiple folders (for us, ~30 files/folder)
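(The folder split that worked here can be scripted. A minimal sketch with plain shell tools; the split_batches name, the batch size of 30, and the interact-*.pep.xml glob are assumptions about the file naming, adjust to your own:)

```shell
# split_batches [SIZE]
# Move matching files in the current directory into numbered batchN/
# subfolders, SIZE files per folder (default 30, as used in this thread).
split_batches() {
    size=${1:-30}
    i=0
    batch=0
    for f in interact-*.pep.xml; do
        [ -e "$f" ] || continue            # skip if the glob matched nothing
        if [ $((i % size)) -eq 0 ]; then
            batch=$((batch + 1))
            mkdir -p "batch$batch"         # start a new batch folder
        fi
        mv "$f" "batch$batch/"
        i=$((i + 1))
    done
}
```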

prvst commented 6 years ago

@pisistrato

Are you running PeptideProphet as well ?

pisistrato commented 6 years ago

@prvst yes, and it went fine: --decoy rev_ --nonparam --expectscore --decoyprobs --masswidth 1000.0 --clevel -2

prvst commented 6 years ago

The PeptideProphet result will be one or more interact files (depending on the parameters you used). Based on the parameters you posted, you should have one interact file for each database search result. If you run Philosopher with the pipeline option, the program will look for a combined interact file called interact.pep.xml.

For huge or complex data sets, I suggest working with the programs on the command line, not through the GUI.

pisistrato commented 6 years ago

Ok, thanks for the explanation. One last question, to be sure we are not doing anything wrong, as we are not quite sure how the software uses the files.

So: 1300 mzML files in one folder, processed by MSFragger, which produced 1300 pepXML files. These were processed by PeptideProphet (1300 interact****.pep.xml) and ProteinProphet (just one interact.prot.xml). At this point we split the pep.xml files into 40 folders, while the prot.xml remained in the main folder. Now, within each folder, we are running workspace --init again, database --annotate again, and the filter, which creates some .bin files in the .meta folder. This should complete in a few hours. Afterwards we will simply run the report in each folder and get the tsv tables. As I am interested in the open search, I only care about the psms.tsv files, which I can then combine into a single one. Does that sound right? Will this "procedure" affect the results in any (bad) way?
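(The manual per-folder procedure described above can be sketched as a loop. This is a hedged sketch, not Philosopher's own pipeline: the run_folders helper, the PHILOSOPHER/FASTA variables, and the assumption that interact.prot.xml sits in the parent folder are all placeholders; the filter flags are the ones used earlier in this thread.)

```shell
# Hypothetical driver for the manual per-folder workflow.
# PHILOSOPHER: path to the philosopher binary (override via env).
# FASTA: annotated database file; "db.fas" is a placeholder name.
run_folders() {
    p=${PHILOSOPHER:-philosopher}
    fa=${FASTA:-db.fas}
    for d in "$@"; do
        (
            cd "$d" || exit 1
            "$p" workspace --init
            "$p" database --annotate "$fa"
            "$p" filter --sequential --mapmods --tag rev_ \
                 --pepxml . --protxml ../interact.prot.xml
            "$p" report
        )
    done
}

# Usage: run_folders batch1/ batch2/ ... batch40/
```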

prvst commented 6 years ago

I believe that will work fine, but instead of doing it manually, you can just use the pipeline command and have all of that automated. Start with the 40 folders clean, with only the corresponding mzML files inside them.

The pipeline command needs a configuration file, which can be generated by running philosopher pipeline --print. Open the config file, check which commands you want executed in each folder, and set the desired parameters.

Once everything is in place, run: philosopher pipeline --config philosopher.yaml folder1/ folder2/ folder3/ (...) folder40/

Make sure the Abacus option is also marked for execution; that way you will get one protXML created from all your interact.pep.xml files.

pisistrato commented 6 years ago

Done, thanks!