Nesvilab / FragPipe

A cross-platform proteomics data analysis suite
http://fragpipe.nesvilab.org

Philosopher filter command is much slower than all other steps (18 days vs a few hours) #948

Closed Skourtis closed 1 year ago

Skourtis commented 1 year ago

Hi!

I'm doing a small trial for my experiment, in which I'm currently running 5000 SILAC DDA raw files through the different modules of FragPipe. MSFragger took about 48 hours to finish for all files (run in smaller batches). But when I combined all the MSFragger output for the validation step to get a global FDR across all the files, it turned out to be much slower than I anticipated (and slower than MaxQuant, which took 11 days for the whole workflow from raw files to ion-level SILAC quantification). It has now been 20 days on the validation step alone and it is still running (it has just finished writing the Philosopher reports). Normally the whole workflow, with a smaller number of .raw files (e.g. 10), is much faster than MaxQuant.

I'm running this on a server with 95 cores and 86 GB of RAM, all of which are being used (in the case of MSFragger, extremely efficiently).

Is this expected behaviour? Are there any settings or intermediate file-writing steps I can turn off that would speed this up? The actual experiment I need to run is 10x bigger, which means that it might take 200 days to finish. At this point, I am willing to sacrifice some accuracy in return for a big improvement in run time.

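Am I doing something wrong?

Thank you,
Savvas

[image] https://user-images.githubusercontent.com/51754041/210202686-234999fd-04fb-48a9-b77d-a44b528815c1.png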

anesvi commented 1 year ago

I do not think you need 'generate peptide summary' and 'protein summary'. I think these reports in Philosopher take all that time. IonQuant will generate its own reports. The data will still be filtered to 1 percent global protein FDR (but experiment-specific PSM FDR). Maybe you can have a Zoom call with us so we can understand how you want to filter the data.

Alexey

fcyu commented 1 year ago

Hi Savvas,

Since it is still running, can you copy or export all the text in the "console" panel of the "run" tab and send it to us? You can upload the file to the issue page or send it to yufe AT umich.edu.

Thanks,

Fengchao

Skourtis commented 1 year ago

Here is the current log output (exported from the console).

The validation module is currently on 'Restoring PSM results' for the last 24 hrs.

Hopefully you can access the link.

Thank you, Alexey! I'll definitely turn off the writing of summaries as you suggested, and we can arrange a meeting once we see whether the log file helps in understanding if I could have done something better.

Thanks!

fcyu commented 1 year ago

Hi Savvas,

Thank you very much for your log file. Below are the summarized run times for the major steps (MSFragger is not in the log file, but according to your description it took ~48 hours):

Percolator: ~2.5 hours
ProteinProphet: ~2 hours
Philosopher init workspace: ~4 hours
Philosopher annotate database: ~10 hours
Philosopher filter: ~18 days
Philosopher report: ~3.5 hours

I didn't include iProphet and Abacus since they haven't finished and can be skipped.

It looks like the Philosopher filter command does take a lot of time. We need Felipe @prvst to take a look.

Best,

Fengchao
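
For a sense of proportion, the run times summarized above can be totalled in a few lines of Python. This is only a back-of-the-envelope sketch: the numbers are the approximate figures quoted in this thread, iProphet and Abacus are left out because they had not finished, and the dictionary and variable names are illustrative rather than anything FragPipe produces.

```python
# Approximate per-step run times in hours, as reported in this thread.
# MSFragger's ~48 h comes from the user's description, not from the log.
step_hours = {
    "MSFragger": 48.0,
    "Percolator": 2.5,
    "ProteinProphet": 2.0,
    "Philosopher init workspace": 4.0,
    "Philosopher annotate database": 10.0,
    "Philosopher filter": 18 * 24.0,  # ~18 days expressed in hours
    "Philosopher report": 3.5,
}

total_hours = sum(step_hours.values())
filter_share = step_hours["Philosopher filter"] / total_hours

# Print the steps from slowest to fastest with their share of the total.
for step, hours in sorted(step_hours.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{step:32s} {hours:7.1f} h  {100 * hours / total_hours:5.1f} %")
print(f"Total: {total_hours:.1f} h (~{total_hours / 24:.1f} days)")
```

With these approximate figures the filter step accounts for roughly 86% of the listed run time, which is why it is the step singled out here.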

prvst commented 1 year ago

You're planning to process 50,000 files at the same time?

Can you paste here the log from the filter?

anesvi commented 1 year ago

Are all 5000 files analyzed as a single experiment? Yes, it looks like Philosopher cannot handle big datasets like that. We need to discuss with Felipe what we can do.

fcyu commented 1 year ago

To speed up the turnaround, let me answer these questions for the user, since that information can be found in the log file.

> You're planning to process 50,000 files at the same time?

I am not sure about other tools, but I think MSFragger, Percolator, and IonQuant can handle that many files in a single task.

> Can you paste here the log from the filter?

Here you are: philosopher_fiter_log.zip

> Are all 5000 files analyzed as a single experiment?

All 5000 files are in 5000 experiments. So, there are 5000 filter commands.

Best,

Fengchao

prvst commented 1 year ago

Thanks.

The filter processing time seems OK. The individual experiments are small, and the filtering is taking between 4 and 5 minutes per experiment, most likely because of the protein inference, which is larger.

The delay happens because you have to run it 5,000 times. I don't see a way to change the algorithm right now to improve that, but I'll take another look at some critical steps. My suggestion is to split the execution so you can process batches of files in parallel.
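
The arithmetic behind this estimate is easy to check. The sketch below multiplies the per-experiment filter time quoted above (4 to 5 minutes) by the number of experiments, assuming the runs happen strictly one after another; the function name is made up for illustration.

```python
def sequential_filter_days(n_experiments: int, minutes_per_experiment: float) -> float:
    """Wall-clock days if every experiment is filtered one after another."""
    return n_experiments * minutes_per_experiment / 60 / 24


# 5,000 experiments is the current trial; 50,000 is the planned full run.
for n in (5_000, 50_000):
    low = sequential_filter_days(n, 4)
    high = sequential_filter_days(n, 5)
    print(f"{n:>6} experiments: {low:.1f} to {high:.1f} days")
```

For 5,000 experiments this gives about 14 to 17 days, in line with the ~18 days seen in the log; for 50,000 experiments it gives about 139 to 174 days.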

fcyu commented 1 year ago

Does the filtering command scale linearly with the number of files? If so, why would splitting the 5000 files into smaller batches help, since the total run time would not change?

Thanks,

Fengchao
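
One possible reading of the batching suggestion is that the total compute time stays the same while the elapsed (wall-clock) time shrinks, because several filter runs proceed side by side. The sketch below models only that idea. It assumes the per-experiment runs are fully independent and do not compete for memory or disk; whether Philosopher's filter step can actually be run this way, and how many concurrent runs 86 GB of RAM would allow, is not established in this thread. Both function names are hypothetical.

```python
import math


def wall_clock_days(n_experiments: int, minutes_each: float, workers: int) -> float:
    """Elapsed days when `workers` filter runs proceed side by side."""
    rounds = math.ceil(n_experiments / workers)  # number of rounds of `workers` runs
    return rounds * minutes_each / 60 / 24


def compute_days(n_experiments: int, minutes_each: float) -> float:
    """Total compute time in days; independent of the number of workers."""
    return n_experiments * minutes_each / 60 / 24


for workers in (1, 4, 10):
    print(
        f"{workers:>2} workers: {wall_clock_days(5_000, 5, workers):5.1f} days elapsed, "
        f"{compute_days(5_000, 5):.1f} days of compute"
    )
```

Under these assumptions, linear scaling and a benefit from splitting are not in conflict: the compute total is fixed, but the elapsed time divides by the number of batches that genuinely run at once.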

Skourtis commented 1 year ago

Hi everyone! Thanks for looking into this.

@prvst yes. The plan is to reanalyse all available SILAC PRIDE experiments with FragPipe (20 TB) and provide an annotated, FDR-controlled reanalysis of all this data. This was originally done by my supervisor Georg Kustatscher with MaxQuant for the 5000 raw files I'm trying to process now.

We chose the FragPipe pipeline because it's extremely fast on small experiments, and I assumed this would be true for bigger experiments.

When I consulted @fcyu on a previous question, he suggested that only the validation module would actually need to process all files at once. MSFragger and IonQuant, for example, are run in batches because each PRIDE project used different SILAC heavy/light labels, so these modules only see a subset of the data at a time (at most 1000 raw files).

@fcyu what do you mean when you say iProphet and Abacus can be skipped? Could I turn them off? I'll have a look at what these modules are doing.

fcyu commented 1 year ago

> @fcyu what do you mean iProphet and Abacus can be skipped? Could I turn them off? I'll have a look at what these modules are doing.

Those two are used for the "generate peptide-level summary" and "generate protein-level summary" options mentioned by Alexey, which can be unchecked since IonQuant can generate those reports later.

Best,

Fengchao

anesvi commented 1 year ago

Hi Savvas,

Could you please email me directly at nesvi at med.umuich.edu

We would have to make changes in the pipeline to be able to speed up the filtering step for the type of project you are doing. But it will require work on our side, and it is a one-of-a-kind type of project, so we can only do things like this as a collaboration. If you are interested, email me and we can discuss. Otherwise, unfortunately, we do not have a way now to make the Philosopher filter command process 50,000 experiments in a reasonable time.

Best,
Alexey

Skourtis commented 1 year ago

Hi Alexey,

That's great, and we are very much interested in this collaboration. I emailed you the details at nesvi at med.umuich.edu, but I got an automatic delivery-failure email, so I also emailed you at nesvi at umich.edu, and it looks like that one was delivered.

Just replying here to make sure you've received it and we can continue the conversation there.

Thank you for your willingness to help! Savvas