Nesvilab / FragPipe

A cross-platform Graphical User Interface (GUI) for running MSFragger and Philosopher - powered pipeline for comprehensive analysis of shotgun proteomics data
http://fragpipe.nesvilab.org

Threading of DIAUmpire across multiple cores when run in FragPipe on Linux vs Windows? #1300

Closed jossmi closed 9 months ago

jossmi commented 9 months ago

We are running DIA-Umpire in FragPipe on a Linux-based computing cluster and are seeing significantly longer run times than on a Windows machine. We specified 16 threads in the FragPipe settings, but I was wondering whether we are missing something and it is running one at a time, or whether there are differences between the Linux and Windows implementations or settings that we have overlooked. I've attached the manifest and workflow files, as well as the shell script (".sh") and command-line output (".sh.o") files used during submission to the Sun Grid Engine cluster, which runs Fedora Linux. Below is the command my colleague used to submit the batch:

qsub -pe local 16 -R y -l mem_free=10G,h_vmem=10G -m e -M sburke24@jh.edu run_fenna_fragpipe.sh

Thanks, Josh

exposomics_workflow_workflow.txt exposomics_manifest_fp-manifest.txt run_fenna_fragpipe_sh_o3754679.txt run_fenna_fragpipe_sh.txt
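One quick check for the "is it really getting 16 cores?" part of the question is to compare what the scheduler granted with what a process inside the job can see. The following is a minimal sketch, not part of FragPipe. It assumes the job runs under Grid Engine, which sets the `NSLOTS` environment variable for parallel-environment jobs, and the helper name `cpu_report` is made up for illustration. It says nothing about how DIA-Umpire itself uses its threads.

```python
import os


def cpu_report(env=None):
    """Summarise CPUs granted by the scheduler vs. CPUs visible to this process.

    `cpu_report` is a hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    slots = env.get("NSLOTS")  # set by Grid Engine for parallel-environment jobs
    report = {
        "granted_slots": int(slots) if slots else None,
        "host_cpus": os.cpu_count(),  # all CPUs on the host, not just granted ones
    }
    # sched_getaffinity is only available on some platforms (e.g. Linux).
    if hasattr(os, "sched_getaffinity"):
        report["usable_cpus"] = len(os.sched_getaffinity(0))
    else:
        report["usable_cpus"] = None
    return report


if __name__ == "__main__":
    for key, value in cpu_report().items():
        print(f"{key}: {value}")
```

Running this from inside the submitted shell script, next to the FragPipe call, would show whether the granted slot count matches the thread count set in the workflow.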

fcyu commented 9 months ago

Hi Josh,

DIA-Umpire (and other modules) in FragPipe are written in Java. The same codebase is used for both Windows and Linux, since Java is a cross-platform language and follows the "write once, run anywhere" philosophy, so there is no difference in the code itself. However, we do see speed differences across operating systems and Java runtime versions. I don't think there is much we can do about that, since it is an operating-system-level question.

Best,

Fengchao

mremachine1 commented 9 months ago

Hey Fengchao, it appears that DIA-Umpire processes each file in series. Is there a way to make this parallel, since the operations are independent of each other at this point?

fcyu commented 9 months ago

But a single DIA-Umpire instance already runs in parallel. Since it normally takes several minutes to process one file, I don't think there is much difference between running one instance with multiple threads and running multiple instances with one thread each. Furthermore, running multiple instances in parallel is not a good idea, because we have to keep the memory footprint within a reasonable range.

Best,

Fengchao

mremachine1 commented 9 months ago

Thanks Fengchao. Since our DIA-Umpire run is taking much longer, would it be reasonable to give the job more cores? And how does processing time change with the number of threads supplied?

jossmi commented 9 months ago

Thanks Fengchao. As a follow-up question: in general terms, what is the relationship between the number of cores and DIA-Umpire run speed? E.g., is it linear, so that speed increases proportionally with as many cores as we request? Or is there a non-linear plateau where the benefit of threading across multiple cores maxes out? I understand this is fairly theoretical and would probably need empirical testing, but we are trying to work out the best way to speed up processing while using the computing resources we request and pay for efficiently.

Josh
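For the general, theoretical side of this question, the usual starting point is Amdahl's law: if a fraction p of the work can run in parallel, the best possible speedup on n cores is 1 / ((1 - p) + p / n), which levels off at 1 / (1 - p) however many cores are added. The sketch below only evaluates that formula. The parallel fractions are illustrative, nobody in this thread has measured them for DIA-Umpire, and the function name is made up for the example.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup on `cores` cores when `parallel_fraction` of the work is parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Illustrative fractions only; these are NOT measurements of DIA-Umpire.
    for p in (0.5, 0.9, 0.99):
        cells = ", ".join(
            f"{n} cores: {amdahl_speedup(p, n):.2f}x" for n in (1, 4, 16, 64)
        )
        print(f"p={p}: {cells}")
```

Under this model the curve is never linear unless p is exactly 1. Real runs can do worse than the formula because of disk I/O, memory limits, and garbage collection, so it serves as an upper bound and does not predict actual run times.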

fcyu commented 9 months ago

I don't think we have tested this for DIA-Umpire. As you know, it is hard to reach a conclusion from theoretical analysis alone, and the answer most likely also depends on the operating system and Java version. Feel free to run the test yourself.

Best,

Fengchao
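An empirical test along these lines could be as simple as timing the same input at several thread counts and comparing the results. The following is a rough sketch of such a harness, not an official benchmark. `scaling_table` and the `run` callable are hypothetical names, and the demo workload is a stand-in sleep, not DIA-Umpire. For a real test, `run(threads)` would need to launch your own FragPipe job with the thread setting changed accordingly.

```python
import time


def scaling_table(run, thread_counts, repeats=1):
    """Time run(threads) for each count.

    Returns rows of (threads, seconds, speedup, efficiency), where speedup and
    efficiency are relative to the first entry in thread_counts.
    """
    timings = []
    for threads in thread_counts:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run(threads)
            best = min(best, time.perf_counter() - start)  # keep the fastest repeat
        timings.append((threads, best))

    base_threads, base_time = timings[0]
    rows = []
    for threads, seconds in timings:
        speedup = base_time / seconds
        efficiency = speedup * base_threads / threads  # 1.0 means perfect scaling
        rows.append((threads, seconds, speedup, efficiency))
    return rows


if __name__ == "__main__":
    # Stand-in workload (NOT DIA-Umpire): a fixed serial part plus a part
    # that shrinks as the thread count grows.
    def fake_job(threads):
        time.sleep(0.02 + 0.08 / threads)

    for threads, seconds, speedup, efficiency in scaling_table(fake_job, [1, 2, 4, 8]):
        print(f"{threads:>2} threads: {seconds:.3f}s  "
              f"speedup {speedup:.2f}x  efficiency {efficiency:.2f}")
```

The efficiency column is the useful one when paying for cluster time: once it drops well below 1, the extra cores are being requested but not converted into shorter run times.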