compomics / compomics-utilities

Open source Java library for computational proteomics
http://compomics.github.io/projects/compomics-utilities.html

Core usage by PeptideMapper #34

Closed · andrewjmc closed this issue 4 years ago

andrewjmc commented 4 years ago

Hello,

I am using PeptideMapper as part of the Metanovo pipeline (https://github.com/uct-cbio/proteomics-pipelines/). However, parallelising the PeptideMapper steps to match the number of available cores fails: PeptideMapper appears to use an unpredictable number of threads, and I quickly exceed the limits set by the batch system (PBS). The job is then killed.

I can see no way to control the number of threads from the command line. I have tried running the command under the control of cpulimit (https://github.com/opsengine/cpulimit). This seems to work well in isolation, but when I run 280 jobs (an array job) it no longer works and the CPU limits are exceeded. I wonder whether many instances of cpulimit struggle to work together on a single machine.
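For reference, the wrapped invocation looked roughly like this - the -l (CPU percentage) and -i (include child processes) switches are cpulimit's documented options, while the jar name and file paths are placeholders rather than my exact ones:

```
# Run PeptideMapper under cpulimit, capping it at ~1 core (100%)
# and counting the JVM's child processes against the limit.
cpulimit -l 100 -i \
    java -cp utilities-X.Y.Z.jar \
         com.compomics.cli.peptide_mapper.PeptideMapperCLI \
         -p PATH/TO/FASTA_FILE.fasta PATH/TO/PEPTIDE_LIST.csv OUTPUT.csv
```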

Unless there are any other clever solutions, is there a way of explicitly instructing PeptideMapper to use a certain number of threads?

Thanks,

Andrew

mvaudel commented 4 years ago

Hi,

Thank you for contacting us about this. I had a quick look at the code, and if you are using the command line below, it should not be multithreaded - peptides and tags are mapped one after the other. https://github.com/uct-cbio/proteomics-pipelines/blob/master/bin/bash/metanovo.sh#L94

If you see CPU usage jumping up as RAM fills, my best guess is that Java is using additional resources for garbage collection, so you would need to limit the resources given to the JVM - I am not sure how to do that on your setup...
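Something along these lines might work - a minimal sketch assuming a HotSpot JVM, with the jar name, heap size, and thread counts as placeholders rather than recommendations:

```
# Cap the heap and pin garbage collection to a single thread.
# -XX:+UseSerialGC would be an even stricter alternative.
java -Xmx4g \
     -XX:ParallelGCThreads=1 -XX:ConcGCThreads=1 \
     -cp utilities-X.Y.Z.jar \
     com.compomics.cli.peptide_mapper.PeptideMapperCLI \
     -p PATH/TO/FASTA_FILE.fasta PATH/TO/PEPTIDE_LIST.csv OUTPUT.csv
```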

I must confess that we wrote this command line as a quick solution for people to use PeptideMapper outside DenovoGUI and PeptideShaker but the implementation can be improved. We will look into it and implement multithreading.

Please let me know if there is anything I am missing,

Marc

andrewjmc commented 4 years ago

Hi Marc,

Thanks for your speedy response! Multithreading could be a perfect solution.

When I track the processor usage on our HPC login nodes, I find that it frequently sits at ~120% but jumps up to ~250% at times (especially just after indexing). I too surmised this might be to do with garbage collection and used the command-line switch to limit the JVM to a single GC thread. Interestingly, this didn't seem to make a difference.

I tried the cpulimit Linux application. Though this worked effectively in a single test, it did not seem to work when 280 jobs were run on the cluster in parallel (I wonder whether so many instances can run together when they often end up on the same physical machine).

At the moment, in order to run at all, I need to request 4 CPUs for each job, as (inexplicably) some jobs are being terminated for using more than 4 CPUs. It seemed even worse when I didn't break the large tags.txt file (>1 GB) into smaller chunks.
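For context, the resource request in my job script now looks roughly like this (PBS Pro syntax with an illustrative memory value; on Torque the equivalents would be -l nodes=1:ppn=4 and -t 1-280):

```
# One chunk per array index, 4 CPUs each to absorb the thread spikes.
#PBS -J 1-280
#PBS -l select=1:ncpus=4:mem=8gb
```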

Requesting the extra CPUs will hopefully do the job, but it does mean asking for more compute than I am actually using (the HPC may penalise me for such misbehaviour :-] ).

Best wishes,

Andrew

dominik-kopczynski commented 4 years ago

Hey, as stated, the code has been pushed to the new_backend branch. Peptide and sequence tag mapping now happens in parallel, and you can choose the number of cores with an additional -c parameter at the end of the command.

Example: java -cp utilities-5.X.X-foo.jar com.compomics.cli.peptide_mapper.PeptideMapperCLI -p PATH/TO/FASTA_FILE.fasta PATH/TO/PEPTIDE_LIST.csv OUTPUT.csv -c 4

By default, all available cores will be used. On my computer, there is a performance increase. However, please do not expect an n-fold speedup when using n cores, since all cores share the same index in memory. I will close the thread now, but if you run into any issues, please don't hesitate to reopen it :-) And please tell us how well the modifications perform in the Metanovo pipeline. Cheers