Closed bebosudo closed 5 years ago
On a single machine the engine uses the multiprocessing library to distribute work across the cores. It could be that SLURM is interfering with multiprocessing. We cannot tell you, because we do not have access to a SLURM cluster. So, if you want help on this, give us access to a SLURM cluster :-)
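To illustrate why SLURM can interfere here (this is a generic sketch, not the engine's actual code): by default, `multiprocessing.Pool()` sizes itself from `os.cpu_count()`, which reports every core on the node, not the subset SLURM actually allocated to the job.

```python
import multiprocessing as mp
import os

def square(x):
    return x * x

if __name__ == "__main__":
    # Pool() with no argument spawns os.cpu_count() workers: the total
    # number of cores on the node, regardless of the SLURM allocation.
    # Under cgroup-based core binding, those extra workers end up
    # time-sharing the few cores the job was actually given.
    print("default pool size:", os.cpu_count())
    with mp.Pool() as pool:
        print(pool.map(square, range(8)))
```

This mismatch between `os.cpu_count()` and the SLURM allocation is the usual cause of "many processes spawned, little CPU used" on batch clusters.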
PS: this is a question for the mailing list, not an issue for GitHub; please repost on https://groups.google.com/forum/?hl=en&pli=1#!forum/openquake-users
Thanks for the reply @micheles.
I've tried to analyze the workflow of some of the demo simulations in more detail; there is indeed a multiprocessing phase where dozens of Python processes are spawned (more than the number of processors requested by SLURM; that number is probably the total core count of the node). These processes are all:
```
/galileo/home/userexternal/cscaini0/.venvs/openquake_test/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=17, pipe_handle=49) --multiprocessing-fork
```
After a while these processes disappear and only a single process called oq-worker remains, running at 100% (one core in use); finally other processes are spawned, possibly to collect and save the output files.
I cannot give you direct access to the SLURM cluster, but I can develop a solution if I receive some guidance.
Is there a point in oq where the number of processors for multiprocessing is set? SLURM sets the environment variable SLURM_JOB_CPUS_PER_NODE to the number of cores available per node during the session; this number could be passed to multiprocessing to create the correct number of processes for the run.
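As a sketch of what I have in mind (hypothetical helper, not existing oq code): read SLURM_JOB_CPUS_PER_NODE and use it to size the pool. Note that for heterogeneous allocations SLURM writes a compact form such as `16(x2),8`, so the value needs a little parsing.

```python
import multiprocessing as mp
import os
import re

def slurm_cpus_on_node(default=1):
    """Return the per-node core count SLURM allocated to this job.

    SLURM_JOB_CPUS_PER_NODE may be a plain number ("36") or, for
    heterogeneous allocations, something like "16(x2),8"; here we
    simply take the first count, i.e. the cores on this node.
    """
    raw = os.environ.get("SLURM_JOB_CPUS_PER_NODE")
    if not raw:
        return default
    match = re.match(r"(\d+)", raw)
    return int(match.group(1)) if match else default

def work(x):
    return x + 1

if __name__ == "__main__":
    # Size the pool from the SLURM allocation instead of os.cpu_count(),
    # falling back to the full node when running outside SLURM.
    n = slurm_cpus_on_node(default=os.cpu_count())
    with mp.Pool(processes=n) as pool:
        print(pool.map(work, range(4)))
```

Something along these lines, wired into wherever the engine decides its pool size, is the kind of change I could contribute with some guidance.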
PS: I think this is a fairly technical question that would be more useful to discuss here. If you'd rather move it to the mailing list, I'll repost there.
Yes, please use the mailing list. GitHub is for code-related issues or bugs, and this is neither.
Hi, I'm testing out oq on a scientific cluster, using SLURM as the batch manager, running on a computing node with 36 cores and Python 3.6.
The script I'm using is this:
During the execution of the demo I was using (h)top to check the number of processes on the node, and only a single worker process was in use. I'm interested in running only on a single node, so I didn't install the celery-related packages from the requirements file since, according to the FAQ, multi-core parallelism is available without celery.
Is there a config file setting to use all the cores available on a node? Or does it depend on the simulation being executed?