gem / oq-engine

OpenQuake Engine: a software for Seismic Hazard and Risk Analysis
https://github.com/gem/oq-engine/#openquake-engine
GNU Affero General Public License v3.0

oq engine multicore without celery #5072

Closed bebosudo closed 5 years ago

bebosudo commented 5 years ago

Hi, I'm testing out oq on a scientific cluster, using SLURM as the batch manager, running on a computing node with 36 cores, with Python 3.6.

The script I'm using is this:

#!/usr/bin/env bash
#SBATCH -p partition_here
#SBATCH -J openquake_test
#SBATCH -t 1:00:00
#SBATCH --hint=nomultithread
#SBATCH -N 1
#SBATCH -n 36

module load python/3.6.4
source ~/.venvs/openquake_test/bin/activate
oq engine --run ~/openquake/oq-engine/demos/hazard/Disaggregation/job.ini

During the execution of the demo I was using (h)top to check the number of processes on the node, and only a single worker process is used. I'm interested in running on a single node only, so I didn't install the celery-related packages from the requirements file since, according to the FAQ, multi-core parallelism is also available without celery.

Is there a config file to set to use all the cores available inside a node? Or does it depend on the simulation being executed?

micheles commented 5 years ago

On a single machine the engine uses the multiprocessing library to distribute work across the cores. It could be that SLURM is interfering with multiprocessing. We cannot tell you for sure, because we do not have access to a SLURM cluster. So, if you want help on this, give us access to a SLURM cluster :-)
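For reference, here is a minimal sketch of how Python's multiprocessing library distributes CPU-bound work across cores (an illustrative example, not the engine's actual code; the worker function `square` is made up for the demo):

```python
import multiprocessing


def square(x):
    # CPU-bound work; each call may run in a separate worker process
    return x * x


if __name__ == "__main__":
    # Pool() with no argument uses os.cpu_count() workers, i.e. every
    # core the OS reports, which may be more than SLURM allocated to
    # the job
    with multiprocessing.Pool() as pool:
        print(pool.map(square, range(10)))
```

If SLURM pins the job to fewer cores than `os.cpu_count()` reports, the extra workers end up time-sharing the allowed cores instead of running in parallel.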

PS: this is question for the mailing list, not an issue for GitHub, please repost on https://groups.google.com/forum/?hl=en&pli=1#!forum/openquake-users

bebosudo commented 5 years ago

Thanks for the reply @micheles. I've tried to analyze in more detail the workflow of some of the demo simulations. There is actually a multiprocessing phase where dozens of Python processes are spawned (more than the number of processors requested by SLURM; that number is probably the total number of cores on the node). These processes are all:

/galileo/home/userexternal/cscaini0/.venvs/openquake_test/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=17, pipe_handle=49) --multiprocessing-fork

After a while these processes disappear, and only a single process called oq-worker runs at 100% (one core used); finally other processes are spawned, possibly to collect and save the output files.

I cannot give you direct access to the SLURM cluster, but I can develop a solution if I receive some guidance. Is there a point in oq where the number of processes for multiprocessing is set? SLURM sets the environment variable SLURM_JOB_CPUS_PER_NODE to the number of cores available per node during the session: this value could be passed to multiprocessing to create the correct number of processes for the run.
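The idea above could be sketched as follows (a hypothetical helper, not part of the engine; note that SLURM_JOB_CPUS_PER_NODE can also take compact forms like "18(x2)", which this sketch does not parse):

```python
import multiprocessing
import os


def n_workers():
    # Prefer the core count SLURM allocated to this job; outside of
    # SLURM (or for compact values like "18(x2)") fall back to the
    # machine's total core count
    env = os.environ.get("SLURM_JOB_CPUS_PER_NODE", "")
    return int(env) if env.isdigit() else multiprocessing.cpu_count()


if __name__ == "__main__":
    # Size the pool from the SLURM allocation instead of cpu_count()
    with multiprocessing.Pool(processes=n_workers()) as pool:
        print(pool.map(abs, [-1, -2, 3]))
```

Something along these lines could make the engine spawn exactly as many workers as SLURM grants, instead of one per physical core.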

PS: I think this is quite a technical question, which could be more usefully discussed here. If you'd rather discuss it on the mailing list, I'll repost there.

micheles commented 5 years ago

Yes, please use the mailing list. GitHub is for code-related issues or bugs, and this is neither.