cnr-ibf-pa / hbp-bsp-issues

Ticketing system for developers/testers and power users of the Brain Simulation Platform of the Human Brain Project

Implement CPUsPerNode in Jureca & Jureca Booster #444

Closed: antonelepfl closed this issue 5 years ago

antonelepfl commented 5 years ago

Hi @BerndSchuller, I would like to pass the number of CPUsPerNode so that it is translated to #SBATCH --ntasks-per-node=... in the bss_submit script. Piz Daint already implements this functionality.
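For context, the submission I am attempting looks roughly like this (a minimal sketch of a UNICORE REST job submission; $UNICORE_BASE and $TOKEN are placeholders for the site's REST base URL and an access token, and the executable is just an example):

# Sketch: POST a job description with a CPUsPerNode resource request
curl -X POST "$UNICORE_BASE/rest/core/jobs" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "Executable": "/bin/date",
        "Resources": { "Nodes": "10", "CPUsPerNode": "68" }
      }'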

Currently I get:

Tue Jun 18 14:35:51 CEST 2019: Could not submit job: Resource request <CPUsPerNode=68.0> is out of range. [XNJS error 31]
clupascu commented 5 years ago

I already use this on Jureca. I use 24 CPUsPerNode. Maybe 68 is out of range. I think 24 is the maximum you can use, but I am not sure.

BerndSchuller commented 5 years ago

CPUsPerNode is translated to --tasks-per-node, but the valid range is 0-48 (corresponding to the actual CPUs per node).

BerndSchuller commented 5 years ago

You can see the valid resource ranges with a GET to the /rest/core/factories/default_target_system_factory endpoint.
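For example (a sketch; $UNICORE_BASE and $TOKEN stand in for the site's REST base URL and your access token):

# Sketch: query the target system factory for resource ranges
curl -s "$UNICORE_BASE/rest/core/factories/default_target_system_factory" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer $TOKEN"

The JSON reply should list each resource together with its allowed range.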

antonelepfl commented 5 years ago

OK, with 48 it works. 1) Thanks for that endpoint, Bernd. 2) CPUsPerNode gets translated to --ntasks-per-node.

antonelepfl commented 5 years ago

3) In the past I was using up to 68 tasks per node on Jureca Booster when launching the srun command directly:

DEBUG:__main__:cmd: ['srun', '--cpus-per-task=1', '--ntasks-per-node=68', '--ntasks=680', '--nodes', '10', '/p/project/cvsk25/vsk2514/HBP/jureca-booster/21-12-2018/install/install/linux-centos7-x86_64/intel-18.0.2/neurodamus-hippocampus-i23mxn/bin/special', '-NFRAME', '1024', '/p/project/cvsk25/vsk2514/HBP/jureca-booster/21-12-2018/install/install/linux-centos7-x86_64/intel-18.0.2/neurodamus-hippocampus-i23mxn/lib/hoclib/init.hoc', '-mpi']
...

and I get that number of processes when running the job (on stdout): numprocs=680

And that works.
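For readability, the same invocation from the debug log, written out as a plain shell command:

srun --cpus-per-task=1 --ntasks-per-node=68 --ntasks=680 --nodes 10 \
    /p/project/cvsk25/vsk2514/HBP/jureca-booster/21-12-2018/install/install/linux-centos7-x86_64/intel-18.0.2/neurodamus-hippocampus-i23mxn/bin/special \
    -NFRAME 1024 \
    /p/project/cvsk25/vsk2514/HBP/jureca-booster/21-12-2018/install/install/linux-centos7-x86_64/intel-18.0.2/neurodamus-hippocampus-i23mxn/lib/hoclib/init.hoc \
    -mpi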

In the bss_submit script:

#!/bin/bash
#SBATCH --job-name=Microcircuit
#SBATCH --partition=booster
#SBATCH --account=vsk25
#SBATCH --nodes=10
#SBATCH --ntasks-per-node=1
#SBATCH --time=180
#SBATCH --output=/p/scratch/cvsk25/unicore-jobs//a29165bb-da5a-424d-88ac-a4b951cf6e7c//stdout
#SBATCH --error=/p/scratch/cvsk25/unicore-jobs//a29165bb-da5a-424d-88ac-a4b951cf6e7c//stderr
#SBATCH --workdir=/p/scratch/cvsk25/unicore-jobs//a29165bb-da5a-424d-88ac-a4b951cf6e7c/
umask 77
/p/scratch/cvsk25/unicore-jobs//a29165bb-da5a-424d-88ac-a4b951cf6e7c//UNICORE_Job_1558016359193

So I'm a bit confused. If we are able to use 68 but are only using 48, we are 'wasting' resources, right? (With 10 nodes, that is 480 tasks instead of 680.)

You can see that old job in /p/scratch/cvsk25/unicore-jobs/a29165bb-da5a-424d-88ac-a4b951cf6e7c

BerndSchuller commented 5 years ago

True, the current resource handling mechanism is not well suited to heterogeneous systems like Jureca (48 cores per node) / Jureca Booster (68 cores per node). This will improve with the next major release of UNICORE. As a workaround for now, we can increase the limit to 68 (booster).

antonelepfl commented 5 years ago

If you prefer, I can use 48 for the time being, but then please let us know the status of the new release so we can put it back to 68 for the booster.

BerndSchuller commented 5 years ago

The limit is now set to 68.

antonelepfl commented 5 years ago

OK, thank you. I would suggest keeping this issue open until the new version is deployed. What do you think, guys?

BerndSchuller commented 5 years ago

I'd close it, since the problem at hand is solved.

antonelepfl commented 5 years ago

OK, thank you.