esi-neuroscience / acme

Asynchronous Computing Made ESI
https://esi-acme.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Erroneous resource allocation in partition "E880" #60

Open pantaray opened 3 months ago

pantaray commented 3 months ago

Describe the problem

Allocating a distributed computing client with custom CPU/mem settings in the E880 partition does not actually allocate the specified resources.

Steps To Reproduce

from datetime import timedelta

from acme import ParallelMap, esi_cluster_setup

myClient = esi_cluster_setup(n_workers=3, cores_per_worker=16, mem_per_worker="5GB", partition="E880",
                             timeout=timedelta(minutes=10).total_seconds(),
                             n_workers_startup=1, verbose=True, debug=True)

This produces sbatch scripts that are missing the CPU specification and therefore fall back to the partition's default core allocation:

#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -p E880
#SBATCH -n 1
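
As a quick check, the generated header can be inspected without waiting for jobs to start; this sketch assumes esi_cluster_setup returns a distributed.Client whose .cluster attribute is the underlying dask-jobqueue SLURMCluster (whose job_script() method renders the sbatch header):

# Hedged diagnostic sketch: dump the sbatch header the cluster will submit.
print(myClient.cluster.job_script())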

Additional Information

Changing the underlying SLURMCluster call to

from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(queue="E880", cores=16, memory="8GB", processes=16)

fixes the problem:

scontrol show job 26317414
JobId=26317414 JobName=dask-worker
   ...
   NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=16 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,mem=8G,node=1,billing=16
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=16 MinMemoryNode=8G MinTmpDiskNode=0
   ...

A possible fix in esi_cluster_setup could be the following change at line 203:

processes_per_worker = kwargs.pop("processes_per_worker", cores_per_worker)
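
To illustrate, here is a minimal sketch of how that default would feed the underlying SLURMCluster call (the surrounding variable names are hypothetical, not the actual esi_cluster_setup source); with processes matching the requested cores, the generated header carries the CPU specification as in the scontrol output above:

from dask_jobqueue import SLURMCluster

cores_per_worker = 16
kwargs = {}  # remaining user-supplied keyword arguments

# Proposed default: one worker process per requested core
# instead of the current fixed fallback.
processes_per_worker = kwargs.pop("processes_per_worker", cores_per_worker)

cluster = SLURMCluster(queue="E880", cores=cores_per_worker, memory="8GB",
                       processes=processes_per_worker)
print(cluster.job_script())  # header should now include the CPU spec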
pantaray commented 3 months ago

An additional bug surfaced here as well: job_directives_skip removes any line from the generated sbatch script that contains the specified string as a substring, e.g.,

cluster = SLURMCluster(cores=32, memory="4000MB", processes=4, queue="E880", job_cpu=56, job_directives_skip=['--mem'])

print(cluster.job_script())

#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -p E880
#SBATCH -n 1
#SBATCH --cpus-per-task=56
#SBATCH -t 00:30:00

cluster = SLURMCluster(cores=32, memory="4000MB", processes=4, queue="E880", job_cpu=56, job_directives_skip=['t'])

print(cluster.job_script())

#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -p E880
#SBATCH -n 1
#SBATCH --mem=4G
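
Until the substring matching is tightened, one workaround is to pass the most specific directive text available so that only the intended line is dropped; a minimal sketch with the same toy parameters, assuming job_directives_skip performs plain substring matching as the output above suggests:

from dask_jobqueue import SLURMCluster

# Match the full walltime directive text instead of a single character
# that also occurs in "--cpus-per-task".
cluster = SLURMCluster(cores=32, memory="4000MB", processes=4, queue="E880",
                       job_cpu=56, job_directives_skip=['-t 00:30:00'])

print(cluster.job_script())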