awslabs / palace

3D finite element solver for computational electromagnetics
https://awslabs.github.io/palace/dev
Apache License 2.0

Error with NODE_LIST on Slurm clusters #228

Closed drkrynstrng closed 2 months ago

drkrynstrng commented 2 months ago

On Slurm clusters, this line in the palace script generates an error because NODE_LIST=$SLURM_JOB_NODELIST holds a compact list of node names rather than the path to a file (unlike PBS_NODEFILE):

https://github.com/awslabs/palace/blob/6c180aa8a127f224a04b3fce69ef17b085fb14d6/scripts/palace#L168

For example, with two allocated nodes named d05-41 and d05-42, SLURM_JOB_NODELIST=d05-[41-42] and the error is:

cat: d05-[41-42]: No such file or directory
--------------------------------------------------------------------------
No nodes are available for this job, either due to a failure to
allocate nodes to the job, or allocated nodes being marked
as unavailable (e.g., down, rebooting, or a process attempting
to be relocated to another node when none are available).
--------------------------------------------------------------------------

The following command will generate a file containing node hostnames within a Slurm job:

scontrol show hostnames "$SLURM_JOB_NODELIST" > "$TMPDIR/nodefile.txt"

For example, nodefile.txt will have one hostname per line:

d05-41
d05-42
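
As a rough sketch of how a launcher script might handle both schedulers, the helper below returns a node file path in either case. The function name `resolve_nodefile` and the temporary file naming are hypothetical, and this is not necessarily how the palace script or any fix to it handles the problem; it only assumes that PBS_NODEFILE is a file path and that `scontrol show hostnames` expands SLURM_JOB_NODELIST as shown above.

```shell
# Hypothetical helper: print the path of a file with one hostname per line.
resolve_nodefile() {
    if [ -n "${PBS_NODEFILE:-}" ]; then
        # PBS already provides the path to a node file.
        printf '%s\n' "$PBS_NODEFILE"
    elif [ -n "${SLURM_JOB_NODELIST:-}" ]; then
        # Slurm provides a compact hostlist (e.g. d05-[41-42]), not a file,
        # so expand it into a temporary file first.
        _nodefile="$(mktemp "${TMPDIR:-/tmp}/nodefile.XXXXXX")" || return 1
        scontrol show hostnames "$SLURM_JOB_NODELIST" > "$_nodefile" || return 1
        printf '%s\n' "$_nodefile"
    else
        # Neither scheduler variable is set.
        return 1
    fi
}
```

A caller could then use `NODE_FILE="$(resolve_nodefile)"` wherever a PBS-style node file is expected, and should remove the temporary file when the job finishes.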

sebastiangrimberg commented 2 months ago

Hi @drkrynstrng, thanks for bringing this up. Can you try out https://github.com/awslabs/palace/pull/229?

drkrynstrng commented 2 months ago

Yes, that works for me. Thanks!