Azure / cyclecloud-slurm

Azure CycleCloud project to enable users to create, configure, and use Slurm HPC clusters.
MIT License
58 stars 43 forks source link

return_to_idle.sh fails when nodelist longer than 500 characters #127

Closed trapnine closed 5 months ago

trapnine commented 1 year ago

return_to_idle.sh gets a node list like this: $(sinfo -O nodelist:500,statelong -h | grep down~ | cut -d" " -f1)

This format specifier mangles the node list twice if nodelist is longer than 500 characters. This makes the scontrol drain command fail until the list gets shorter.

My suggested fix is to use this format specifier instead: $(sinfo -o '%N %T' -h | grep down~ | cut -d" " -f1)

There are two problems with -O: 1) -O with a length specifier prints a fixed-width field (not a max). If shorter than n, it fills with whitespace, which works OK, but if length of the field is over n, it will blindly truncate the node list, usually mangling it. 2) -O uses the provided separator character if the list is truncated. The script specifies a comma, but return_to_idle.sh splits the line on space. Thus, if the nodelist is truncated there is no space char to split on and the state name gets appended to the node list, mangling it further.

-o works better because: 1) It's not necessary to specify a minimum field width, since -o doesn't truncate to 20 chars by default. -o will use "whatever is needed to print the information". 2) the provided separator character always gets used, so there is always a space character between the fields for cut to split on.

Caveat: After this change there is a new risk of the command being too long if a very large nodelist is passed to scontrol. That limit is likely quite big on most modern systems, but can be set smaller. On our CentOS 7 nodes, getconf ARG_MAX returns 4611686018427387903, which seems unreasonably sufficient.

This bug was observed in 2.7.0.

ryanhamel commented 1 year ago

Target for release is 3.0.4

xpillons commented 8 months ago

@ryanhamel it's not in 3.0.6, maybe something you miss ?

xpillons commented 6 months ago

@aditigaur4 for reference

ryanhamel commented 5 months ago

3.0.7 fixes this by using a new azslurm command, return_to_idle, which takes no arguments and does not suffer from this issue.