LLNL / ATS

ATS - Automated Testing System - is an open-source, Python-based tool for automating the running of tests of an application across a broad range of high performance computers.
BSD 3-Clause "New" or "Revised" License

Updates for better runs under Slurm 20.11.7 #38

Closed dawson6 closed 3 years ago

dawson6 commented 3 years ago

Removed use of "--overlap" with Slurm 20.11.7. Moved back to using "--exclusive" plus other options. In my testing, we appear to be able to use these options to achieve load balancing under Slurm again. This needs to be vetted by all end users of ATS.

The options used to achieve load balancing in general include:

--exclusive
--nodes
--distribution
--cpus-per-task
--ntasks

And the values vary depending on the test type. --cpus-per-task, for instance, will be >1 for threaded tests; it is set by the 'nt' test option, which is also used to set the OMP_NUM_THREADS env var for the test.
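As a rough sketch of how these options fit together per test, something like the following could assemble the srun command line (the flag names are real srun options; the function itself, its name, and its defaults are illustrative, not ATS's actual code):

```python
import os

def srun_options(ntasks, nt=1, nodes=1, distribution="block"):
    """Illustrative sketch (not ATS code) of assembling load-balancing
    srun options for one test.  'nt' is the threaded-test option, which
    also drives OMP_NUM_THREADS for the test's environment."""
    opts = [
        "--exclusive",                 # give the step its own CPUs
        f"--nodes={nodes}",
        f"--distribution={distribution}",
        f"--cpus-per-task={nt}",       # >1 for threaded ('nt') tests
        f"--ntasks={ntasks}",
    ]
    env = dict(os.environ, OMP_NUM_THREADS=str(nt))
    return ["srun"] + opts, env

cmd, env = srun_options(ntasks=4, nt=2)
print(" ".join(cmd))
# srun --exclusive --nodes=1 --distribution=block --cpus-per-task=2 --ntasks=4
```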

Also, refactored and slimmed down the code. Condensed logic and removed a bunch of 'if' conditionals around the return of the srun command.

More condensing is still possible, but will save that for next update.

If on 'rzalastor', set npMax and npMaxH to 20.

Uniformly used '--comment=somecomment' for srun options which are not used. This gets rid of the other placeholders. I would like to get rid of these entirely in the future, but the construction of the returned srun command fails if these are empty strings, so use comments for now. Not totally without value, as it shows that we considered an option (--unbuffered, for instance) but chose not to apply it.
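A minimal sketch of the failure mode (the placeholder text and helper name here are hypothetical, not taken from the ATS source): an empty-string placeholder leaves an empty token in the joined command line, while a harmless --comment no-op keeps the line well formed and documents the decision.

```python
PLACEHOLDER = "--comment=unbuffered_not_used"  # hypothetical placeholder text

def make_srun_line(extra_opts):
    """Sketch (not ATS's actual code): join option strings into the
    final srun command line.  An empty string produces an empty token
    (visible as a doubled space), which breaks the command."""
    return " ".join(["srun"] + extra_opts + ["./my_test"])

broken = make_srun_line([""])           # empty placeholder -> empty token
ok = make_srun_line([PLACEHOLDER])      # --comment placeholder is harmless
print(broken)  # note the doubled space: 'srun  ./my_test'
print(ok)
```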

Uniformly used the "srun" prefix for these srun option variables. Prior to this, some used the prefix "a" and others "the_". This change makes it clear that these strings are set for use on the srun command line.

Removed a lot of old comments. They cluttered the code, and with all the slurm updates, they may not be applicable at this time. They may be reviewed if needed by checking out an older branch of the code.

Removed the 'ATS WARNING:' message. In some testing, the hang does not appear to be happening with Slurm 20.11.7. Let's remove the warning but monitor the situation. As a reminder, the hang would occur in this simple scenario:

1) Pre-allocate 1 node (salloc, for instance).
2) Run a job which utilizes all the cores on the node.

This would hang in prior Slurm versions: the running 'ats' script was itself seen as utilizing a core, so when ATS submitted a job needing all the cores, Slurm would accept the job but never run it.
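The accounting behind the hang can be shown with a toy model (this is only an illustration of the bookkeeping described above, not Slurm's actual scheduler logic):

```python
def step_can_start(cores_per_node, cores_requested, driver_uses_core=True):
    """Toy model of the pre-20.11.7 hang: when the running 'ats' driver
    is counted as occupying one core, a step that asks for every core
    on the node is accepted but can never be scheduled."""
    available = cores_per_node - (1 if driver_uses_core else 0)
    return cores_requested <= available

# Old behavior: the driver is counted, so a full-node step never starts.
print(step_can_start(36, 36, driver_uses_core=True))   # hangs forever
# If the driver is not counted against the allocation, the step can run.
print(step_can_start(36, 36, driver_uses_core=False))
```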