ATS - Automated Testing System - is an open-source, Python-based tool for automating the running of an application's tests across a broad range of high-performance computers.
Removed use of "--overlap" with Slurm 20.11.7.
Moved back to using "--exlusive" plus other
options. In my testing, we appear to be able
to use these options to achieve load balancing
using slurm again. This needs vetted by all
end users of ATS.
The options used to achieve load balancing generally include "--exclusive" and "--cpus-per-task", and their values vary depending on the test type. "--cpus-per-task", for instance, will be greater than 1 for threaded tests; it is set by the 'nt' test option, which is also used to set the OMP_NUM_THREADS env var for the test.
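A minimal sketch of how the per-test options might be assembled (the function and variable names here are hypothetical, not the actual ATS internals, and "--ntasks" is shown only as one plausible way tasks could be requested):

    import os

    def build_srun_options(np, nt):
        # np = number of MPI tasks; nt = threads per task (the 'nt' test option)
        options = [
            "--exclusive",
            "--ntasks=%d" % np,
            "--cpus-per-task=%d" % nt,  # > 1 for threaded tests
        ]
        # 'nt' also drives OMP_NUM_THREADS in the test's environment.
        env = dict(os.environ)
        env["OMP_NUM_THREADS"] = str(nt)
        return options, env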
Also refactored and slimmed down the code: condensed logic and removed a number of 'if' conditionals around the return of the srun command. More condensing is still possible, but that will wait for the next update.
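For illustration only (not the actual ATS code), the condensed pattern keeps every option variable holding a usable string, so the command is built in one pass instead of through a chain of 'if' branches each with its own return:

    def srun_command(srun_options, executable):
        # Every entry in srun_options is always a valid string,
        # so a single join builds the whole command line.
        return "srun " + " ".join(srun_options) + " " + executable

    # e.g. srun_command(["--exclusive", "--ntasks=4"], "./mytest")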
If on 'rzalastor', set npMax and npMaxH to 20.
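A minimal sketch of that host check (the hostname test and variable placement are assumptions, not verified against the ATS source):

    import socket

    # Sketch: per-node process caps on the 'rzalastor' machine.
    if socket.gethostname().startswith("rzalastor"):
        npMax = 20
        npMaxH = 20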
Uniformly used '--comment=somecomment' for srun options which are not used, which gets rid of the other placeholders. We would like to drop these entirely in the future, but the construction of the returned srun command fails if these are empty strings, so use comments for now. This is not totally without value, as it shows that we considered an option, such as '--unbuffered', but chose not to apply it.
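To illustrate (sketch only; the variable names are invented for this example), the placeholder keeps every option slot populated, since per the note above the command construction fails on empty strings:

    # Sketch: an option we considered but chose not to apply keeps a
    # harmless placeholder instead of an empty string.
    srun_unbuffered = "--comment=unbuffered_not_used"
    srun_exclusive = "--exclusive"

    # Every variable is a valid srun argument, so the command can be
    # assembled without special-casing unused options.
    cmd = " ".join(["srun", srun_exclusive, srun_unbuffered, "./mytest"])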
Uniformly used "srun" for these srun options.
Prior to this some used the prefix "a" and others
"the_". This change that we are setting these
strings for use on the srun command line.
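For example (variable names hypothetical), the convention now reads:

    # Before: mixed prefixes such as a_exclusive or the_cpus_per_task.
    # After: a uniform 'srun_' prefix marks these as srun command-line strings.
    srun_exclusive = "--exclusive"
    srun_cpus_per_task = "--cpus-per-task=4"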
Removed a lot of old comments. They cluttered the code, and with all the Slurm updates they may no longer be applicable. If needed, they can be reviewed by checking out an older branch of the code.
Removed the 'ATS WARNING:' message. In some testing, the hang does not appear to happen with Slurm 20.11.7. Let's remove the warning but monitor the situation. As a reminder, the hang would occur in this simple scenario:
1) Pre-allocate 1 node (with salloc, for instance).
2) Run a job which utilizes all the cores on the node.
This would hang in prior Slurm versions, as the running of the 'ats' script was itself seen as utilizing a core, and thus when ATS submitted a job which needed all the cores, Slurm would accept but never run the job.
Removed use of "--overlap" with Slurm 20.11.7. Moved back to using "--exlusive" plus other options. In my testing, we appear to be able to use these options to achieve load balancing using slurm again. This needs vetted by all end users of ATS.
The options used to achieve load balancing in general include:
And the values vary depending on the test type. --cpus-per-task for instance will be >1 for threaded tests, this will be set by the 'nt' test option, which is also used to set the OMP_NUM_THREADS env var for the test.
Also, refactored and slimmed down the code. Condensed logic, removed a bunch of 'if' conditionals for the return of the srun command.
More condensing is still possible, but will save that for next update.
If on 'rzalastor' set npMax and npMaxH to 20.
Uniformly used '--comment=somecomment' for srun options which are not used. Gets rid of use of other place holders. Would like to get rid of these in the future, but the construction of return of the srun command fails if these are empty strings. So use comments for now. Not totally without value. as it shows that we considered the option, such as --unbuffered for instance, but chose not to apply it.
Uniformly used "srun" for these srun options. Prior to this some used the prefix "a" and others "the_". This change that we are setting these strings for use on the srun command line.
Removed a lot of old comments. They cluttered the code, and with all the slurm updates, they may not be applicable at this time. They may be reviewed if needed by checking out an older branch of the code.
Remove 'ATS WARNING:" In some testing, the hang does not appear to be happening with slurm 20.11.7. Let's remove the warning, but monitor the situation. As a reminder, the hang would occur in this simple scenario:
1) Pre-allocate 1 node (salloc for instance) 2) Run a job which utilizes all the cores on the node.
This would hang in prior slurm versions, as the running of the 'ats' script was seen as utilizing a core, and thus when ATS submitted a job which needs all the cores, slurm would accept but never run the job.