Open ben-albrecht opened 7 years ago
Strictly speaking, `cnselect` returns a list of the available CPUs-per-node values rather than a list of the available nodes. But that's just a detail: we still pick the smallest of those, and that's not the right thing to do when we're already inside a WLM job that has a different value of CPUs-per-node, whether implicit or explicit.
We can get the list of nodes currently qsubbed with `cat ${PBS_NODEFILE}` (separated by newlines).
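A minimal sketch of that first step: turn the node file into a comma-separated list that other tools could take as input. The function name `allocated_nodes` is hypothetical, and the de-duplication is only a precaution on the assumption that some WLM variants may list a node more than once; nothing in this thread confirms that.

```shell
# Hypothetical helper: print the nodes listed in a PBS node file as a
# comma-separated list, skipping blank lines and repeated entries.
# Usage: allocated_nodes "$PBS_NODEFILE"
allocated_nodes() {
    awk 'NF && !seen[$1]++ { printf "%s%s", sep, $1; sep = "," }
         END { print "" }' "$1"
}
```

The order of first appearance is preserved, so the result stays stable from run to run for the same node file.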
It's not clear to me whether we can pass that node list to `cnselect` and get the available CPUs-per-node for that subset of nodes. Do you know if this should be possible, @gbtitus?
`cnselect` can't do that, but we could use other tools. For example:
    [gbt@crystal:] nodes=$(<$PBS_NODEFILE)
    [gbt@crystal:] xtprocadmin -n $(echo $nodes | tr ' ' ',') --attrs cpus
       NID    (HEX)    NODENAME     TYPE CPUS
      1012    0x3f4  c5-0c0s13n0  compute   72
      1013    0x3f5  c5-0c0s13n1  compute   72
So the launcher should gather the cpus-per-node for each node and then select the minimum for the `aprun -d` flag value.
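As a rough sketch of the "select the minimum" step, the function below reads output shaped like the `xtprocadmin` listing above and prints the smallest value in the CPUS column. The name `min_cpus_per_node` is hypothetical, and the parsing assumes data rows start with a numeric NID and end with the CPU count, as in that listing; the real launcher is not written in shell, so this only illustrates the logic.

```shell
# Hypothetical helper: read `xtprocadmin ... --attrs cpus` style output on
# stdin and print the smallest value found in the last (CPUS) column.
# Rows whose first field is not a number (such as the header) are ignored.
# Returns a non-zero status if no data rows were seen.
#
# Intended use, with $nodes set as in the session above:
#   xtprocadmin -n $(echo $nodes | tr ' ' ',') --attrs cpus | min_cpus_per_node
min_cpus_per_node() {
    awk '$1 ~ /^[0-9]+$/ && $NF ~ /^[0-9]+$/ {
             if (min == "" || $NF + 0 < min + 0) min = $NF
         }
         END { if (min == "") exit 1; print min }'
}
```

Failing when no rows are parsed gives the caller a chance to fall back to some other source rather than passing an empty `-d` value to `aprun`.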
One open question is whether `xtprocadmin` and `$PBS_NODEFILE` are compatible across the PBS Pro and Moab/Torque variants, as well as across XE and XC.
I checked the `qsub` man page on a corporate XE running PBS and a corporate XC running Moab/Torque, and both said that they set `PBS_NODEFILE` in the job's environment.
`xtprocadmin` is separate from the workload manager. It comes from the `sdb` module ("system database"?) rather than the WLM module. It is present on both systems I mentioned above.
Summary of Problem
The `CHPL_LAUNCHER=pbs-aprun` launcher can launch Chapel programs with the incorrect number of `cpus-per-pe` (specified by the `aprun -d` flag).
This problem likely stems from the `pbs-aprun` launcher relying on `cnselect` output to find the `cpus-per-pe`, without the context of the current interactive `qsub` allocation. In other words, the launcher is looking at a list of all the available nodes and picking one, rather than looking at the list of nodes currently allocated by `qsub`.
In my case, I was getting `-d24` instead of `-d44`.
I assume this bug impacts any Cray XC40 running the PBS workload manager.
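To illustrate the distinction, here is a small sketch of the check a fix might start from: whether we are already inside a `qsub` allocation. The function name `cpus_source` and its two output labels are hypothetical, and treating a set, readable `PBS_NODEFILE` as the signal is an assumption rather than something the launcher is known to do.

```shell
# Hypothetical helper: report where the cpus-per-pe value should come from.
# Prints "allocation" when a readable PBS node file is present (we are
# inside a qsub job), otherwise "system" (the system-wide cnselect query).
cpus_source() {
    if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
        echo allocation
    else
        echo system
    fi
}
```

With a check like this, the system-wide query would remain the fallback only when there is no allocation to inspect.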
Execution command: `./foo -nl 1 -v` (for an arbitrary Chapel program)
Configuration Information
`chpl --version`: `chpl Version 1.16.0 pre-release (2659cc6)`
`$CHPL_HOME/util/printchplenv --anonymize`:
`module list`: (Probably not necessary, but might as well)