Open ben-albrecht opened 7 years ago
The launcher code should be checking that the filesystem is shared between the launching node and compute node(s).
What would be a strategy for doing this?
@mppf says he's done a similar thing in an MPI context, checking a particular path for existence and identicalness on two compute nodes. Here we'd probably want to use logic similar to his, but check that the CWD on the launch node also exists, at the same path, on at least one compute node. This is admittedly a costly operation, but there's no better way to do it as far as I know. The current code is both insufficient/misleading (because it effectively asserts that a non-Lustre filesystem cannot be shared between launch and compute nodes) and potentially wrong (because it also implicitly asserts that all Lustre filesystems on the launch node are also present on the compute nodes).
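The existence-and-identicalness check described above could be sketched roughly as follows. This is only an illustration (the real launcher code is C/shell, and the function names here are hypothetical): the launch node drops a uniquely-named probe file in its CWD, and a small job run on a compute node checks that the same file exists at the same path with the same contents.

```python
import os
import uuid

def cwd_probe_token(cwd):
    """On the launch node: create a uniquely-named probe file in cwd.
    A compute node that can see this exact file, at this exact path,
    with these exact contents almost certainly shares the filesystem."""
    token = uuid.uuid4().hex
    path = os.path.join(cwd, ".chpl_fs_probe_" + token)
    with open(path, "w") as f:
        f.write(token)
    return path, token

def probe_matches(path, token):
    """On the compute node: does the probe exist with matching contents?"""
    try:
        with open(path) as f:
            return f.read() == token
    except OSError:
        return False
```

The launch node would clean up the probe file afterward; the cost is one extra mini-job launch onto a compute node, which is exactly the expense being worried about above.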
We could just remove the warning, since it's never been very useful, and it's a problem for all launchers, not just pbs-aprun.
Yes, it's actually a little surprising to me that pbs-aprun is the only launcher that does this. The thing it's warning about can also occur when the plain aprun launcher or the slurm-srun launcher is being used on a Cray system. Probably all other launchers have some similar path to failure as well.
Removing the warning is a possibility. But what it's trying to warn about is actually a problem when it occurs, and when it does, the resulting error message(s) aren't typically super-useful, so it would be nice if we had a reasonable way to warn accurately about this. I'm a little worried that launching a program onto a compute node to check that the CWD is legal there isn't "reasonable", though.
Could the launcher check, for example, that the SHA1 sum of the file to be launched on the remote end matches a SHA1 sum that was computed on the local end? I don't think we have to solve the general problem of detecting shared filesystems here...
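The checksum idea above amounts to hashing the binary once locally and once remotely and comparing the digests. A minimal sketch of the hashing half, in Python for illustration (the helper name is made up; the launcher would do this in C):

```python
import hashlib

def sha1_of(path, bufsize=1 << 16):
    """Compute the SHA1 digest of a file, reading in chunks so large
    executables don't have to fit in memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()
```

The launcher would compute this on the launch node, then run the same computation on a compute node and compare; a mismatch (or a failure to open the file at all) would indicate the filesystem isn't shared.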
The executable itself is actually the least of our problems, at least for aprun and pbs-aprun, because by default aprun transports the executable itself from the launch node to the computes. The CWD is a bigger deal, because both ALPS and (I'm pretty sure) slurm want to put you in the same path on the compute node(s) as you were in on the launch node. Also if the program references any files (not via redirection) those paths have to be valid on the compute nodes. Likewise any Chapel- or user-supplied shared objects have to be at the same path on the compute as they were when the program was linked. (Currently there aren't any Chapel-supplied shared objects, of course, but that may not always be the case.)
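Since the concern above is really about a whole set of paths (CWD, referenced files, shared objects) rather than just the executable, one generalization would be to gather the paths the program depends on and check, on a compute node, which of them are missing. A trivial sketch, with Python and the function name purely as illustration:

```python
import os

def missing_paths(paths):
    """Run on a compute node: return the subset of paths that do not
    exist at the same location there, i.e. candidates for 'not shared'."""
    return [p for p in paths if not os.path.exists(p)]
```

A non-empty result would justify exactly the kind of warning this issue is about, and could name the offending paths in the message.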
In case anyone else stumbles upon this, it can be temporarily worked around by launching Chapel programs with --quiet.
Since this issue impacts a Cray system that is used by external developers and Cray customers, often trying Cray hardware and/or Chapel out for the first time, I'd be in favor of dropping this warning for now, leaving the task of developing a more principled solution as a future TODO.
When using CHPL_LAUNCHER=pbs-aprun on a non-Lustre-based filesystem, launching a Chapel program yields the following warning, with a required response to continue:

From what I understand, the launcher code should be checking that the filesystem is shared between the launching node and compute node(s). Instead, it is just checking whether the filesystem is Lustre-based and jumping to a conclusion based on that. (@gbtitus might elaborate more on this)
Configuration Information
$CHPL_HOME/util/printchplenv --anonymize
Note: This is really more of a "finish implementing a partially implemented feature" issue, but I'm using type: Unimplemented Feature for lack of a better label at this time.