chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

pbs-aprun launcher - misleading warning on non-Lustre filesystem #6617

Open ben-albrecht opened 7 years ago

ben-albrecht commented 7 years ago

When using CHPL_LAUNCHER=pbs-aprun on a non-Lustre-based filesystem, launching a Chapel program yields the following warning, with a required response to continue:

Warning: Executing this program from a non-Lustre file system may cause it
to be unlaunchable, or for file I/O to be performed on a non-local file system.
Continue anyway? ([y]/n)

From what I understand, the launcher code should be checking that the filesystem is shared between the launching node and the compute node(s). Instead, it just checks whether the filesystem is Lustre-based and draws its conclusion from that alone. (@gbtitus might elaborate more on this.)
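To make the distinction concrete, here is a minimal Python sketch of what a type-only check amounts to. It is not the launcher's actual code (the launcher is not shown in this thread); the function names and the sample mount table are hypothetical, and the `/proc/mounts`-style input and the `lustre` type string are assumptions about a Linux launch node.

```python
def fs_type_for_path(path, mounts_text):
    """Return the filesystem type of the mount that contains `path`.

    `mounts_text` is assumed to be in /proc/mounts format:
    device, mount point, filesystem type, options, ...
    """
    best_mount, best_type = "", None
    wanted = path.rstrip("/") + "/"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Keep the longest mount point that is a prefix of the path.
        if wanted.startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


def is_lustre_path(path, mounts_text):
    """Type-only test: says nothing about whether compute nodes see `path`."""
    return fs_type_for_path(path, mounts_text) == "lustre"
```

With a mount table in which `/home` is, say, NFS-mounted on both launch and compute nodes, `is_lustre_path("/home/user", ...)` returns `False` even though the path is shared, which is the misleading case this issue describes.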

Configuration Information

CHPL_TARGET_PLATFORM: cray-xc
CHPL_TARGET_COMPILER: cray-prgenv-intel
CHPL_TARGET_ARCH: broadwell
CHPL_LOCALE_MODEL: flat
CHPL_COMM: ugni
CHPL_TASKS: qthreads
CHPL_LAUNCHER: pbs-aprun *
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_MAKE: gmake
CHPL_ATOMICS: intrinsics
  CHPL_NETWORK_ATOMICS: ugni
CHPL_GMP: gmp
CHPL_HWLOC: hwloc
CHPL_REGEXP: re2
CHPL_WIDE_POINTERS: struct
CHPL_AUX_FILESYS: none

Note: This is really more a matter of finishing a partially implemented feature, but I'm using type: Unimplemented Feature for lack of a better label at this time.

bradcray commented 7 years ago

> the launcher code should be checking that the filesystem is shared between launching node and compute node(s).

What would be a strategy for doing this?

gbtitus commented 7 years ago

@mppf says he's done a similar thing in an MPI context, checking that a particular path exists and is identical on two compute nodes. Here we'd probably want to use logic similar to his, but check that the CWD on the launch node also exists, at the same path, on at least one compute node. This is admittedly a costly operation, but there's not a better way to do this as far as I know. The current code is both insufficient/misleading (because it effectively asserts that a non-Lustre filesystem cannot be shared between launch and compute nodes) and potentially wrong (because it also implicitly asserts that all Lustre filesystems on the launch node are also present on the compute nodes).
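One way such a check could look is a probe file: the launch node writes a random token into the CWD, and a helper run on a compute node verifies that the same path holds the same token. This is only a sketch under stated assumptions, not @mppf's logic or anything in the launcher; the function names and the probe file name are hypothetical, it assumes the CWD is writable, and how the helper gets run on the compute node (the costly part) is left out.

```python
import os
import secrets


def write_probe(cwd):
    """Launch-node side: drop a token file in `cwd`, return (path, token)."""
    token = secrets.token_hex(16)
    probe_path = os.path.join(cwd, ".chpl_launch_probe_" + token[:8])
    with open(probe_path, "w") as f:
        f.write(token)
    return probe_path, token


def probe_matches(probe_path, token):
    """Compute-node side: True only if the same path holds the same token."""
    try:
        with open(probe_path) as f:
            return f.read() == token
    except OSError:
        # Path missing or unreadable here: the directory is not shared.
        return False
```

A random token guards against a stale or unrelated file that happens to exist at the same path on the compute node. The caller would remove the probe file once the check completes.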

ronawho commented 7 years ago

We could just remove the warning since it's never been very useful, and it's a problem for all launchers, not just pbs-aprun.

gbtitus commented 7 years ago

Yes, it's actually a little surprising to me that pbs-aprun is the only launcher that does this. The thing it's warning about can also occur when the plain aprun launcher or the slurm-srun launcher is being used on a Cray system. Probably all other launchers have some similar path to failure as well.

Removing the warning is a possibility. But what it's trying to warn about is actually a problem when it occurs. And when it does, the resulting error message(s) aren't typically super-useful, so it would be nice if we had a reasonable way to warn accurately about this. I'm a little worried that launching a program onto a compute node to check that the CWD is legal there isn't "reasonable", though.

mppf commented 7 years ago

Could the launcher check, for example, that the SHA1 sum of the file to be launched on the remote end matches a SHA1 sum that was computed on the local end? I don't think we have to solve the general problem of detecting shared filesystems here...

gbtitus commented 7 years ago

The executable itself is actually the least of our problems, at least for aprun and pbs-aprun, because by default aprun transports the executable itself from the launch node to the compute nodes. The CWD is a bigger deal, because both ALPS and (I'm pretty sure) slurm want to put you in the same path on the compute node(s) as you were in on the launch node. Also, if the program references any files (not via redirection), those paths have to be valid on the compute nodes. Likewise, any Chapel- or user-supplied shared objects have to be at the same path on the compute nodes as they were when the program was linked. (Currently there aren't any Chapel-supplied shared objects, of course, but that may not always be the case.)

ben-albrecht commented 7 years ago

In case anyone else stumbles upon this, it can be temporarily worked around by launching Chapel programs with --quiet.

ben-albrecht commented 7 years ago

Since this issue impacts a Cray system that is used by external developers and Cray customers, often trying Cray hardware and/or Chapel out for the first time, I'd be in favor of dropping this warning for now, leaving the task of developing a more principled solution as a future TODO.