Open alberto-scolari opened 1 year ago
For item two, the discussion suggests that the debug layer should probably throw exceptions instead, which are then caught and returned to the calling `exec` or `hook`.
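To make the exception idea for item two concrete, here is a minimal sketch of how it could look. It is not taken from the LPF sources: `err_t`, `ERR_SUCCESS`, `ERR_FATAL`, `debug_violation`, `debug_check_put` and `guarded_exec` are all hypothetical names chosen for illustration, and the real debug layer and entry points may be structured quite differently.

```cpp
#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for the library's error codes (not the real LPF definitions).
enum err_t { ERR_SUCCESS = 0, ERR_FATAL = 1 };

// Hypothetical exception that the debug layer would throw instead of calling std::abort().
class debug_violation : public std::runtime_error {
public:
    explicit debug_violation(const std::string& what) : std::runtime_error(what) {}
};

// Example debug-layer check: reports a contract violation by throwing.
void debug_check_put(std::size_t offset, std::size_t size, std::size_t capacity) {
    if (offset > capacity || size > capacity - offset) {
        throw debug_violation("put writes past the end of the registered memory");
    }
}

// Hypothetical exec/hook boundary: runs the user function and converts a
// debug violation into an error code. Other exception types still propagate.
template <typename Spmd>
err_t guarded_exec(Spmd spmd) {
    try {
        spmd();
        return ERR_SUCCESS;
    } catch (const debug_violation& e) {
        std::cerr << "debug layer: " << e.what() << '\n';
        return ERR_FATAL;
    }
}
```

With this shape, a call such as `guarded_exec([] { debug_check_put(12, 8, 16); })` would return `ERR_FATAL` to the caller rather than raising `SIGABRT`, so the post-install test could check the returned code instead of relying on the process being killed.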
@alberto-scolari indicated he would like to clean up the MR further so we may consider this in draft state. Please ping here when the PR is ready for review.
LPF `mpimsg` engine currently does not pass post-install checks on Ubuntu 22.04, for several reasons. This issue tracks these problems. I pushed several workarounds for them on the branch associated with this issue, but some of them deserve more thought than I gave them.

In the following paragraphs I detail each problem together with its current workaround.
1. the initialization routine breaks

The `mpimsg` engine is initialized in the routine `mpi_initializer` in `src/MPI/init.cpp`, which expects `int argc, char ** argv` as parameters to be passed to `MPI_Init_thread()`. `mpi_initializer` is invoked during `LD_PRELOAD`. However, the stack initialization with `argc/argv` is a non-standard, undocumented feature of the Linux dynamic linker, probably removed in recent versions: the variables hold random values, so related assertions may fail and any access to `argv` may result in a segfault.

Current solution: do not use `argc/argv`; the initialization routine now takes no inputs. Pros: problem solved in a robust way, no need to re-think the solution. Cons: cannot pass implementation-specific parameters to MPI initialization (not used in practice).

2. the post-install debug checks hang
The post-install check at `post-install/post-install-test.cmake.in`, line 96, hangs with `engine`=`mpimsg` and any `nprocs` greater than 1 (I manually tried 1, which works, but any bigger value does not). The MPI-spawned processes hang. This is due to the call to `std::abort()` at `src/debug/core.cpp`, line 939. Some process/library of Ubuntu 22.04 (probably MPI itself, version 4.0 for Ubuntu 22.04) installs a signal handler for `SIGABRT` (I checked it in the test), which causes the application to hang when the debug library calls `std::abort()`.

Current solution: skip post-install debug checks. It is clearly just a hack. A more refined solution would be to have an actual `lpf_abort()` routine calling `MPI_Abort()`, but I don't know whether it is in the spirit of LPF. Another possible solution is to remove the calls to `std::abort()` and change the test to properly handle failures. I am not an LPF expert, so I have no preference, and there may be better solutions. Finally, one can intercept `SIGABRT` in each backend to handle failures and call `MPI_Abort()`, although this may conflict with the underlying MPI implementation.

3. detection of MPI with Clang fails
During MPI detection (`find_package(MPI)` in cmake/mpi.cmake) CMake cannot find it if the compiler passed is Clang. Probably, the compilation of some internal tests fails due to some compiler-specific options that CMake parses. For example, MPICH 4.0 in Ubuntu 22.04 has `-flto=auto -ffat-lto-objects` in the variable `MPI_C_COMPILE_OPTIONS` to enable Link-Time Optimization (LTO). This option causes Clang to fail, since the LTO information of the MPI binary is built with gcc.

Current solution: if the compiler is Clang, disable LTO during detection via `MPI_COMPILER_FLAGS="-fno-lto"`, which is appended at the end of internal compiler definitions. Pros: binaries are now built also with Clang. Cons: may cause performance degradation (probably small); implicitly assumes MPI to be built with gcc. A robust solution may be very complex and may depend on CMake detection logic.