Algebraic-Programming / LPF

A minimal communication layer for the implementation of immortal algorithms and for facilitating their broad use.
Apache License 2.0

Passing post-install for `mpimsg` engine checks on Ubuntu 22.04 #6

Open alberto-scolari opened 1 year ago

alberto-scolari commented 1 year ago

LPF mpimsg engine currently does not pass post-install checks on Ubuntu 22.04 for several reasons:

  1. the initialization routine breaks
  2. the post-install debug checks hang
  3. the detection of MPI with Clang fails

This issue tracks these problems. I pushed several workarounds on the branch associated with this issue, but some of them deserve more thought than I gave them.

The following paragraphs detail each problem together with its current workaround.

1. the initialization routine breaks

The mpimsg engine is initialized in the routine mpi_initializer in src/MPI/init.cpp, which expects int argc, char ** argv as parameters to be passed to MPI_Init_thread(). mpi_initializer is invoked when the library is loaded via LD_PRELOAD. However, receiving argc/argv in such an initializer is a non-standard, undocumented feature of the Linux dynamic linker, probably removed in recent versions: the variables now hold random values, so related assertions may fail and any access to argv may result in a segfault.

Current solution: do not use argc/argv; the initialization routine now takes no inputs. Pros: problem solved in a robust way, no need to re-think the solution. Cons: cannot pass implementation-specific parameters to MPI initialization (not used in practice).
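A minimal sketch of what an argument-free initializer could look like. It assumes the routine is registered through the GCC/Clang `__attribute__((constructor))` extension, which may differ from how LPF actually hooks it. `fake_mpi_init_thread` is a hypothetical stand-in for `MPI_Init_thread()` so the example builds without MPI; since MPI-2, the real call accepts null pointers in place of argc and argv.

```cpp
#include <cassert>

// Hypothetical stand-in for MPI_Init_thread(), so this sketch builds without MPI.
// It only records how it was called.
static bool g_initialized = false;
static bool g_got_null_args = false;

static int fake_mpi_init_thread(int *argc, char ***argv, int required, int *provided) {
    g_got_null_args = (argc == nullptr && argv == nullptr);
    *provided = required;   // pretend the requested thread level is granted
    g_initialized = true;
    return 0;               // 0 plays the role of MPI_SUCCESS
}

// Runs when the binary or shared library is loaded (GCC/Clang extension).
// It takes no parameters, so it never reads whatever the dynamic linker
// may or may not place in argc/argv.
__attribute__((constructor))
static void mpi_initializer_sketch() {
    int provided = 0;
    const int rc = fake_mpi_init_thread(nullptr, nullptr, /*required=*/0, &provided);
    assert(rc == 0);
    (void)rc;               // silence unused-variable warnings when NDEBUG is set
}

// Small accessors so callers can inspect what the initializer did.
bool initializer_ran() { return g_initialized; }
bool initializer_used_null_args() { return g_got_null_args; }
```

With a real MPI library, the body would call `MPI_Init_thread(nullptr, nullptr, required, &provided)` instead of the stand-in.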

2. the post-install debug checks hang

The post-install check at post-install/post-install-test.cmake.in, line 96, hangs with engine = mpimsg and any nprocs greater than 1 (I manually tried 1, which works, but no larger value does). The MPI-spawned processes hang. This is due to the call to std::abort() at src/debug/core.cpp, line 939. Some process or library on Ubuntu 22.04 (probably MPI itself, version 4.0 for Ubuntu 22.04) installs a signal handler for SIGABRT (I checked this in the test), which causes the application to hang when the debug library calls std::abort().

Current solution: skip post-install debug checks. It is clearly just a hack. A more refined solution would be to have an actual lpf_abort() routine calling MPI_Abort(), but I don't know whether it is in the spirit of LPF. Another possible solution is to remove the calls to std::abort() and change the test to handle failures properly. I am not an LPF expert, so I have no preference, and there may be better solutions. Finally, one could intercept SIGABRT in each backend to handle failures and call MPI_Abort(), although this may conflict with the underlying MPI implementation.
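The lpf_abort() idea could be sketched roughly as follows. All names here (`abort_hook_t`, `set_abort_hook`, `lpf_abort_sketch`) are hypothetical and not part of LPF; the assumption is that each engine registers its own abort function, and that the MPI engines would register one that calls `MPI_Abort()`.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical per-engine abort hook. An MPI-based engine could register a
// function that calls MPI_Abort(MPI_COMM_WORLD, code) here.
using abort_hook_t = void (*)(int);

static abort_hook_t g_abort_hook = nullptr;

// Registers the engine-specific hook; a null hook is a programming error.
void set_abort_hook(abort_hook_t hook) {
    assert(hook != nullptr);
    g_abort_hook = hook;
}

// Hypothetical lpf_abort(): prefer the engine hook, and fall back to
// std::abort() if no hook is registered or the hook unexpectedly returns.
void lpf_abort_sketch(int code) {
    if (g_abort_hook != nullptr) {
        g_abort_hook(code);   // a real engine hook is expected not to return
    }
    std::abort();
}
```

The debug layer would then call `lpf_abort_sketch(code)` wherever it currently calls `std::abort()`, so that the termination path goes through MPI rather than through SIGABRT.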

3. the detection of MPI with Clang fails

During MPI detection (find_package(MPI) in cmake/mpi.cmake), CMake cannot find MPI if the configured compiler is Clang. Probably, the compilation of some internal tests fails due to compiler-specific options that CMake parses. For example, MPICH 4.0 in Ubuntu 22.04 has -flto=auto -ffat-lto-objects in the variable MPI_C_COMPILE_OPTIONS to enable Link-Time Optimization (LTO). These options cause Clang to fail, since the LTO information in the MPI binaries was generated by gcc.

Current solution: if the compiler is Clang, disable LTO during detection via MPI_COMPILER_FLAGS="-fno-lto", which is appended at the end of the internal compiler definitions. Pros: binaries can now also be built with Clang. Cons: may cause performance degradation (probably small); implicitly assumes MPI to be built with gcc. A robust solution may be very complex and may depend on CMake detection logic.

anyzelman commented 1 year ago

For item two, discussion reveals that the debug layer should probably throw exceptions instead, which are then caught and returned as errors to the calling exec or hook.
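A rough sketch of that exception-based approach, under stated assumptions: `sketch_err_t`, `debug_check_failure`, `debug_check` and `exec_sketch` are illustrative names only, not LPF's actual types or entry points. The debug check throws instead of aborting, and the exec-like boundary converts the exception into an error code.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical error codes, standing in for the library's real error type.
enum sketch_err_t { SKETCH_SUCCESS = 0, SKETCH_ERR_FATAL = 1 };

// Hypothetical exception thrown by debug-layer checks instead of std::abort().
struct debug_check_failure : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// A debug-layer check: throws on failure rather than raising SIGABRT.
void debug_check(bool ok, const std::string &msg) {
    if (!ok) {
        throw debug_check_failure(msg);
    }
}

// Stand-in for an exec/hook entry point: runs the user function and turns
// any escaping exception into an error code, so nothing propagates past
// the library boundary.
template <typename Spmd>
sketch_err_t exec_sketch(Spmd spmd) noexcept {
    try {
        spmd();
        return SKETCH_SUCCESS;
    } catch (const debug_check_failure &) {
        return SKETCH_ERR_FATAL;   // a failed debug check
    } catch (...) {
        return SKETCH_ERR_FATAL;   // any other failure
    }
}
```

For example, `exec_sketch([] { debug_check(false, "bad argument"); })` returns `SKETCH_ERR_FATAL` instead of terminating the process, which lets the post-install test inspect the return value.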

@alberto-scolari indicated he would like to clean up the MR further so we may consider this in draft state. Please ping here when the PR is ready for review.