JuliaParallel / MPI.jl

MPI wrappers for Julia
https://juliaparallel.org/MPI.jl/
The Unlicense

`mpiexecjl` doesn't handle juliaup with non-default channel #857

Open giordano opened 1 month ago

giordano commented 1 month ago

I have a system where the test introduced in #834 is failing:

MPIPreferences:
  binary:  OpenMPI_jll
  abi:     OpenMPI

Package versions
  MPI.jl:             0.20.20
  MPIPreferences.jl:  0.1.11
  OpenMPI_jll:        4.1.6+0

Library information:
  libmpi:  /home/cceamgi/.julia/artifacts/58dcf187642cdfbafb3581993ca3d8de565acc78/lib/libmpi.so
  libmpi dlpath:  /home/cceamgi/.julia/artifacts/58dcf187642cdfbafb3581993ca3d8de565acc78/lib/libmpi.so
  MPI version:  3.1.0
  Library version:
    Open MPI v4.1.6, package: Open MPI sabae@amdci7.julia.csail.mit.edu Distribution, ident: 4.1.6, repo rev: v4.1.6, Sep 30, 2023
Hello world, I am rank 3 of 4
Hello world, I am rank 2 of 4
Hello world, I am rank 0 of 4
Hello world, I am rank 1 of 4
mpiexecjl: Test Failed at /home/cceamgi/.julia/packages/MPI/is7GN/test/mpiexecjl.jl:41
  Expression: p.exitcode == exit_code
   Evaluated: 1 == 10

I need to investigate what's wrong here. For the record, this isn't specific to OpenMPI_jll: I see the same failure with MPICH_jll. I wonder whether the shell is the problem; on this system /bin/sh is

$ /bin/sh --version
GNU bash, version 5.1.8(1)-release (aarch64-redhat-linux-gnu)
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

giordano commented 1 month ago

Ah, the problem is that Julia doesn't start at all. I can see errors like

ERROR: Unable to load dependent library /data/cceamgi/julia-depot/juliaup/julia-nightly/bin/../lib/julia/libjulia-internal.so.1.12
Message:/data/cceamgi/julia-depot/juliaup/julia-nightly/bin/../lib/julia/libjulia-internal.so.1.12: undefined symbol: unw_ensure_tls
ERROR: Unable to load dependent library /data/cceamgi/julia-depot/juliaup/julia-nightly/bin/../lib/julia/libjulia-internal.so.1.12
ERROR: Unable to load dependent library /data/cceamgi/julia-depot/juliaup/julia-nightly/bin/../lib/julia/libjulia-internal.so.1.12
Message:/data/cceamgi/julia-depot/juliaup/julia-nightly/bin/../lib/julia/libjulia-internal.so.1.12: undefined symbol: unw_ensure_tls
ERROR: Unable to load dependent library /data/cceamgi/julia-depot/juliaup/julia-nightly/bin/../lib/julia/libjulia-internal.so.1.12
Message:/data/cceamgi/julia-depot/juliaup/julia-nightly/bin/../lib/julia/libjulia-internal.so.1.12: undefined symbol: unw_ensure_tls
Message:/data/cceamgi/julia-depot/juliaup/julia-nightly/bin/../lib/julia/libjulia-internal.so.1.12: undefined symbol: unw_ensure_tls

giordano commented 3 weeks ago

On a different system I'm seeing the same outside of tests with Julia nightly:

$ ~/.julia/bin/mpiexecjl -np 1 --project julia +nightly -e ''
ERROR: Unable to load dependent library /home/mose/.julia/juliaup/julia-nightly/bin/../lib/julia/libjulia-internal.so.1.12
Message:/home/mose/.julia/juliaup/julia-nightly/bin/../lib/julia/libjulia-internal.so.1.12: undefined symbol: unw_ensure_tls
┌ Error: The MPI process failed
│   proc = Process(setenv(`/home/mose/.julia/artifacts/62773cea33514bc12f48f228effadcb2ead6184a/bin/mpiexec -np 1 julia +nightly -e ''`,[...]), ProcessExited(1))
└ @ Main none:7

I suspect this is a real issue with Julia v1.12

giordano commented 3 weeks ago

Ah, I understand the issue now, and I understand why JULIA_BINDIR fixed it in #858. TL;DR: the problem arises with mpiexecjl when using juliaup with a channel different from the default one.

In https://github.com/JuliaParallel/MPI.jl/blob/780aaa0fdb768713a329659338a9c9cde23c41a8/bin/mpiexecjl#L54-L58 we run julia assuming it's in PATH (unless JULIA_BINDIR is set). But when I run mpiexecjl ... julia +nightly, the script at https://github.com/JuliaParallel/MPI.jl/blob/780aaa0fdb768713a329659338a9c9cde23c41a8/bin/mpiexecjl#L61-L70 is executed by the default juliaup channel and sets up LD_LIBRARY_PATH for that version of Julia. This breaks down when we then try to start the other julia process: if that's a different version of Julia, we end up mixing libraries from different versions of Julia. This also explains why we don't see the problem here in CI: we don't use juliaup (let alone mix different channels).

I'm really not sure we have a good solution for this besides setting JULIA_BINDIR 🤔 Should we parse julia +channel specially in the script to deal with this? That'd complicate argument parsing quite a bit.
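
To make the "parse julia +channel" idea concrete, here is a rough sketch of what the detection step could look like in POSIX sh. The function name find_juliaup_channel is hypothetical and is not part of mpiexecjl. It only reports the channel name. It does not resolve the matching binary directory or handle every way a user might spell the julia command.

```sh
#!/bin/sh
# Hypothetical helper, NOT part of mpiexecjl: scan an argument list for a
# `julia` command word immediately followed by a juliaup `+channel` selector
# and print the channel name. Returns 1 if no such pair is found.
# Note: POSIX sh has no `local`, so `prev` and `arg` are global variables.
find_juliaup_channel() {
    prev=""
    for arg in "$@"; do
        case "$prev" in
            julia|*/julia)
                case "$arg" in
                    +?*)
                        # Drop the first character (the "+") and print the rest.
                        printf '%s\n' "${arg#?}"
                        return 0
                        ;;
                esac
                ;;
        esac
        prev="$arg"
    done
    return 1
}

# Same argument list as the failing invocation reported earlier in this thread.
find_juliaup_channel -np 1 --project julia +nightly -e ''   # prints "nightly"
```

With the channel in hand, the script could in principle ask that specific Julia for its own Sys.BINDIR and use it in place of the default one, which is effectively what setting JULIA_BINDIR by hand does today. Whether that is worth the extra argument-parsing complexity is the open question.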