Julia-Tempering / Pigeons.jl

Sampling from intractable distributions, with support for distributed and parallel methods
https://pigeons.run/dev/
GNU Affero General Public License v3.0

MPI runs fail with Intel v2023.0.0 #141

Open dominic-chang opened 1 year ago

dominic-chang commented 1 year ago

Running the MPI example:

using Pigeons
result = pigeons(
    target = toy_mvn_target(100), 
    checkpoint = true, 
    on = ChildProcess(
            n_local_mpi_processes = 4))

with OpenMPI v4.1.5 on Intel v2023.0.0 results in the following error:

ERROR: LoadError: AssertionError: all(1 .≤ to_global_indices .≤ e.load.n_global_indices)
Stacktrace:
  [1] transmit!(e::Pigeons.Entangler, source_data::Vector{Int64}, to_global_indices::Vector{Int64}, write_received_data_here::Vector{Int64})
    @ Pigeons ~/.julia/packages/Pigeons/nM8Nq/src/mpi_utils/Entangler.jl:136
  [2] transmit
    @ ~/.julia/packages/Pigeons/nM8Nq/src/mpi_utils/Entangler.jl:100 [inlined]
  [3] permuted_get(p::Pigeons.PermutedDistributedArray{Int64}, indices::Vector{Int64})
    @ Pigeons ~/.julia/packages/Pigeons/nM8Nq/src/mpi_utils/PermutedDistributedArray.jl:75
  [4] swap!(pair_swapper::Vector{Pigeons.ScaledPrecisionNormalLogPotential}, replicas::EntangledReplicas{...}, swap_graph::Pigeons.OddEven)
    @ Pigeons ~/.julia/packages/Pigeons/nM8Nq/src/swap/swap.jl:84
  [5] communicate!
    @ ~/.julia/packages/Pigeons/nM8Nq/src/pt/pigeons.jl:68 [inlined]
  [6] macro expansion
    @ ~/.julia/packages/Pigeons/nM8Nq/src/pt/pigeons.jl:51 [inlined]
  [7] macro expansion
    @ ./timing.jl:501 [inlined]
  [8] run_one_round!(pt::PT{...})
    @ Pigeons ~/.julia/packages/Pigeons/nM8Nq/src/pt/pigeons.jl:49
  [9] pigeons(pt::PT{...})
    @ Pigeons ~/.julia/packages/Pigeons/nM8Nq/src/pt/pigeons.jl:18
 [10] top-level scope
    @ ~/bamextension/results/all/2023-09-28-09-02-05-Gdtupdpb/.launch_script.jl:15
in expression starting at /n/home06/dochang/bamextension/results/all/2023-09-28-09-02-05-Gdtupdpb/.launch_script.jl:15
miguelbiron commented 1 year ago

Hi -- can you post the output of versioninfo() please?

dominic-chang commented 1 year ago

Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
  Threads: 4 on 16 virtual cores
Environment:
  LD_LIBRARY_PATH = /n/sw/helmod-rocky8/apps/Comp/intel/23.2.0-fasrc01/openmpi/4.1.5-fasrc03/lib64:/n/sw/intel-oneapi-2023.2/tbb/2023.2.0/lib/intel64:/n/sw/intel-oneapi-2023.2/mkl/2023.2.0/lib/intel64:/n/sw/intel-oneapi-2023.2/compiler/2023.2.0/linux/compiler/lib/intel64:/n/sw/intel-oneapi-2023.2/compiler/2023.2.0/linux/lib:/n/sw/helmod-rocky8/apps/Core/gcc/13.2.0-fasrc01/lib64:/n/sw/helmod-rocky8/apps/Core/mpc/1.3.1-fasrc02/lib64:/n/sw/helmod-rocky8/apps/Core/mpfr/4.2.1-fasrc01/lib64:/n/sw/helmod-rocky8/apps/Core/gmp/6.3.0-fasrc01/lib64:/usr/local/lib:/n/sw/helmod-rocky8/apps/Core/cuda/12.2.0-fasrc01/cuda/extras/CUPTI/lib64:/n/sw/helmod-rocky8/apps/Core/cuda/12.2.0-fasrc01/cuda/lib64:/n/sw/helmod-rocky8/apps/Core/cuda/12.2.0-fasrc01/cuda/lib::
  JULIA_NUM_THREADS = 4
miguelbiron commented 1 year ago

Thank you. The first thing to try here is to avoid using the system MPI and see if the example runs then. You can do this by not running setup_mpi, but if you already did, you can manually delete the LocalPreferences.toml file that was created where your Project.toml lives. Let us know how it goes.
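
For reference, a minimal sketch of doing the same thing programmatically via MPIPreferences (assuming the MPIPreferences package is installed in the active environment); this is just an alternative to deleting the file by hand:

using MPIPreferences

# Point MPI.jl back at the Julia-provided MPICH binary instead of the system MPI.
# This rewrites LocalPreferences.toml in the active project; restart Julia afterwards.
MPIPreferences.use_jll_binary("MPICH_jll")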

dominic-chang commented 1 year ago

Thanks. I ended up using the GCC compiler instead, which resolved the segfault issue I was having.

dominic-chang commented 1 year ago

I was having an issue with OpenMPI and another dependency, so I ended up switching back to the Intel compiler. This time I am using Intel MPI v2021.10.0. I deleted the LocalPreferences.toml and ran pigeons without setup_mpi, which resolved the segmentation fault, but the run still fails immediately. Here are the contents of info/stderr.txt:

 match_arg (../../../../../src/pm/i_hydra/libhydra/arg/hydra_arg.c:91): unrecognized argument merge-stderr-to-stdout
 HYD_arg_parse_array (../../../../../src/pm/i_hydra/libhydra/arg/hydra_arg.c:128): argument matching returned error
 mpiexec_get_parameters (../../../../../src/pm/i_hydra/mpiexec/mpiexec_params.c:1359): error parsing input array
 main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1893): error parsing parameters

I am running this example from the tutorials:

mpi_run = pigeons(
    target = toy_mvn_target(1000000), 
    n_chains = 1000,
    checkpoint = true,
    on = MPI(
        n_mpi_processes = 1000,
        n_threads = 1))

I think this error occurs because -output-filename and -merge-stderr-to-stdout are not recognized flags for Intel's version of mpiexec. The submission script executes correctly if I replace these flags with their Intel counterparts.

alexandrebouchard commented 1 year ago

Let me know what the corresponding flags are; that should be the basis of a relatively simple patch. The only missing piece is whether there is a robust way to detect that mpiexec is the Intel version.
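
One possible heuristic (a sketch only, not Pigeons' actual implementation; it assumes the launcher responds to --version, as MPICH-derived and Intel Hydra mpiexec builds generally do) is to inspect the launcher's version banner:

# Hypothetical helper: guess whether `mpiexec` comes from Intel MPI by
# checking its --version output. Falls back to false if the probe fails.
function mpiexec_is_intel(mpiexec_cmd::Cmd = `mpiexec`)
    banner = try
        read(`$mpiexec_cmd --version`, String)
    catch
        return false
    end
    return occursin("Intel(R) MPI", banner)
end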

dominic-chang commented 7 months ago

Hi, sorry for taking so long to get around to this. Here's a link to a pull request with the proper flags.

After this change, an example execution that works on the Purdue Anvil cluster is:

settings = Pigeons.MPISettings(;
    submission_system = :slurm,
    add_to_submission = [
        "#SBATCH -p wholenode",
    ],
    environment_modules = ["intel/19.0.5.281", "impi/2019.5.281"]
)
Pigeons.setup_mpi(settings)

pt = Pigeons.pigeons(
    target = toy_mvn_target(10),
    record = [traces, round_trip, Pigeons.timing_extrema],
    checkpoint = true,
    n_chains = 200,
    on = Pigeons.MPIProcesses(
        n_mpi_processes = 100,
        walltime = "0-01:00:00",
        n_threads = 1,
        mpiexec_args = `--mpi=pmi2`
    ),
    n_rounds = 10
)
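
For completeness, once such a run finishes, the checkpointed results can be loaded back into an interactive session. A sketch based on the standard Pigeons checkpoint workflow (enabled by checkpoint = true above); double-check the accessor names against your Pigeons version:

# Load the checkpoint written by the MPI run into the current process.
pt_local = Pigeons.load(pt)

# Extract samples from the target chain for downstream analysis.
samples = Pigeons.get_sample(pt_local)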