dominic-chang opened this issue 1 year ago
Hi -- can you post the output of versioninfo(), please?
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 16 × Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
Threads: 4 on 16 virtual cores
Environment:
LD_LIBRARY_PATH = /n/sw/helmod-rocky8/apps/Comp/intel/23.2.0-fasrc01/openmpi/4.1.5-fasrc03/lib64:/n/sw/intel-oneapi-2023.2/tbb/2023.2.0/lib/intel64:/n/sw/intel-oneapi-2023.2/mkl/2023.2.0/lib/intel64:/n/sw/intel-oneapi-2023.2/compiler/2023.2.0/linux/compiler/lib/intel64:/n/sw/intel-oneapi-2023.2/compiler/2023.2.0/linux/lib:/n/sw/helmod-rocky8/apps/Core/gcc/13.2.0-fasrc01/lib64:/n/sw/helmod-rocky8/apps/Core/mpc/1.3.1-fasrc02/lib64:/n/sw/helmod-rocky8/apps/Core/mpfr/4.2.1-fasrc01/lib64:/n/sw/helmod-rocky8/apps/Core/gmp/6.3.0-fasrc01/lib64:/usr/local/lib:/n/sw/helmod-rocky8/apps/Core/cuda/12.2.0-fasrc01/cuda/extras/CUPTI/lib64:/n/sw/helmod-rocky8/apps/Core/cuda/12.2.0-fasrc01/cuda/lib64:/n/sw/helmod-rocky8/apps/Core/cuda/12.2.0-fasrc01/cuda/lib::
JULIA_NUM_THREADS = 4
Thank you. The first thing to try here is to avoid using the system MPI and see if the example runs then. You can do this by not running setup_mpi, but if you already did, you can manually delete the LocalPreferences.toml file that was created where your Project.toml lives. Let us know how it goes.
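For reference, a minimal sketch of that cleanup step in Julia (assuming the LocalPreferences.toml sits next to the active project's Project.toml, which is where setup_mpi writes it; restart Julia afterwards so the preference change takes effect):

project_dir = dirname(Base.active_project())   # directory holding Project.toml
prefs_file = joinpath(project_dir, "LocalPreferences.toml")
# Remove the file so Pigeons falls back to its bundled MPI
# instead of the system MPI configured by setup_mpi.
isfile(prefs_file) && rm(prefs_file)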
Thanks. I ended up using the GCC compiler instead, which resolved the segfault issue I was having.
I was having an issue with OpenMPI and another dependency, so I ended up switching back to the Intel compiler. This time I am using Intel MPI v2021.10.0. I deleted the LocalPreferences.toml and ran pigeons without setup_mpi, which resolved the segmentation fault issue, but the run still failed immediately. Here are the contents of info/stderr.txt:
match_arg (../../../../../src/pm/i_hydra/libhydra/arg/hydra_arg.c:91): unrecognized argument merge-stderr-to-stdout
HYD_arg_parse_array (../../../../../src/pm/i_hydra/libhydra/arg/hydra_arg.c:128): argument matching returned error
mpiexec_get_parameters (../../../../../src/pm/i_hydra/mpiexec/mpiexec_params.c:1359): error parsing input array
main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1893): error parsing parameters
I am running this example from the tutorials:
mpi_run = pigeons(
    target = toy_mvn_target(1000000),
    n_chains = 1000,
    checkpoint = true,
    on = MPI(
        n_mpi_processes = 1000,
        n_threads = 1))
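(Side note: more recent Pigeons releases appear to call this submission type MPIProcesses rather than MPI; that is the spelling used in the working example later in this thread.)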
I think this error occurs because -output-filename and -merge-stderr-to-stdout are not flags for Intel's version of mpiexec. The submission_script executes correctly if I replace these flags with their Intel counterparts.
Let me know what the corresponding flags are; that should be the basis of a relatively simple patch. The only missing piece is whether there is a robust way to detect that the mpiexec in use is Intel's.
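Not the eventual patch, just a hedged sketch of both pieces in Julia. The vendor can usually be read off mpiexec --version, whose first line names it (Intel MPI prints "Intel(R) MPI Library ...", Open MPI prints "mpiexec (OpenRTE/Open MPI) ..."). For the flags, Intel's Hydra-based launcher takes -outfile-pattern/-errfile-pattern where Open MPI takes --output-filename, and it has no direct --merge-stderr-to-stdout; pointing both patterns at the same file approximates the merge. The helper names and the exact mapping below are assumptions worth checking against mpiexec -help:

# Hypothetical helper, not Pigeons' actual code: guess the vendor from
# `mpiexec --version`, which both Open MPI and Intel MPI understand.
function mpiexec_is_intel(mpiexec_path::AbstractString = "mpiexec")
    version_output = try
        read(`$mpiexec_path --version`, String)
    catch
        return false   # binary missing or flag unsupported; assume non-Intel
    end
    return occursin("Intel", version_output)
end

# Hypothetical flag mapping (assumed counterparts; verify with mpiexec -help).
output_flags(dir, intel::Bool) = intel ?
    `-outfile-pattern $dir/stdout.txt -errfile-pattern $dir/stderr.txt` :
    `--output-filename $dir --merge-stderr-to-stdout`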
Hi, sorry for taking so long to get around to this. Here's a link to a pull request with the proper flags.
After this change, an example execution that works on the Purdue Anvil cluster is:

settings = Pigeons.MPISettings(;
    submission_system = :slurm,
    add_to_submission = ["#SBATCH -p wholenode"],
    environment_modules = ["intel/19.0.5.281", "impi/2019.5.281"]
)
Pigeons.setup_mpi(settings)
pt = Pigeons.pigeons(
    target = toy_mvn_target(10),
    record = [traces, round_trip, Pigeons.timing_extrema],
    checkpoint = true,
    n_chains = 200,
    on = Pigeons.MPIProcesses(
        n_mpi_processes = 100,
        walltime = "0-01:00:00",
        n_threads = 1,
        mpiexec_args = `--mpi=pmi2`
    ),
    n_rounds = 10
)
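(For context on mpiexec_args here: --mpi=pmi2 is a Slurm srun option selecting the PMI-2 process-management interface, and Pigeons appears to pass mpiexec_args straight through to the launch command, so the right value is cluster-specific; sites using PMIx would typically want --mpi=pmix instead.)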
Running the MPI example with OpenMPI v4.1.5 on Intel v2023.0.0 results in the following error: