Open heatherkellyucl opened 3 years ago
OpenMPI config looking promising!
configure:333774: checking if MCA component mtl:psm2 can compile
configure:333776: result: yes
The result before was:
configure:333631: WARNING: PSM2 needs to be version 11.2.173 or later. Disabling MTL.
configure:334237: checking if MCA component mtl:psm2 can compile
configure:334239: result: no
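To repeat this check on another build, the verdict can be pulled out of config.log directly. A minimal sketch, assuming the log is in the build directory; the function name mca_component_result is made up for illustration and is not part of OpenMPI:

```shell
# Hypothetical helper: print the last "can compile" verdict that configure
# recorded for one MCA component (e.g. mtl:psm2) in a config.log file.
mca_component_result() {
    component="$1"
    log="${2:-config.log}"   # assumption: config.log is in the current directory
    # grep -A1 keeps the "checking ..." line plus the line after it;
    # sed then keeps only the text following "result: ".
    grep -A1 "checking if MCA component ${component} can compile" "$log" \
        | sed -n 's/.*result: //p' \
        | tail -n 1
}

# Example (run from the OpenMPI build directory):
# mca_component_result mtl:psm2
```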
Working across two nodes on Thomas!
Install PSM2 on OmniPath clusters (on Myriad we're using UCX)
Install OpenMPI 4.1.1 everywhere
Myriad needs UCX 1.9.0 for OpenMPI 4.1.1 to be able to run multi-node (there is a bug in 1.8.0), so changing to that.
Now running fine multi-node on Myriad too.
These modules are needed on the clusters other than Myriad:
module unload -f compilers mpi
module load compilers/gnu/4.9.2
module load numactl/2.0.12
module load psm2/11.2.185/gnu-4.9.2
module load mpi/openmpi/4.1.1/gnu-4.9.2
These are needed on Myriad:
module unload compilers mpi
module load compilers/gnu/4.9.2
module load numactl/2.0.12
module load binutils/2.29.1/gnu-4.9.2
module load ucx/1.9.0/gnu-4.9.2
module load mpi/openmpi/4.1.1/gnu-4.9.2
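After loading either set of modules, one way to confirm the build picked up the expected transport is to look for the component in the ompi_info listing. A rough sketch: has_mca_component is a made-up helper name, and it assumes the usual "MCA framework: component (...)" line format that ompi_info prints.

```shell
# Hypothetical helper: succeed if the ompi_info output on stdin lists the
# given framework/component pair.
# Usage: ompi_info | has_mca_component mtl psm2
has_mca_component() {
    framework="$1"
    component="$2"
    grep -Eq "MCA ${framework}: ${component} "
}

# With the modules above loaded, the expectation would be the PSM2 MTL on the
# OmniPath clusters and the UCX PML on Myriad:
# ompi_info | has_mca_component mtl psm2 && echo "psm2 MTL present"
# ompi_info | has_mca_component pml ucx  && echo "ucx PML present"
```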
It is not working across two nodes on Young:
node-c12m-005.22538PSM2 can't open hfi unit: -1 (err=23)
node-c12l-008.62402PSM2 can't open hfi unit: -1 (err=23)
node-c12m-005.22538hfi_userinit_internal: assign_context command failed: Device or resource busy
node-c12m-005.22538hfp_gen1_context_open: hfi_userinit_internal: failed, trying again (1/3)
For now we have set OMPI_MCA_btl=vader in the modulefile for mpi/openmpi/4.1.1/gnu-4.9.2 on the OmniPath clusters so that it will work multi-node, even if a bit slower than it would be with a different transport layer.
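Since the modulefile sets this through the environment, it can help to see which MCA parameters a job will inherit. A small sketch: list_ompi_mca_overrides is a made-up helper name, and the export line only mirrors the value quoted above for illustration.

```shell
# Hypothetical helper: list every MCA parameter currently set via the
# environment (OpenMPI reads variables named OMPI_MCA_<parameter>).
list_ompi_mca_overrides() {
    env | grep '^OMPI_MCA_' | sort
}

# Mirror of the modulefile setting, for a single shell or job script:
export OMPI_MCA_btl=vader
list_ompi_mca_overrides
```

The same parameter can also be given per run on the command line, e.g. `mpirun --mca btl vader ./my_program`, where ./my_program is a placeholder.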
[IN:04819158], RCE-934.
Requested for Orca 5 binaries. (Could then also install those).
Thought this should be fast, but OpenMPI 4.1.x needs a newer libpsm2 than we have in the image to be able to run on the OmniPath clusters; see #409, where we went back to 4.0.x instead.
Going to see if a source build of PSM2 will work. It is supposed to be buildable from source on RHEL 7.2 onwards.