UCL-RITS / rcps-buildscripts

Scripts to automate package builds on RC Platforms
MIT License

Install Request: OpenMPI 4.1.1 #457

Open heatherkellyucl opened 3 years ago

heatherkellyucl commented 3 years ago

[IN:04819158], RCE-934.

Requested for Orca 5 binaries. (Could then also install those).

Thought this would be quick, but OpenMPI 4.1.x needs a newer libpsm2 than the one in our image to run on the OmniPath clusters. See #409, where we went back to 4.0.x instead.

Going to see if a source build of PSM2 will work. It is supposed to be buildable from source on RHEL 7.2 onwards.

heatherkellyucl commented 3 years ago

OpenMPI config looking promising!

configure:333774: checking if MCA component mtl:psm2 can compile
configure:333776: result: yes

The result before was:

configure:333631: WARNING: PSM2 needs to be version 11.2.173 or later. Disabling MTL.
configure:334237: checking if MCA component mtl:psm2 can compile
configure:334239: result: no
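
To repeat this check after a rebuild, something like the following could pull the mtl:psm2 result out of the OpenMPI config.log. This is only a sketch: the function name psm2_mtl_status is made up for illustration, and it assumes the check line and its result line appear in the same format as in the excerpts above.

```shell
# Hypothetical helper (not part of the buildscripts): print "yes" or "no"
# for the mtl:psm2 compile check recorded in an Open MPI config.log.
psm2_mtl_status() {
    log="${1:-config.log}"
    # grep -A1 keeps the check line plus the line after it;
    # sed then prints only the result value, with the configure prefix removed.
    grep -A1 'checking if MCA component mtl:psm2 can compile' "$log" \
        | sed -n 's/^configure:[0-9]*: result: //p'
}
```

Run as `psm2_mtl_status /path/to/config.log` from wherever the build directory is; with no argument it looks for config.log in the current directory.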

heatherkellyucl commented 3 years ago

Working across two nodes on Thomas!

Install PSM2 on OmniPath clusters (on Myriad we're using UCX)

Install OpenMPI 4.1.1 everywhere

heatherkellyucl commented 3 years ago

Myriad needs UCX 1.9.0 for OpenMPI 4.1.1 (bug in 1.8.0) to be able to run multi-node, changing to that.

heatherkellyucl commented 3 years ago

Now running fine multi-node on Myriad too.

heatherkellyucl commented 3 years ago

These modules are needed on the clusters other than Myriad:

module unload -f compilers mpi
module load compilers/gnu/4.9.2
module load numactl/2.0.12
module load psm2/11.2.185/gnu-4.9.2
module load mpi/openmpi/4.1.1/gnu-4.9.2

These are needed on Myriad:

module unload compilers mpi
module load compilers/gnu/4.9.2
module load numactl/2.0.12
module load binutils/2.29.1/gnu-4.9.2
module load ucx/1.9.0/gnu-4.9.2
module load mpi/openmpi/4.1.1/gnu-4.9.2
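
If one job script has to work on both kinds of cluster, the two lists above could be wrapped in a small helper. This is a rough sketch only: the function name openmpi_411_modules and the "myriad" argument are invented here, and how a script works out which cluster it is on is left to the caller.

```shell
# Hypothetical helper: print, in load order, the modules listed above.
# "myriad" gives the UCX set; any other argument gives the PSM2 set
# used on the other clusters.
openmpi_411_modules() {
    case "$1" in
        myriad)
            printf '%s\n' \
                compilers/gnu/4.9.2 \
                numactl/2.0.12 \
                binutils/2.29.1/gnu-4.9.2 \
                ucx/1.9.0/gnu-4.9.2 \
                mpi/openmpi/4.1.1/gnu-4.9.2
            ;;
        *)
            printf '%s\n' \
                compilers/gnu/4.9.2 \
                numactl/2.0.12 \
                psm2/11.2.185/gnu-4.9.2 \
                mpi/openmpi/4.1.1/gnu-4.9.2
            ;;
    esac
}

# Example use in a job script (needs the module command, so left commented out):
#   module unload compilers mpi      # with -f on the non-Myriad clusters, as above
#   for m in $(openmpi_411_modules myriad); do module load "$m"; done
```

The helper only prints names, so the order and versions stay in one place and the unload step remains explicit in the job script.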

heatherkellyucl commented 3 years ago

Is not working across two nodes on Young...

node-c12m-005.22538PSM2 can't open hfi unit: -1 (err=23)
node-c12l-008.62402PSM2 can't open hfi unit: -1 (err=23)
node-c12m-005.22538hfi_userinit_internal: assign_context command failed: Device or resource busy
node-c12m-005.22538hfp_gen1_context_open: hfi_userinit_internal: failed, trying again (1/3)

heatherkellyucl commented 1 year ago

For now we have set OMPI_MCA_btl=vader in the modulefile for mpi/openmpi/4.1.1/gnu-4.9.2 on the OmniPath clusters so that it works multi-node, though it may be a bit slower than it would be with a different transport layer.
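
For anyone wanting to see or reproduce that setting outside the modulefile: Open MPI reads MCA parameters from environment variables named OMPI_MCA_<parameter>, so the lines below should be equivalent for the current shell. This is a sketch, not the modulefile itself; the program name ./my_app in the comment is a placeholder.

```shell
# Same setting as the modulefile, applied to the current shell only.
export OMPI_MCA_btl=vader

# Show every MCA parameter currently set through the environment.
env | grep '^OMPI_MCA_'

# A value given on the mpirun command line takes precedence over the
# environment, so a single run can try something else, e.g.:
#   mpirun --mca btl <other-list> ./my_app
```

Checking `env | grep '^OMPI_MCA_'` after loading the module is a quick way to confirm whether the workaround is active in a given job.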