clusterinthecloud / terraform

Terraform config for Cluster in the Cloud
https://cluster-in-the-cloud.readthedocs.io
MIT License

Available modules differ between management and compute nodes #18

Open joeiznogood opened 5 years ago

joeiznogood commented 5 years ago

I am trying to run some MPI jobs on my CitC. However, the available MPI modules are different on the management and compute nodes. I compiled my code on the shared filesystem while logged in to the management node, using the module mpi/openmpi-x86_64. However, when I then tried to load that module on the compute nodes (as part of my job script), I was told that it did not exist.

Below are the listed available modules on the management node:

[joealex@mgmt run_test]$ module avail

--------------------------------------------------------- /usr/share/Modules/modulefiles ---------------------------------------------------------
dot  module-git  module-info  modules  null  use.own

---------------------------------------------------------------- /etc/modulefiles ----------------------------------------------------------------
mpi/mpich-3.2-x86_64  mpi/openmpi3-x86_64  mpi/openmpi-x86_64

And the compute node:

[opc@vm-standard2-2-ad1-0001 ~]$ module avail

--------------------------------------------------------- /usr/share/Modules/modulefiles ---------------------------------------------------------
dot  module-git  module-info  modules  null  use.own

---------------------------------------------------------------- /etc/modulefiles ----------------------------------------------------------------
mpi/mpich-3.0-x86_64  mpi/mpich-x86_64  mpi/openmpi3-x86_64

There is only one that overlaps: mpi/openmpi3-x86_64.
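
To make the mismatch explicit, the two listings above can be compared as sets. This is only a minimal sketch in Python: the module names are copied from the /etc/modulefiles sections of the `module avail` output above, and the variable names are purely illustrative.

```python
# Module names copied from the /etc/modulefiles sections of the two listings.
mgmt_modules = {
    "mpi/mpich-3.2-x86_64",
    "mpi/openmpi3-x86_64",
    "mpi/openmpi-x86_64",
}
compute_modules = {
    "mpi/mpich-3.0-x86_64",
    "mpi/mpich-x86_64",
    "mpi/openmpi3-x86_64",
}

# Set operations give the overlap and the node-specific modules.
common = mgmt_modules & compute_modules
only_mgmt = mgmt_modules - compute_modules
only_compute = compute_modules - mgmt_modules

print("on both nodes:", sorted(common))
print("mgmt only:", sorted(only_mgmt))
print("compute only:", sorted(only_compute))
```

The module I compiled against, mpi/openmpi-x86_64, ends up in the "mgmt only" group, which matches the error from the job script.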

christopheredsall commented 5 years ago

Hi Joe,

Thanks for reporting that. It's an odd one, because we use the OS package manager to install the MPIs and we don't specify a version, just the name (here's the commit that does that: https://github.com/ACRC/slurm-ansible-playbook/commit/340846035060866b87583b82f560ea68ab223326).

One difference between the management and compute nodes is that the *-devel packages are installed on the management node and only the runtime ones on the compute nodes. That could be where the divergence is coming in.

Since Oracle release regular updates to the OS images, and these are configured in both the Terraform (for mgmt) and Ansible (for compute) repos, there is a chance they might get out of sync. That isn't the case here, though; both are currently on the 2019.02.20 release:

| Node | Image | Source |
| --- | --- | --- |
| mgmt | Oracle-Linux-7.6-2019.02.20-0 | https://github.com/ACRC/oci-cluster-terraform/blob/827d73d5f4ef3ae6d7d6e4f071a6d6f20cb1d7d7/variables.tf#L36-L41 |
| compute | Oracle-Linux-7.6-2019.02.20-0 | https://github.com/ACRC/slurm-ansible-playbook/blob/71947683f6a1a0da3d299fe65ea3200665589247/roles/slurm/files/citc_oci.py#L141-L146 |
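
A check for the two image settings drifting apart could look something like the sketch below. The image names are the ones listed above; the function name `image_release` and the regular expression are assumptions made for illustration and are not code from either repo.

```python
import re


def image_release(image_name):
    """Return the (OS version, release date) parts of an Oracle Linux image name.

    Assumes names of the form Oracle-Linux-<version>-<YYYY.MM.DD>-<build>.
    """
    match = re.fullmatch(
        r"Oracle-Linux-(\d+\.\d+)-(\d{4}\.\d{2}\.\d{2})-(\d+)", image_name
    )
    if match is None:
        raise ValueError(f"unrecognised image name: {image_name}")
    return match.group(1), match.group(2)


# Values as currently configured for each node type.
mgmt_image = "Oracle-Linux-7.6-2019.02.20-0"
compute_image = "Oracle-Linux-7.6-2019.02.20-0"

in_sync = image_release(mgmt_image) == image_release(compute_image)
print("images in sync:", in_sync)
```

Comparing the parsed version and date, rather than the raw strings, means a difference in only the trailing build number would not be flagged; whether that is the right choice is an open question.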

We're still looking into this. One option would be to install a non-OS MPI package with EasyBuild or Spack. I'll see if I can come up with a tested workaround.

For "newer" MPIs we need a rebuild of the Slurm packages with PMIx support (https://github.com/ACRC/slurm-ansible-playbook/issues/24). @milliams is working on this (https://github.com/ACRC/slurm-ansible-playbook/pull/27), but he is away this week. I also think that, because of the change we reverted in #17, we can't just point at the slurm18 branch in terraform.tfvars.