ComputeCanada / puppet-magic_castle

Puppet Environment repo for Magic Castle - https://github.com/ComputeCanada/magic_castle
MIT License

Building `slurm-slurmd` fails on GPU node with CentOS8 and v11.2 #151

Closed SebastianAchilles closed 2 years ago

SebastianAchilles commented 3 years ago

When I build a cluster with this main.tf on JUSUF:

terraform {
  required_version = ">= 0.14.2"
}

module "openstack" {
  source         = "./openstack"
  config_git_url = "https://github.com/ComputeCanada/puppet-magic_castle.git"
  config_version = "11.2"

  cluster_name = "jusuf"
  domain       = "domain.de"
  image        = "CentOS-8-GenericCloud-8.2.2004-20200611.2.x86_64"

  instances = {
    mgmt  = { type = "gpp.m", tags = ["puppet", "mgmt", "nfs"], count = 1 }
    jsfl  = { type = "gpp.l", tags = ["login", "public", "proxy"], count = 1 }
    jsfg  = { type = "gpu.l", tags = ["node"], count = 1 }
    jsfc  = { type = "gpp.l", tags = ["node"], count = 1 }
  }

  volumes = {
    nfs = {
      home     = { size = 100 }
      project  = { size = 50 }
      scratch  = { size = 50 }
    }
  }

  generate_ssh_key = true

  public_keys = [file("~/.ssh/id_rsa.pub")]

  nb_users     = 0
  # Shared password, randomly chosen if blank
  guest_passwd = ""

  # OpenStack specific
  os_floating_ips = { jsfl1 = "xxx.xxx.xxx.xxx"}

}

output "accounts" {
  value = module.openstack.accounts
}

output "public_ip" {
  value = module.openstack.public_ip
}

installing slurm-slurmd fails on the GPU node with:

Okt 06 05:29:02 jsfg1.int.jusuf.domain.de puppet-agent[1011]: (/Stage[main]/Profile::Slurm::Node/Package[slurm-slurmd]/ensure) change from 'purged' to 'present' failed: Execution of '/usr/bin/dnf -d 0 -e 1 -y install slurm-slurmd' returned 1: Error:
Okt 06 05:29:02 jsfg1.int.jusuf.domain.de puppet-agent[1011]: (/Stage[main]/Profile::Slurm::Node/Package[slurm-slurmd]/ensure)  Problem: cannot install the best candidate for the job
Okt 06 05:29:02 jsfg1.int.jusuf.domain.de puppet-agent[1011]: (/Stage[main]/Profile::Slurm::Node/Package[slurm-slurmd]/ensure)   - nothing provides libhwloc.so.5()(64bit) needed by slurm-slurmd-20.11.7-1.el8.x86_64

To get libhwloc.so.5 I had to install the compatibility package manually:

sudo yum install http://mirror.centos.org/centos/8-stream/BaseOS/x86_64/os/Packages/compat-hwloc1-2.2.0-3.el8.x86_64.rpm

because the hwloc from the OS repo was too new. On the CPU node, installing slurm-slurmd worked directly.

I also tested v11.4 and v11.5, but I got a different error. That is why I am using v11.2 at the moment.
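For anyone hitting a similar failure, here is a minimal sketch of pulling the unresolved capability out of the dnf error so it can be looked up. The helper name `missing_deps` is hypothetical and only exists for this illustration; it assumes the error text has the same "nothing provides X needed by Y" shape as the log above.

```shell
# Hypothetical helper: read dnf or puppet-agent output on stdin and print
# "<missing capability> <package that needs it>" for each unresolved dependency.
missing_deps() {
  sed -n 's/.*nothing provides \([^ ]*\) needed by \([^ ]*\).*/\1 \2/p'
}

# Example using the line from the log above:
printf '%s\n' '  - nothing provides libhwloc.so.5()(64bit) needed by slurm-slurmd-20.11.7-1.el8.x86_64' \
  | missing_deps
# prints: libhwloc.so.5()(64bit) slurm-slurmd-20.11.7-1.el8.x86_64
```

On the node itself, the printed capability can then be passed to `dnf provides 'libhwloc.so.5()(64bit)'` to see whether any enabled repository still ships it.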

cmd-ntrf commented 3 years ago

Magic Castle Slurm RPMs are built with COPR: https://copr.fedorainfracloud.org/coprs/cmdntrf/

slurm-slurmd-20.11.7-1.el8.x86_64 was built 4 months ago against hwloc-devel 1.11.9-3.el8. Unfortunately, CentOS 8 has since replaced hwloc and hwloc-devel with version 2.2.0-1.el8, which no longer provides libhwloc.so.5, hence the error you obtained.

I have triggered a rebuild of the slurm-slurmd RPM for CentOS 8 so that it is built against hwloc 2.2.0 instead. This should fix the issue.
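One way to check whether an installed or candidate package still depends on the old library is sketched below. The function name `needs_old_hwloc` is hypothetical; it only assumes that the requirement list has one capability per line, as `rpm -q --requires` prints it.

```shell
# Hypothetical check: read a package requirement list on stdin and succeed
# (exit status 0) only if it still asks for the hwloc 1.x soname libhwloc.so.5.
needs_old_hwloc() {
  grep -q '^libhwloc\.so\.5()'
}

# Usage on a node (assumes slurm-slurmd is installed):
#   rpm -q --requires slurm-slurmd | needs_old_hwloc \
#     && echo "still linked against hwloc 1.x" \
#     || echo "no dependency on libhwloc.so.5"
```

If the rebuilt RPM is not picked up right away, running `dnf clean metadata` before retrying the install may help, since the node could still hold cached repository metadata.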