access-ci-org / Jetstream_Cluster

Scripts and Ansible Playbooks for building an HPC-style resource in Jetstream
MIT License

Problem: Package installation failure and missing slurm user #8

Status: Closed (julianpistorius closed this 2 years ago)

julianpistorius commented 2 years ago

@soichih and I both ran into the following problem. local_create.log shows:

...
Error: 
 Problem: problem with installed package Lmod-8.2.7-1.el8.x86_64
  - package lmod-ohpc-8.5.22-3.1.ohpc.2.4.x86_64 conflicts with Lmod provided by Lmod-8.2.7-1.el8.x86_64
  - cannot install the best candidate for the job
(try to add '--allowerasing' to command line to replace conflicting packages or '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages)
...
chown: invalid user: ‘slurm:slurm’
setfacl: Option -m: Invalid argument near character 3
setfacl: Option -m: Invalid argument near character 3
chown: invalid user: ‘slurm:slurm’
chown: invalid user: ‘slurm:slurm’
chown: invalid user: ‘slurm:slurm’
setfacl: Option -m: Invalid argument near character 3
setfacl: Option -m: Invalid argument near character 3
chown: invalid user: ‘slurm:slurm’
chown: invalid user: ‘slurm:slurm’
cp: cannot create regular file '/etc/ansible/': Not a directory
cp: cannot create regular file '/etc/ansible/': Not a directory
Creating compute image! based on 
./install_local.sh: line 230: ansible-playbook: command not found
rm: cannot remove '/tmp/.ansible': No such file or directory
Failed to enable unit: Unit file slurmctld.service does not exist.
Failed to restart munge.service: Unit munge.service not found.
Failed to restart slurmctld.service: Unit slurmctld.service not found.
...

Afterwards there is also no sbatch binary/script anywhere on the head node, so it looks like Slurm is not installed at all.
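To narrow down which pieces the failed package step left missing, a quick check along these lines may help. This is only a diagnostic sketch, not part of install_local.sh; the helper names `check_cmd` and `check_user` are made up for illustration, and the package names are the two from the conflict message in the log.

```shell
#!/usr/bin/env bash
# Diagnostic sketch (hypothetical helpers, not part of install_local.sh):
# report which commands and users from the log above are actually present.

# Print "ok" if the command is on PATH, otherwise "missing".
check_cmd() {
  if command -v "$1" >/dev/null 2>&1; then echo "ok"; else echo "missing"; fi
}

# Print "ok" if the user exists, otherwise "missing".
check_user() {
  if id -u "$1" >/dev/null 2>&1; then echo "ok"; else echo "missing"; fi
}

for cmd in sbatch srun ansible-playbook; do
  printf 'command %-16s %s\n' "$cmd" "$(check_cmd "$cmd")"
done
printf 'user    %-16s %s\n' slurm "$(check_user slurm)"

# Show which of the two conflicting Lmod packages is installed.
# rpm -q exits non-zero when a package is absent, so ignore its status.
if command -v rpm >/dev/null 2>&1; then
  rpm -q Lmod lmod-ohpc || true
fi
```

If the slurm user and sbatch both show as missing while Lmod-8.2.7 is installed, that would be consistent with the package transaction aborting on the Lmod conflict before Slurm was installed, with the later chown and systemctl errors as follow-on failures.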

DImuthuUpe commented 2 years ago

@julianpistorius can you confirm the branch you are using?

julianpistorius commented 2 years ago

> @julianpistorius can you confirm the branch you are using?

rocky-linux

julianpistorius commented 2 years ago

Fix for this: https://github.com/XSEDE/CRI_Jetstream_Cluster/pull/9

I still see this when I run a job:

$ cat nodes_2.out 
environment: /usr/share/lmod/lmod/libexec/lmod: No such file or directory
environment: /usr/share/lmod/lmod/libexec/lmod: No such file or directory
/tmp/slurmd/job00002/slurm_script: line 8: mpirun: command not found

Update: This only happens when running sbatch or srun as root. It seems to work when I run them as the rocky user.
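One way to compare the two accounts is to run the same small probe as each user and diff the output. This is a rough sketch: `probe_env` is a hypothetical helper, and the default path is simply the one from the error message above.

```shell
#!/usr/bin/env bash
# Sketch of a per-user probe (hypothetical helper): print whether the Lmod
# binary named in the error exists and whether mpirun is on PATH.
probe_env() {
  # Default to the path reported in nodes_2.out; pass another path to override.
  local lmod_path="${1:-/usr/share/lmod/lmod/libexec/lmod}"
  echo "user=$(id -un)"
  if [ -e "$lmod_path" ]; then echo "lmod=present"; else echo "lmod=absent"; fi
  if command -v mpirun >/dev/null 2>&1; then echo "mpirun=found"; else echo "mpirun=not-found"; fi
}

probe_env
```

Running it once as rocky and once in a root login shell (for example via `sudo -i`) should show whether the difference is in the files on disk or only in each user's shell environment. That distinction is an assumption about where the problem lies, not something the output above confirms.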

julianpistorius commented 2 years ago

Fixed by #9