I manually downloaded all of the RPMs and installed them:
wget https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/VnkLhYXOSNVilVa9d24Riz1fz4Ul-KTXeK4HCKoyqv0ghW3gry3Xz8CZqloqphLw/n/hpc/b/source/o/slurm/slurm-centos-23.02.1-1.el7.x86_64.rpm
wget https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/VnkLhYXOSNVilVa9d24Riz1fz4Ul-KTXeK4HCKoyqv0ghW3gry3Xz8CZqloqphLw/n/hpc/b/source/o/slurm/slurm-centos-devel-23.02.1-1.el7.x86_64.rpm
wget https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/VnkLhYXOSNVilVa9d24Riz1fz4Ul-KTXeK4HCKoyqv0ghW3gry3Xz8CZqloqphLw/n/hpc/b/source/o/slurm/slurm-centos-contribs-23.02.1-1.el7.x86_64.rpm
wget https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/VnkLhYXOSNVilVa9d24Riz1fz4Ul-KTXeK4HCKoyqv0ghW3gry3Xz8CZqloqphLw/n/hpc/b/source/o/slurm/slurm-centos-perlapi-23.02.1-1.el7.x86_64.rpm
wget https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/VnkLhYXOSNVilVa9d24Riz1fz4Ul-KTXeK4HCKoyqv0ghW3gry3Xz8CZqloqphLw/n/hpc/b/source/o/slurm/slurm-centos-torque-23.02.1-1.el7.x86_64.rpm
wget https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/VnkLhYXOSNVilVa9d24Riz1fz4Ul-KTXeK4HCKoyqv0ghW3gry3Xz8CZqloqphLw/n/hpc/b/source/o/slurm/slurm-centos-openlava-23.02.1-1.el7.x86_64.rpm
wget https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/VnkLhYXOSNVilVa9d24Riz1fz4Ul-KTXeK4HCKoyqv0ghW3gry3Xz8CZqloqphLw/n/hpc/b/source/o/slurm/slurm-centos-slurmctld-23.02.1-1.el7.x86_64.rpm
wget https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/VnkLhYXOSNVilVa9d24Riz1fz4Ul-KTXeK4HCKoyqv0ghW3gry3Xz8CZqloqphLw/n/hpc/b/source/o/slurm/slurm-centos-slurmdbd-23.02.1-1.el7.x86_64.rpm
wget https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/VnkLhYXOSNVilVa9d24Riz1fz4Ul-KTXeK4HCKoyqv0ghW3gry3Xz8CZqloqphLw/n/hpc/b/source/o/slurm/slurm-centos-pam_slurm-23.02.1-1.el7.x86_64.rpm
wget https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/VnkLhYXOSNVilVa9d24Riz1fz4Ul-KTXeK4HCKoyqv0ghW3gry3Xz8CZqloqphLw/n/hpc/b/source/o/slurm/slurm-centos-libpmi-23.02.1-1.el7.x86_64.rpm
wget https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/VnkLhYXOSNVilVa9d24Riz1fz4Ul-KTXeK4HCKoyqv0ghW3gry3Xz8CZqloqphLw/n/hpc/b/source/o/slurm/slurm-centos-slurmd-23.02.1-1.el7.x86_64.rpm
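The same downloads can be written as a loop over the package names above (a convenience sketch, not what was actually run):
# fetch all of the Slurm RPMs from the same base URL
base=https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/VnkLhYXOSNVilVa9d24Riz1fz4Ul-KTXeK4HCKoyqv0ghW3gry3Xz8CZqloqphLw/n/hpc/b/source/o/slurm
for pkg in slurm-centos slurm-centos-devel slurm-centos-contribs slurm-centos-perlapi slurm-centos-torque slurm-centos-openlava slurm-centos-slurmctld slurm-centos-slurmdbd slurm-centos-pam_slurm slurm-centos-libpmi slurm-centos-slurmd; do
  wget "$base/${pkg}-23.02.1-1.el7.x86_64.rpm"
done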
Download PMIx
wget https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/tgnPgvPv68JpWqLklTNY86rBsJ0z7Ebp3zs7Ud4X2_R8TZFgpm26kh08QHKI3dXU/n/hpc/b/source/o/pmix/pmix-centos-3.2.4-1.el7.x86_64.rpm
wget https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/tgnPgvPv68JpWqLklTNY86rBsJ0z7Ebp3zs7Ud4X2_R8TZFgpm26kh08QHKI3dXU/n/hpc/b/source/o/pmix/pmix-centos-devel-3.2.4-1.el7.x86_64.rpm
Install them on each node
ssh compute-permanent-node-102
cd /data/slurm_rpms
sudo yum install *
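To avoid repeating this by hand on every node, the same install can be pushed out in one shot (a sketch; the node_list file of compute hostnames is hypothetical):
# run the local RPM install on every node listed in node_list
pdsh -w ^node_list "cd /data/slurm_rpms && sudo yum install -y ./*.rpm"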
Another error
TASK [etc-hosts : create bastion part of the /etc/hosts files for the compute nodes] ******************************************************************************************************************************
fatal: [watch-tower-bastion -> 127.0.0.1]: FAILED! =>
msg: |-
The task includes an option with an undefined variable. The error was: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_fqdn'
The error appears to be in '/opt/oci-hpc/playbooks/roles/etc-hosts/tasks/common.yml': line 2, column 3, but may
be elsewhere in the file depending on the exact syntax problem.
The offending line appears to be:
---
- name: create bastion part of the /etc/hosts files for the compute nodes
/opt/oci-hpc/playbooks/roles/etc-hosts/templates/etc-hosts-bastion.j2"
vim /etc/ansible/hosts
Changed the inventory and added ansible_fqdn for each host:
[bastion]
watch-tower-bastion ansible_host=172.16.0.25 ansible_user=opc role=bastion ansible_fqdn=watch-tower-bastion.public.cluster.oraclevcn.com
[slurm_backup]
watch-tower-backup ansible_host=172.16.0.8 ansible_user=opc role=bastion ansible_fqdn=watch-tower-backup.public.cluster.oraclevcn.com
[login]
watch-tower-login ansible_host=172.16.0.238 ansible_user=opc role=login ansible_fqdn=watch-tower-login.public.cluster.oraclevcn.com
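Before rerunning the playbook, the variable can be sanity-checked with an ad-hoc call (assuming the same inventory path):
# confirm ansible_fqdn now resolves for the bastion group
ansible -i /etc/ansible/hosts bastion -m debug -a "var=ansible_fqdn"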
Rerun the playbook
ansible-playbook -i /tmp/_etc_ansible_hosts_add /opt/oci-hpc/playbooks/resize_add.yml
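To confirm the template rendered this time, spot-check the generated entry on one of the new nodes, e.g.:
# the bastion line should now appear in /etc/hosts on the compute node
ssh compute-permanent-node-102 "grep watch-tower-bastion /etc/hosts"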
I commented out the download and install tasks in https://github.com/centerforaisafety/cerberus-cluster/blob/e2febaf354ea239501724fbae0e947cf0dc2bc4b/playbooks/roles/slurm/tasks/common.yml#L64-L110
I actually installed more than necessary on the compute nodes, but I think it should be okay?
Slurm and the nodes are working. Weka isn't, though.
I ran the following, but it failed to properly mount /data
cat weka_hosts | xargs -I {} -P 0 scp weka-4.2.1.tar {}:/tmp/
cat weka_hosts | xargs -I {} -P 0 ssh {} "cd /tmp && tar xf weka-4.2.1.tar"
cat weka_hosts | xargs -I {} -P 0 ssh {} "cd /tmp/weka-4.2.1 && sudo ./install.sh"
cat weka_hosts | xargs -I {} -P 1 ssh {} "hostname && weka local ps"
pdsh -w ^weka_hosts sudo weka local stop
pdsh -w ^weka_hosts sudo weka local rm --all -f
# set cores equal to the number of drives per node (4 for the prod cluster). Be sure to specify core IDs that align with those set up in slurm.conf
pdsh -w ^weka_hosts sudo weka local setup container --name drives0 --cores 8 --core-ids 0,1,2,3,64,65,66,67 --only-drives-cores --net ens300
pdsh -w ^weka_hosts sudo weka local setup container --name compute0 --cores 10 --core-ids 4,5,6,7,68,69,70,71,72,73 --only-compute-cores --memory 128GB --base-port 14200 --net ens300 --join-ips $(cat manager)
pdsh -w ^weka_hosts sudo weka local setup container --name frontends0 --cores 4 --core-ids 8,9,74,75 --only-frontend-cores --base-port 14100 --net ens300 --join-ips $(cat manager)
pdsh -w ^weka_hosts sudo mkdir /mnt/weka
pdsh -w ^weka_hosts sudo mount -t wekafs default /data
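As an aside, the mkdir above creates /mnt/weka while the mount targets /data; if /data is the intended mount point it has to exist on each node first (a sketch, assuming /data is the right path):
# create the mount point before mounting the 'default' wekafs filesystem
pdsh -w ^weka_hosts sudo mkdir -p /data
pdsh -w ^weka_hosts sudo mount -t wekafs default /data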
Error
05:16:12 opc@compute-permanent-node-102:~/weka
$ weka status
This host is in STEM mode
STEM mode means this host is not part of a cluster yet.
Create a new cluster by running:
weka cluster create <hosts-hostnames>...
Or add this host to an existing cluster by running:
weka -H <backend-hostname> cluster host add <this-hostname>
# Fixed by
# on node 102
weka -H compute-permanent-node-35 cluster host add compute-permanent-node-102
# on node 978
weka -H compute-permanent-node-35 cluster container add compute-permanent-node-978
Then I added the drives
# on node 102
weka cluster drive add 135 /dev/nvme{0..7}n1 --force
# on node 978
weka cluster drive add 136 /dev/nvme{0..7}n1 --force
The drives were added, but the frontend never showed up. Now I'm lost; I need to contact Weka for support.
Ways to see the errors
# on bastion
weka cluster container
weka cluster drive
# on node 102 or 978
$ ssh compute-permanent-node-102
$ weka status
weka local ps
I was able to add both nodes to Weka. I created a PR (https://github.com/centerforaisafety/eng-docs/pull/39) for our Weka docs with updated instructions for how to add new nodes.
Was there anything else needed to recover the state? Did you need to deactivate them first or not?
@steven-basart
Updated https://github.com/centerforaisafety/eng-docs/pull/39 with a troubleshooting section that describes what should be done if this error is encountered in the future.
Resize playbooks are no longer failing on new cluster. Closing.
We've commented out a few plays in the resize playbook because they were failing and preventing us from adding two nodes to the cluster. We need to figure out why they fail and then uncomment them.