centerforaisafety / cerberus-cluster

HPC cluster code and configurations for running on OCI
Universal Permissive License v1.0

Figure out why resize playbooks are failing #211

Closed: andriy-safe-ai closed this issue 3 months ago

andriy-safe-ai commented 1 year ago

We've commented out a few plays in the resize playbook because they were failing and preventing us from adding two nodes to the cluster. We need to figure out why those plays fail and then re-enable them.

steven-safeai commented 1 year ago

I manually downloaded all of the RPMs and installed them:

wget https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/VnkLhYXOSNVilVa9d24Riz1fz4Ul-KTXeK4HCKoyqv0ghW3gry3Xz8CZqloqphLw/n/hpc/b/source/o/slurm/slurm-centos-23.02.1-1.el7.x86_64.rpm
wget https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/VnkLhYXOSNVilVa9d24Riz1fz4Ul-KTXeK4HCKoyqv0ghW3gry3Xz8CZqloqphLw/n/hpc/b/source/o/slurm/slurm-centos-devel-23.02.1-1.el7.x86_64.rpm
wget https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/VnkLhYXOSNVilVa9d24Riz1fz4Ul-KTXeK4HCKoyqv0ghW3gry3Xz8CZqloqphLw/n/hpc/b/source/o/slurm/slurm-centos-contribs-23.02.1-1.el7.x86_64.rpm
wget https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/VnkLhYXOSNVilVa9d24Riz1fz4Ul-KTXeK4HCKoyqv0ghW3gry3Xz8CZqloqphLw/n/hpc/b/source/o/slurm/slurm-centos-perlapi-23.02.1-1.el7.x86_64.rpm
wget https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/VnkLhYXOSNVilVa9d24Riz1fz4Ul-KTXeK4HCKoyqv0ghW3gry3Xz8CZqloqphLw/n/hpc/b/source/o/slurm/slurm-centos-torque-23.02.1-1.el7.x86_64.rpm
wget https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/VnkLhYXOSNVilVa9d24Riz1fz4Ul-KTXeK4HCKoyqv0ghW3gry3Xz8CZqloqphLw/n/hpc/b/source/o/slurm/slurm-centos-openlava-23.02.1-1.el7.x86_64.rpm
wget https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/VnkLhYXOSNVilVa9d24Riz1fz4Ul-KTXeK4HCKoyqv0ghW3gry3Xz8CZqloqphLw/n/hpc/b/source/o/slurm/slurm-centos-slurmctld-23.02.1-1.el7.x86_64.rpm
wget https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/VnkLhYXOSNVilVa9d24Riz1fz4Ul-KTXeK4HCKoyqv0ghW3gry3Xz8CZqloqphLw/n/hpc/b/source/o/slurm/slurm-centos-slurmdbd-23.02.1-1.el7.x86_64.rpm
wget https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/VnkLhYXOSNVilVa9d24Riz1fz4Ul-KTXeK4HCKoyqv0ghW3gry3Xz8CZqloqphLw/n/hpc/b/source/o/slurm/slurm-centos-pam_slurm-23.02.1-1.el7.x86_64.rpm
wget https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/VnkLhYXOSNVilVa9d24Riz1fz4Ul-KTXeK4HCKoyqv0ghW3gry3Xz8CZqloqphLw/n/hpc/b/source/o/slurm/slurm-centos-libpmi-23.02.1-1.el7.x86_64.rpm
wget https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/VnkLhYXOSNVilVa9d24Riz1fz4Ul-KTXeK4HCKoyqv0ghW3gry3Xz8CZqloqphLw/n/hpc/b/source/o/slurm/slurm-centos-slurmd-23.02.1-1.el7.x86_64.rpm
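
All of these share the same pre-authenticated base URL, so the downloads can equivalently be written as a loop (a sketch covering exactly the packages above):

base="https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/VnkLhYXOSNVilVa9d24Riz1fz4Ul-KTXeK4HCKoyqv0ghW3gry3Xz8CZqloqphLw/n/hpc/b/source/o/slurm"
for pkg in "" -devel -contribs -perlapi -torque -openlava -slurmctld -slurmdbd -pam_slurm -libpmi -slurmd; do
  wget "${base}/slurm-centos${pkg}-23.02.1-1.el7.x86_64.rpm"
done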

Download PMIx

wget https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/tgnPgvPv68JpWqLklTNY86rBsJ0z7Ebp3zs7Ud4X2_R8TZFgpm26kh08QHKI3dXU/n/hpc/b/source/o/pmix/pmix-centos-3.2.4-1.el7.x86_64.rpm

wget https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/tgnPgvPv68JpWqLklTNY86rBsJ0z7Ebp3zs7Ud4X2_R8TZFgpm26kh08QHKI3dXU/n/hpc/b/source/o/pmix/pmix-centos-devel-3.2.4-1.el7.x86_64.rpm

Install them on each node

ssh compute-permanent-node-102
cd /data/slurm_rpms
sudo yum install ./*.rpm
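
To install on both new nodes in one shot rather than node by node, a pdsh one-liner like the following should work (assuming /data/slurm_rpms is on shared storage and passwordless sudo, as elsewhere in this issue):

pdsh -w compute-permanent-node-102,compute-permanent-node-978 'cd /data/slurm_rpms && sudo yum install -y ./*.rpm'
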
steven-safeai commented 1 year ago

Another error:

TASK [etc-hosts : create bastion part of the /etc/hosts files for the compute nodes] ******************************************************************************************************************************
fatal: [watch-tower-bastion -> 127.0.0.1]: FAILED! => 
  msg: |-
    The task includes an option with an undefined variable. The error was: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_fqdn'

    The error appears to be in '/opt/oci-hpc/playbooks/roles/etc-hosts/tasks/common.yml': line 2, column 3, but may
    be elsewhere in the file depending on the exact syntax problem.

    The offending line appears to be:

    ---
    - name: create bastion part of the /etc/hosts files for the compute nodes

The template being rendered is /opt/oci-hpc/playbooks/roles/etc-hosts/templates/etc-hosts-bastion.j2

vim /etc/ansible/hosts

Added ansible_fqdn to each host entry; the template reads it from hostvars, and it was undefined because facts had not been gathered for these hosts:

[bastion]
watch-tower-bastion ansible_host=172.16.0.25 ansible_user=opc role=bastion ansible_fqdn=watch-tower-bastion.public.cluster.oraclevcn.com
[slurm_backup]
watch-tower-backup ansible_host=172.16.0.8 ansible_user=opc role=bastion ansible_fqdn=watch-tower-backup.public.cluster.oraclevcn.com
[login]
watch-tower-login ansible_host=172.16.0.238 ansible_user=opc role=login ansible_fqdn=watch-tower-login.public.cluster.oraclevcn.com
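
A quick ad-hoc check (not from the original log) that the new variable resolves:

ansible -i /etc/ansible/hosts bastion -m debug -a "var=ansible_fqdn"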

Rerun the playbook

ansible-playbook -i /tmp/_etc_ansible_hosts_add /opt/oci-hpc/playbooks/resize_add.yml
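
While iterating, ansible-playbook can also resume at the previously failing task instead of replaying everything; something like this should work (the task name is taken from the error above):

ansible-playbook -i /tmp/_etc_ansible_hosts_add /opt/oci-hpc/playbooks/resize_add.yml --start-at-task "create bastion part of the /etc/hosts files for the compute nodes"
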
steven-safeai commented 1 year ago

I commented out the downloads and installs in https://github.com/centerforaisafety/cerberus-cluster/blob/e2febaf354ea239501724fbae0e947cf0dc2bc4b/playbooks/roles/slurm/tasks/common.yml#L64-L110

https://github.com/centerforaisafety/cerberus-cluster/blob/e2febaf354ea239501724fbae0e947cf0dc2bc4b/playbooks/roles/slurm/tasks/common_pmix.yml#L14-L37

https://github.com/centerforaisafety/cerberus-cluster/blob/e2febaf354ea239501724fbae0e947cf0dc2bc4b/playbooks/roles/slurm/tasks/compute-rack-aware.yml#L3-L9

I actually installed more than was needed on the compute nodes, but I think it should be okay?

steven-safeai commented 1 year ago

Slurm and the nodes are working. Weka isn't, though.

steven-safeai commented 1 year ago

I ran the following, but it failed to mount /data properly:

# copy the Weka installer tarball to every host in parallel
cat weka_hosts | xargs -I {} -P 0 scp weka-4.2.1.tar {}:/tmp/
# extract and run the installer on every host
cat weka_hosts | xargs -I {} -P 0 ssh {} "cd /tmp && tar xf weka-4.2.1.tar"
cat weka_hosts | xargs -I {} -P 0 ssh {} "cd /tmp/weka-4.2.1 && sudo ./install.sh"
# verify the agent on each host, one at a time so output stays readable
cat weka_hosts | xargs -I {} -P 1 ssh {} "hostname && weka local ps"

pdsh -w ^weka_hosts sudo weka local stop

pdsh -w ^weka_hosts sudo weka local rm --all -f

# Set cores equal to the number of drives per node (4 for the prod cluster).
# Be sure to specify core IDs that align with those set up in slurm.conf.
pdsh -w ^weka_hosts sudo weka local setup container --name drives0 --cores 8 --core-ids 0,1,2,3,64,65,66,67 --only-drives-cores --net ens300

pdsh -w ^weka_hosts sudo weka local setup container --name compute0 --cores 10 --core-ids 4,5,6,7,68,69,70,71,72,73 --only-compute-cores --memory 128GB --base-port 14200 --net ens300 --join-ips $(cat manager)

pdsh -w ^weka_hosts sudo weka local setup container --name frontends0 --cores 4 --core-ids 8,9,74,75 --only-frontend-cores --base-port 14100 --net ens300 --join-ips $(cat manager)
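
A suggested sanity check (not from the original log) that all three containers came up on every host:

pdsh -w ^weka_hosts weka local ps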

pdsh -w ^weka_hosts sudo mkdir -p /mnt/weka
# note: the mount target below is /data, not the /mnt/weka directory created above;
# /data must already exist on each host or the mount will fail
pdsh -w ^weka_hosts sudo mount -t wekafs default /data
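
A quick way to confirm the mount actually landed on every host (suggested check, not from the original log):

pdsh -w ^weka_hosts df -h /data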

Error

05:16:12 opc@compute-permanent-node-102:~/weka 
$ weka status 
  This host is in STEM mode

STEM mode means this host is not part of a cluster yet.
Create a new cluster by running:

  weka cluster create <hosts-hostnames>...

Or add this host to an existing cluster by running:

  weka -H <backend-hostname> cluster host add <this-hostname>

# Fixed by

# on node 102
weka -H compute-permanent-node-35 cluster host add compute-permanent-node-102

# on node 978
weka -H compute-permanent-node-35 cluster container add compute-permanent-node-978

Then I added the drives (the numeric argument is the container ID, as shown by weka cluster container):

# on node 102
weka cluster drive add 135 /dev/nvme{0..7}n1 --force

# on node 978
weka cluster drive add 136 /dev/nvme{0..7}n1 --force

The drives were added, but the frontend never showed up. I'm lost now and need to contact Weka support.

steven-safeai commented 1 year ago

Ways to see the errors:

# on bastion
weka cluster container
weka cluster drive

# on node 102 or 978
ssh compute-permanent-node-102
weka status
weka local ps
andriy-safe-ai commented 1 year ago

I was able to add both nodes to Weka. I created a PR (https://github.com/centerforaisafety/eng-docs/pull/39) for our Weka docs with updated instructions on how to add new nodes.

steven-safeai commented 1 year ago

Was there anything else needed to recover the state? Did you need to deactivate the nodes first or not?

andriy-safe-ai commented 1 year ago

@steven-basart

Updated https://github.com/centerforaisafety/eng-docs/pull/39 with a troubleshooting section that describes what should be done if this error is encountered in the future.

andriy-safe-ai commented 3 months ago

Resize playbooks are no longer failing on new cluster. Closing.