centerforaisafety / cerberus-cluster

HPC cluster code and configurations for running on OCI
Universal Permissive License v1.0

Update playbooks to download SLURM RPM files to `/tmp` always #268

Open andriy-safe-ai opened 7 months ago

The playbook that downloads the SLURM RPM files is hardcoded to download them to /data/slurm_rpms:

- hosts: all
  become: true
  vars:
    slurm_version: "23.02.1-1"
    slurm_all_packages:
      - "slurm-{{slurm_version}}.el{{ansible_distribution_major_version}}.x86_64.rpm"
      - "slurm-devel-{{slurm_version}}.el{{ansible_distribution_major_version}}.x86_64.rpm"
      - "slurm-contribs-{{slurm_version}}.el{{ansible_distribution_major_version}}.x86_64.rpm"
      - "slurm-perlapi-{{slurm_version}}.el{{ansible_distribution_major_version}}.x86_64.rpm"
      - "slurm-torque-{{slurm_version}}.el{{ansible_distribution_major_version}}.x86_64.rpm"
      - "slurm-openlava-{{slurm_version}}.el{{ansible_distribution_major_version}}.x86_64.rpm"
      - "slurm-slurmctld-{{slurm_version}}.el{{ansible_distribution_major_version}}.x86_64.rpm"
      - "slurm-slurmdbd-{{slurm_version}}.el{{ansible_distribution_major_version}}.x86_64.rpm"
      - "slurm-pam_slurm-{{slurm_version}}.el{{ansible_distribution_major_version}}.x86_64.rpm"
      - "slurm-libpmi-{{slurm_version}}.el{{ansible_distribution_major_version}}.x86_64.rpm"
      - "slurm-slurmd-{{slurm_version}}.el{{ansible_distribution_major_version}}.x86_64.rpm"
  tasks:
    - name: Download slurm .rpms
      get_url:
        url: "https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/VnkLhYXOSNVilVa9d24Riz1fz4Ul-KTXeK4HCKoyqv0ghW3gry3Xz8CZqloqphLw/n/hpc/b/source/o/slurm/{{ item }}"
        dest: "/data/slurm_rpms"
      with_items: "{{slurm_all_packages}}"
      delegate_to: 127.0.0.1
      run_once: true
    - name: manually install all of the .rpms together (fails separately)
      shell: yum install -y /data/slurm_rpms/{{slurm_all_packages[0]}} \
        /data/slurm_rpms/{{slurm_all_packages[1]}} \
        /data/slurm_rpms/{{slurm_all_packages[2]}} \
        /data/slurm_rpms/{{slurm_all_packages[3]}} \
        /data/slurm_rpms/{{slurm_all_packages[4]}} \
        /data/slurm_rpms/{{slurm_all_packages[5]}} \
        /data/slurm_rpms/{{slurm_all_packages[6]}} \
        /data/slurm_rpms/{{slurm_all_packages[7]}} \
        /data/slurm_rpms/{{slurm_all_packages[8]}} \
        /data/slurm_rpms/{{slurm_all_packages[9]}} \
        /data/slurm_rpms/{{slurm_all_packages[10]}}
      # Needed in case you wish to rerun this playbook otherwise it'll error.
      ignore_errors: true

This is a problem because the path from which our other playbooks read the SLURM RPM files depends on how the cluster is configured: it changes with the values of variables in /etc/ansible/hosts. For example, the slurm role takes a download_path variable as the path to the RPM files. Its value is nfs_target_path if create_fss is true, cluster_nfs_path if cluster_nfs is true, and /tmp otherwise. So by default the slurm role looks in /tmp, but that can never work because the download path is hardcoded to /data/slurm_rpms.

- hosts: bastion,slurm_backup,compute,login
  gather_facts: true
  vars:
    destroy: false
    initial: true
    download_path: "{{ nfs_target_path if create_fss | bool else ( cluster_nfs_path if cluster_nfs|bool else '/tmp')  }}"
    enroot_top_path: "{{ nvme_path }}/enroot/"
  vars_files:
    - "/opt/oci-hpc/conf/queues.conf"
  tasks:
    - include_role:
        name: slurm
      when: slurm|default(true)|bool
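
To check which path the slurm role will end up using on a given cluster, the download_path expression can be evaluated on its own. This is only a minimal sketch: the values given for nfs_target_path and cluster_nfs_path are placeholders I made up, and the real values come from /etc/ansible/hosts.

```yaml
- hosts: localhost
  gather_facts: false
  vars:
    # Placeholder values for illustration only; the real ones live in /etc/ansible/hosts.
    create_fss: false
    cluster_nfs: false
    nfs_target_path: "/fss"
    cluster_nfs_path: "/nfs/cluster"
    # Same expression as in the play above.
    download_path: "{{ nfs_target_path if create_fss | bool else (cluster_nfs_path if cluster_nfs | bool else '/tmp') }}"
  tasks:
    - name: Show where the slurm role would look for the RPM files
      debug:
        var: download_path
```

With both flags false, as here, download_path resolves to /tmp, which is the case that breaks against the hardcoded /data/slurm_rpms.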

One way to solve this would be to set the default download path for the SLURM RPMs to /tmp and have all playbooks look for the RPMs in /tmp.
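
A rough sketch of what the download play could look like with that change. The variable names slurm_rpm_dir and slurm_rpm_base_url are hypothetical (neither exists in the repo today), the URL is a placeholder for the object storage prefix used in the current playbook, and the package list is cut down to two entries for brevity.

```yaml
- hosts: all
  become: true
  vars:
    slurm_version: "23.02.1-1"
    # Hypothetical variable: single place to change the RPM directory.
    slurm_rpm_dir: "/tmp"
    # Hypothetical variable: placeholder for the object storage URL prefix
    # that the current playbook hardcodes.
    slurm_rpm_base_url: "https://example.invalid/slurm"
    # Abbreviated; the current playbook lists eleven packages.
    slurm_all_packages:
      - "slurm-{{ slurm_version }}.el{{ ansible_distribution_major_version }}.x86_64.rpm"
      - "slurm-slurmd-{{ slurm_version }}.el{{ ansible_distribution_major_version }}.x86_64.rpm"
  tasks:
    - name: Download slurm .rpms
      get_url:
        url: "{{ slurm_rpm_base_url }}/{{ item }}"
        dest: "{{ slurm_rpm_dir }}/{{ item }}"
      with_items: "{{ slurm_all_packages }}"
      delegate_to: 127.0.0.1
      run_once: true
```

The install task would then reference slurm_rpm_dir instead of /data/slurm_rpms. One thing to verify before going this route: the download is delegated to 127.0.0.1 and /tmp is presumably local to each node, so the files may need to be copied to the other hosts or downloaded on each of them.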