galaxyproject / ansible-slurm

Ansible role for installing and managing the Slurm Workload Manager

Need help creating a minimal non-trivial playbook #33

Open marinegor opened 1 year ago

marinegor commented 1 year ago

Hi everyone, I'm struggling to create a playbook that installs both control and execution nodes. Whatever I do, I end up with each node forming its own single-node cluster, with no interconnectivity between them.

Minimal playbook:

- name: install SLURM cluster
  hosts: vm0
  roles:
    - role: galaxyproject.slurm
      become: True
  vars:
    slurm_roles: ['exec', 'dbd', 'controller']
    slurm_munge_key: munge.key

- name: SLURM execution hosts
  roles:
    - role: galaxyproject.slurm
      become: True
  hosts: vm1, vm2
  vars:
    slurm_munge_key: munge.key
    slurm_roles: ['exec']
    slurm_nodes:
      - name: "vm[1-2]"
        CoresPerSocket: 1
    slurm_partitions:
      - name: compute
        Default: YES
        MaxTime: UNLIMITED
        Nodes: "vm[1-2]"

and the output is:

~/github/slurm_local
❯ ansible -i local.yml all -a 'sinfo'
vm1 | CHANGED | rc=0 >>
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1   idle localhost
vm2 | CHANGED | rc=0 >>
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1   idle localhost
vm0 | CHANGED | rc=0 >>
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1   idle localhost

which is not what I intended.

Could anyone help me draft a correct playbook for this case?

jp-um commented 1 year ago

Did you manage to solve this (and if so, how)? I'm in the same boat.

marinegor commented 1 year ago

Nope, couldn't do it.

mark-gerarts commented 4 months ago

This is a late response and probably no longer relevant for you, but maybe it helps someone in the future.

A working config where the controller node is also an executor:

- name: Controller
  hosts: vm01
  vars:
    slurm_roles: ["controller"]
  roles:
    - role: galaxyproject.slurm
      become: True

- name: Nodes
  hosts: vm01,vm02
  roles:
    - role: galaxyproject.slurm
      become: True
  vars:
    slurm_roles: ["exec"]
    slurm_config:
      SelectType: select/cons_tres
      SlurmctldHost: vm01
      SlurmdLogFile: /var/log/slurm/slurmd.log
      SlurmctldLogFile: /var/log/slurm/slurmctld.log
    slurm_nodes:
      - name: vm01
        CPUs: 16
        Boards: 1
        SocketsPerBoard: 4
        CoresPerSocket: 4
        ThreadsPerCore: 1
        RealMemory: 128740
        State: UNKNOWN
      - name: vm02
        CPUs: 48
        Boards: 1
        SocketsPerBoard: 2
        CoresPerSocket: 12
        ThreadsPerCore: 2
        RealMemory: 257324
        State: UNKNOWN
    slurm_partitions:
      - name: debug
        Default: YES
        MaxTime: UNLIMITED
        Nodes: ALL
        OverSubscribe: YES
        DefMemPerCPU: 1024
        SelectTypeParameters: CR_Core_Memory
    slurm_create_user: true
    slurm_user:
      comment: "Slurm Workload Manager"
      gid: 888
      group: slurm
      home: "/var/lib/slurm"
      name: slurm
      shell: "/usr/sbin/nologin"
      uid: 888
    # Manually created key
    slurm_munge_key: "munge.key"
marinegor commented 4 months ago

@mark-gerarts thanks, that's amazing!