ULHPC / puppet-slurm

A Puppet module designed to configure and manage SLURM (see https://slurm.schedmd.com/), an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.
Apache License 2.0

slurm-login nodes without daemons #7

Closed: uvNikita closed this issue 5 years ago

uvNikita commented 6 years ago

In our setup, we want to have login nodes that can initiate slurm commands (like srun or sbatch) but which are not part of the compute cluster. Therefore, all they need is slurm installed and configured, without any daemons.

If I understand correctly, it's currently not possible to achieve such a configuration with this module. For instance, if we add the slurm::slurmd class to the node specification, slurmd fails to start with the error fatal: Unable to determine this slurmd's NodeName (since the node is not described in slurm.conf), and simply including the slurm class doesn't create slurm.conf at all.
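
For illustration, the two options look roughly like this at the node-specification level (a hedged sketch; the hostnames are placeholders):

# Hypothetical node specifications illustrating the two failing options above.
node 'login01.example.org' {
  include ::slurm::slurmd   # slurmd fails to start: no NodeName entry for this host in slurm.conf
}

node 'login02.example.org' {
  include ::slurm           # the base class alone does not generate slurm.conf
}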

Please let me know if I missed something or if you have any thoughts on this issue.

uvNikita commented 6 years ago

As a workaround, it is possible to include the login node in the node list in slurm.conf while not adding it to any partition. With that, the slurmd service starts successfully.
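
For reference, in slurm.conf this amounts to something like the following (an illustrative extract only; node names and sizes are placeholders):

# The login node gets a NodeName entry but is not listed in any PartitionName line.
NodeName=node[01-10] CPUs=28 State=UNKNOWN
NodeName=login01 CPUs=4 State=UNKNOWN
PartitionName=batch Nodes=node[01-10] Default=YES MaxTime=INFINITE State=UP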

Falkor commented 6 years ago

Sorry for the delay, I just came back from SC'17, where I attended the Slurm user group meeting. Doing what you want should become easier with 17.11, as there will be a cleaner split between the packages (slurm, slurmd, slurmctld and slurmdbd), allowing packages to be assigned according to the role of each node. I will anyhow have to rework the module to support this specific version and the new packages -- I'll keep you updated once that is done.

uvNikita commented 6 years ago

Sounds great! I'll use the workaround in the meantime. It's fine with me whether you prefer to close this issue or leave it open until it is solved in version 17.11.

Falkor commented 6 years ago

This module now installs slurm 17.11.3-2 by default, with the above-mentioned new split of the packages. It should be fine, so I'm closing this issue. Do not hesitate to reopen it if needed.

uvNikita commented 6 years ago

Thanks for working on the module!

I still don't see a way to configure login nodes without running any daemons on them. If I understand correctly, such nodes need only the slurm::install and slurm::config classes, which are currently included only by the classes that also run a daemon (slurmd, slurmctld, slurmdbd).
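
Conceptually, what such a node would need is something along these lines (a sketch only, since the module does not expose a public entry point for this combination, which is exactly the point of this issue):

# Hypothetical login-node profile: slurm binaries plus slurm.conf,
# without including any of the daemon classes.
class profile::slurm::login {
  include ::slurm::install   # packages and binaries (srun, sbatch, squeue, ...)
  include ::slurm::config    # render slurm.conf
}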

Falkor commented 6 years ago

Hum, actually we are running redundant login nodes (we call them access* nodes) that just run the slurmd daemon.

Here is an extract of the way we have them configured at the hiera level, which uses the following hierarchy:

hierarchy:
  #______________________
  - name: "Per-node data"                   # Human-readable name.
    path: "nodes/%{trusted.certname}.yaml"  # File path, relative to datadir.
    #                               ^^^^^ IMPORTANT: include the file extension!
  #_________________________________________________
  - name: "Site/Datacenter/Domain/Zone Specific data"
    paths:
      - "domain/%{facts.domain}.yaml"
      - "site/%{facts.site}.yaml"
      - "zone/%{facts.zone}.yaml"
  #___________________________
  - name: "Role Specific data"
    path: "role/%{facts.role}.yaml"
  #_____________________________________________
  - name: "Sysadmins/DevOps/Research teams data"
    path: "team/%{facts.team}.yaml"
  #_________________________
  - name: "OS Version Specific data"        # Uses custom facts
    path: "osrelease/%{facts.os.family}-%{facts.os.release.major}.yaml"
  #_________________________
  - name: "OS Specific data"                # Uses custom facts
    path: "osfamily/%{facts.os.family}.yaml"
  #_____________________
  - name: "Common data"
    path: "common.yaml"

Then:

  1. Most SLURM parameters (slurm::*) are set at the site level, i.e. in site/<site>.yaml
  2. Access-node specific overrides are set at the role level, under role/access.yaml, as follows:
profiles:
- '::profile::access::<cluster>'
- '::profile::slurm::node'

slurm::manage_pam: true
slurm::service_manage: false

slurm::with_slurmdbd: false
slurm::with_slurmctld: false
slurm::with_slurmd: true
  3. To be complete, a slurm controller + DBD would be configured as follows:
# Profiles key may be used to include profile classes
profiles:
- '::profile::slurm'
- '::profile::slurm::slurmctld'

slurm::service_manage: false

slurm::with_slurmd: false
slurm::with_slurmctld: true
slurm::with_slurmdbd: true
  4. A backup controller would then be configured as follows:
# Profiles key may be used to include profile classes
profiles:
- '::profile::slurm'
- '::profile::slurm::slurmctld'

slurm::service_manage: false

slurm::with_slurmd: false
slurm::with_slurmctld: true
slurm::with_slurmdbd: false

uvNikita commented 6 years ago

Yes, we currently have a similar configuration, where the login nodes run the slurmd daemon.

But, as far as I understand, such nodes do not need to run the slurmd daemon; they only require slurm.conf and the slurm package with its binaries (srun, squeue, etc.) installed. For instance, when I stop the slurmd daemon on one of our login nodes with systemctl stop slurmd, I can still execute the sinfo and srun commands.

In addition, by disabling the slurmd daemon, we no longer need to specify these nodes in the slurm.conf file, since the cluster doesn't need to know anything about them.

Falkor commented 5 years ago

A new slurm::login class is in preparation...

Falkor commented 5 years ago

The slurm::login class is now tested in the new Vagrant setup, which brings a simplified example of profiles.
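
For a login node, usage should then boil down to something like the following (a hedged sketch based on the class name above; see the Vagrant setup for the actual example):

# Hypothetical profile for a login node using the new class:
# slurm binaries and slurm.conf, without enabling slurmd, slurmctld or slurmdbd.
class profile::slurm::login {
  include ::slurm::login
}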