centerforaisafety / cerberus-cluster

HPC cluster code and configurations for running on OCI
Universal Permissive License v1.0

Check slurm latency for /var/spool/slurmctld/ #115

Open steven-basart opened 1 year ago

steven-basart commented 1 year ago

I moved /var/spool/slurmctld/ onto NFS, but this increases Slurm latency and hurts responsiveness. We need a better solution, for example backing up and deleting the logs once a month.

andriy-safe-ai commented 1 year ago

@steven-basart Would you elaborate more on what you mean by Slurm latency and responsiveness? Also did you mean to say that you moved the logs to the FSS?

steven-basart commented 1 year ago

I saw this comment, which prompted me to create the issue.

Comment copied below as well for reference in case the page dies.

The file system for /var/spool/slurmctld/ should be mounted on the fastest possible disks (SSD or NVMe if possible).

steven-basart commented 1 year ago

I had moved them to the FSS and then moved the slurmd logs back off.
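To put a number on the latency concern, one option is to time small synced writes in the candidate directory and compare a local disk against the FSS mount. The following is only a rough sketch: the function names `fsync_latency_ms` and `summarize` are hypothetical, and small `fsync`'d writes are assumed to be a reasonable stand-in for how slurmctld saves state, which has not been verified here.

```python
import os
import statistics
import tempfile
import time


def fsync_latency_ms(directory, samples=50, size=4096):
    """Time `samples` small write+fsync cycles in `directory`, in milliseconds."""
    payload = b"\0" * size
    latencies = []
    # The probe file is created inside the directory under test and removed afterwards.
    fd, path = tempfile.mkstemp(dir=directory, prefix="latency_probe_")
    try:
        for _ in range(samples):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # force the data to stable storage (or the NFS server)
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies


def summarize(latencies):
    """Return median, approximate 95th percentile, and max of the samples."""
    ordered = sorted(latencies)
    return {
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[int(0.95 * (len(ordered) - 1))],
        "max_ms": ordered[-1],
    }
```

Running `summarize(fsync_latency_ms("/var/spool/slurmctld"))` once with the directory on local disk and once on the FSS would give a direct comparison. It needs write permission in that directory, so it would likely have to run as the Slurm user or root.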

andriy-safe-ai commented 1 year ago

Where did you move the slurmd logs?

steven-basart commented 1 year ago

Check the line in slurm.conf called SlurmctldLogFile (I believe).

As an aside, also look in the default directory. I made symlinks from the default location to the new location on the FSS so that the change would be transparent to, for instance, Loki.
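The symlink approach described above can be sketched roughly as follows. This is an illustration, not the commands that were actually run on the cluster: `relocate_with_symlink` is a hypothetical helper, and any paths passed to it are assumptions. The daemon writing the log would presumably need to be stopped, or told to reopen its log file, around the move.

```python
import shutil
from pathlib import Path


def relocate_with_symlink(default_path, new_dir):
    """Move a file to `new_dir` and leave a symlink at its default location."""
    default_path = Path(default_path)
    new_dir = Path(new_dir)
    target = new_dir / default_path.name

    # Already relocated: do nothing, so the helper is safe to run twice.
    if default_path.is_symlink():
        return default_path.resolve()

    new_dir.mkdir(parents=True, exist_ok=True)
    default_path.parent.mkdir(parents=True, exist_ok=True)

    if default_path.exists():
        shutil.move(str(default_path), str(target))
    else:
        target.touch()  # nothing to move yet; create an empty file to point at

    # Readers of the default path (for example a log shipper) follow the link.
    default_path.symlink_to(target)
    return target
```

Anything that tails the default path, such as a Loki agent, keeps reading from the same location while the data lives in the new directory.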