Describe the bug
Intermittently, compute nodes fail to enter service and are marked DOWN because the `slurmd` service fails to start.
`systemctl status slurmd` shows that the check `ConditionPathExists={{SlurmEtcDir}}/slurm.conf` failed.
Starting the service manually succeeds, but after a reboot it fails again for the same reason.
Commenting out this line allows the service to start.
It appears that systemd evaluates the path condition before the file system is mounted, even though
the service is ordered after `remote-fs.target`.
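A possible workaround, assuming `slurm.conf` lives on a network mount, is a drop-in that makes systemd wait for the specific mount backing the config directory rather than relying on `remote-fs.target` ordering alone. This is a sketch, not a confirmed fix; the mount point `/shared/slurm` below is a hypothetical placeholder and should be replaced with the actual mount backing `{{SlurmEtcDir}}`:

```ini
# Hypothetical drop-in: /etc/systemd/system/slurmd.service.d/wait-for-config.conf
# RequiresMountsFor adds Requires= and After= dependencies on the mount unit
# for the given path, so slurmd only starts once that mount is active.
[Unit]
RequiresMountsFor=/shared/slurm
```

After adding the drop-in, run `systemctl daemon-reload` for it to take effect. Note that `ConditionPathExists=` is only evaluated when the unit actually starts, so if the condition fails, the mount was not yet active at start time; tying the unit to the mount itself avoids depending on `remote-fs.target` having collected that mount.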
To Reproduce
Run a couple of hundred jobs. About 1-2% fail.
Expected behavior
New nodes reliably enter service and run jobs.