aws-samples / aws-eda-slurm-cluster

AWS Slurm Cluster for EDA Workloads
MIT No Attribution

[BUG] slurmd fails on new compute nodes because the ConditionPathExists condition in the service fails #133

Closed (cartalla closed 1 year ago)

cartalla commented 1 year ago

Describe the bug Compute nodes intermittently fail to enter service and go into the down state because the slurmd service fails to start. systemctl status slurmd shows that the condition ConditionPathExists={{SlurmEtcDir}}/slurm.conf failed. Starting the service manually succeeds, but after a reboot it fails again for the same reason. Commenting out the ConditionPathExists line allows the service to start. It appears that systemd checks the path before the file system is mounted, even though the service depends on remote-fs.target.
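One way to test the mount-timing hypothesis above, or to work around it, would be to poll for the config file before slurmd is started. The following is a minimal Python sketch, not the fix adopted in this repository: wait_for_path is a hypothetical helper, and the timeout and polling interval are illustrative assumptions.

```python
import os
import time


def wait_for_path(path, timeout=60.0, interval=1.0):
    """Poll until `path` exists or `timeout` seconds elapse.

    Hypothetical helper for illustration only. Returns True if the path
    became visible in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        # The file only becomes visible once its file system is mounted.
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A pre-start step could call wait_for_path on the rendered {{SlurmEtcDir}}/slurm.conf path and only start slurmd once it returns True. If the wait regularly takes more than a moment on the failing nodes, that would support the idea that the condition is evaluated before the mount completes.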

To Reproduce Run a couple of hundred jobs. About 1-2% of them fail.

Expected behavior New nodes reliably enter service and run jobs.