aws-samples / 1click-hpc

Deploy your HPC Cluster on AWS in 20 min with just 1-Click.
MIT No Attribution

pcluster update-cluster is deleting slurmdbd service #29

Closed: rvencu closed this issue 2 years ago

rvencu commented 2 years ago

Updating the cluster configuration from the Cloud9 instance with pcluster update-cluster removes the slurmdbd service.

I am not sure; maybe there is a custom update procedure through the EnginFrame portal that should be used instead?

rvencu commented 2 years ago

OK, starting slurmdbd manually works, so the configuration is not broken. However, slurmdbd does not start automatically: no systemd service is defined for it, and maybe one should be.

I do not know whether slurmdbd not starting automatically is by design or something that should be corrected.

nicolaven commented 2 years ago

It should be fixed now. Thanks for pointing this out!

https://github.com/aws-samples/1click-hpc/commit/a46411b2306fbf34555cee6b7f8c4061876b6974

rvencu commented 2 years ago

Hi, for some reason the slurmdbd service does not start, whereas launching slurmdbd from the console works:

systemctl status slurmdbd
● slurmdbd.service - Slurm DB controller daemon
   Loaded: loaded (/etc/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Mon 2022-07-18 09:28:05 UTC; 33s ago
  Process: 25179 ExecStart=/opt/slurm/sbin/slurmdbd (code=exited, status=0/SUCCESS)
 Main PID: 25179 (code=exited, status=0/SUCCESS)
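One plausible reading of this output (an assumption, not confirmed in the thread): slurmdbd daemonizes by default, so the process systemd launched exits with status 0 almost immediately. With the default Type=simple, systemd treats that exit as the end of the service, marks it inactive (dead), and cleans up the processes left in the unit. A minimal sketch of a drop-in override that tells systemd to expect a forking daemon is below. The drop-in path is where `systemctl edit slurmdbd` places overrides; the PIDFile value is an assumption based on the slurmdbd.conf default PidFile and must match whatever the cluster's /opt/slurm/etc/slurmdbd.conf actually sets.

```ini
# /etc/systemd/system/slurmdbd.service.d/override.conf
# Hypothetical drop-in: keeps the existing ExecStart, but tells systemd that
# slurmdbd forks into the background and records its PID in a file.
[Service]
Type=forking
# Assumption: must match PidFile in /opt/slurm/etc/slurmdbd.conf
PIDFile=/var/run/slurmdbd.pid
```

After creating the file, running `systemctl daemon-reload` and then `systemctl restart slurmdbd` should apply it.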

rvencu commented 2 years ago

Yes, changing the service unit to the following seems to work:

[Unit]
Description=Slurm DB controller daemon
After=network.target munge.service slurmctld.service
ConditionPathExists=/opt/slurm/etc/slurmdbd.conf

[Service]
Type=oneshot
ExecStart=/opt/slurm/sbin/slurmdbd
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65536
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
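The Type=oneshot plus RemainAfterExit=yes unit above gets the daemon running, but once ExecStart returns, systemd has no main PID to track. As far as I understand systemd's semantics, that means the ExecReload line's $MAINPID would be empty, and the unit would keep reporting "active (exited)" even if slurmdbd later crashed. A sketch of an alternative, assuming this Slurm build supports slurmdbd's -D flag (run in the foreground), keeps slurmdbd itself as the unit's main process. The rest of the unit is copied from the version above.

```ini
[Unit]
Description=Slurm DB controller daemon
After=network.target munge.service slurmctld.service
ConditionPathExists=/opt/slurm/etc/slurmdbd.conf

[Service]
# Run slurmdbd in the foreground (-D) so systemd tracks it as the main process
Type=simple
ExecStart=/opt/slurm/sbin/slurmdbd -D
# $MAINPID now refers to the running slurmdbd, so a reload sends it SIGHUP
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
```

Compared with a Type=forking unit, foreground mode avoids depending on a PID file path that has to stay in sync with slurmdbd.conf.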