ULHPC / puppet-slurm

A Puppet module designed to configure and manage SLURM (see https://slurm.schedmd.com/), an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.
Apache License 2.0

slurm-17.11.3 install fails #12

Closed uvNikita closed 5 years ago

uvNikita commented 6 years ago

I just tried to update one of the nodes by manually removing the packages and re-running Puppet, as suggested here: https://github.com/ULHPC/puppet-slurm/issues/8#issuecomment-346362648.

The output from puppet:

Notice: /Stage[main]/Slurm::Install/Slurm::Install::Packages[17.11.3]/Package[slurm]/ensure: created
Notice: /Stage[main]/Slurm::Install/Slurm::Install::Packages[17.11.3]/Package[slurm-contribs]/ensure: created
Notice: /Stage[main]/Slurm::Install/Slurm::Install::Packages[17.11.3]/Package[slurm-devel]/ensure: created
Notice: /Stage[main]/Slurm::Install/Slurm::Install::Packages[17.11.3]/Package[slurm-pam_slurm]/ensure: created
Notice: /Stage[main]/Slurm::Install/Slurm::Install::Packages[17.11.3]/Package[slurm-perlapi]/ensure: created
Notice: /Stage[main]/Slurm::Install/Slurm::Install::Packages[17.11.3]/Package[slurm-slurmdbd]/ensure: created
Notice: /Stage[main]/Slurm::Slurmdbd/Service[slurmdbd]/ensure: ensure changed 'stopped' to 'running'
journalctl log for slurmctld:
-- No entries --

Error: /Stage[main]/Slurm::Slurmctld/Service[slurmctld]/ensure: change from stopped to running failed: Systemd start for slurmctld failed!
journalctl log for slurmctld:
-- No entries --

Service[slurmctld] failed because there is no slurmctld service on the system. I'm guessing that the module didn't install the slurm-slurmctld package, which is located in /root/rpmbuild/RPMS/x86_64.

It seems that $common_rpms_basename or $slurmdbd_rpms_basename should reflect the package changes introduced in the new version of Slurm.

Falkor commented 6 years ago

Hmm, then most probably you installed with the wrong slurm::with_slurm* parameters? New package names were adapted in b4e04c1c82 and 3b0f14698a82, for instance.

uvNikita commented 6 years ago

I didn't set with_slurm*, but I included the relevant classes: slurm::slurmctld and slurm::slurmd. If I understand correctly, that should be equivalent.

But I don't see where the new slurm-slurmctld and slurm-slurmd packages (which used to be part of the slurm package) are installed in these commits or anywhere else in the code.

jrbosch commented 6 years ago

Hello. I'm new to this module (great job) and I have the same problem as @uvNikita. In my case I used the with_slurmctld and with_slurmd parameters. Regards, Javier

jrbosch commented 6 years ago

Also, to be able to download the source, I had to change the default $src_checksum value for version 17.11.3. I think there were changes on Slurm's source site: the module seems to use version 17.11.3 with the 17.11.3.2 checksum. Regards, Javier.
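For readers hitting the same download failure, the override described above might look roughly like the following. This is only a sketch: it assumes the main slurm class accepts version and src_checksum parameters (the thread only mentions $src_checksum as a default in the module), and the checksum value is a placeholder to be replaced with the checksum of the tarball actually downloaded.

```puppet
# Sketch under assumptions: the parameter names on the slurm class are
# inferred from the $src_checksum default mentioned in this thread.
class { 'slurm':
  version      => '17.11.3',
  # Placeholder: compute this from the tarball you actually download.
  src_checksum => '<checksum-of-downloaded-tarball>',
}
```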

uvNikita commented 6 years ago

I confirm @jrbosch's comment about the checksum value.

As a temporary workaround, I had to add slurm-slurmctld and slurm-slurmd to common_rpms_basename:

https://github.com/uvNikita/puppet-slurm/blob/bc8b05671c80daf319f0d69455b862394c884401/manifests/params.pp#L444:#L445
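In outline, the workaround amounts to extending the list of common RPM basenames so that the daemon packages split out in the newer Slurm release are installed too. The excerpt below is a hypothetical illustration, not a copy of the linked change: the first five entries are taken from the Puppet output earlier in this thread, and whether they match the module's actual $common_rpms_basename list is an assumption.

```puppet
# Hypothetical excerpt of manifests/params.pp; the real list may differ.
$common_rpms_basename = [
  'slurm',
  'slurm-contribs',
  'slurm-devel',
  'slurm-pam_slurm',
  'slurm-perlapi',
  # Workaround: these daemons used to be part of the slurm package
  # and now ship as separate RPMs.
  'slurm-slurmctld',
  'slurm-slurmd',
]
```

Putting both daemons in the common list installs them on every node regardless of its role, which is why this is better treated as a stopgap than as a proper fix.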

Falkor commented 6 years ago

@uvNikita thanks for the temporary workaround. I'm not able to address these issues now as I have concurrent urgent milestones to tackle, but we are planning to update Slurm to at least 17.11.5 at the end of April on our cluster, so in the worst case I'll update the module at that time. Thanks all for using this module and reporting issues, and sorry for the inconvenience.

bluemage650 commented 6 years ago

Is there any traction on this? The production branch seems to not like any of the current releases as of now.

Falkor commented 6 years ago

You can try the devel branch. I'll make a new release once I'm back from holidays.

Falkor commented 5 years ago

The module has been reworked and tested against 17.11.12 (see params.pp#L398). Successfully tested on our side and within the new Vagrant configuration (see #17). I guess I can close this issue; a new release 1.2 will come.