charmed-hpc / slurm-charms

Juju charms for automating the Day 0 to Day 2 operations of the Slurm workload manager ⚖️🐧
Apache License 2.0

pmix plugin will not initialize without libpmix-dev #29

Open jamesbeedy opened 1 week ago

jamesbeedy commented 1 week ago

Bug Description

The slurmctld and slurmd processes cannot load the pmix plugin. The charms provide Slurm built with pmix support but do not make the pmix libraries available at runtime, so plugin initialization fails.
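A quick way to confirm the missing-library condition on a unit is to ask the dynamic loader directly. The sketch below is only a diagnostic aid, not part of the charms: the helper name `can_load` is hypothetical, and the library names it probes are assumptions (on Debian/Ubuntu the unversioned `libpmix.so` name is normally shipped by the `-dev` package, which would be consistent with the title of this issue).

```python
import ctypes


def can_load(library_name: str) -> bool:
    """Return True if the dynamic loader can open library_name.

    Hypothetical helper: it only checks that dlopen() succeeds and
    says nothing about whether Slurm's plugin is compatible with it.
    """
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Raised when the loader cannot find or open the shared object.
        return False
    return True


if __name__ == "__main__":
    # Assumed names; adjust to whatever the plugin actually looks for.
    for name in ("libpmix.so", "libpmix.so.2"):
        status = "loadable" if can_load(name) else "NOT loadable"
        print(f"{name}: {status}")
```

If the unversioned name is reported as not loadable while a versioned one is, that would point at the runtime package being present without the `-dev` symlink.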

To Reproduce

juju bootstrap localhost
juju add-model slurm-test
tox -e build
juju deploy ./_build/slurmd.charm --constraints "virt-type=virtual-machine cores=4 mem=4G root-disk=20G"
juju deploy ./_build/slurmctld.charm --constraints "virt-type=virtual-machine cores=4 mem=4G root-disk=20G"
juju relate slurmctld slurmd

Environment

LXD provider, virtual machines

Relevant log output

$ sudo cat /var/log/slurm/slurmctld.log
[2024-10-08T06:10:43.540] error: Configured MailProg is invalid
[2024-10-08T06:10:43.541] slurmctld version 23.02.7 started on cluster osd-cluster
[2024-10-08T06:10:43.542] error:  mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:197: pmi/pmix: can not load PMIx library
[2024-10-08T06:10:43.542] error: Couldn't load specified plugin name for mpi/pmix_v4: Plugin init() callback failed
[2024-10-08T06:10:43.542] error: MPI: Cannot create context for mpi/pmix_v4
[2024-10-08T06:10:43.545] No memory enforcing mechanism configured.
[2024-10-08T06:10:43.548] error: read_slurm_conf: default partition not set.
[2024-10-08T06:10:43.548] error: Could not open node state file /var/spool/slurmctld/node_state: No such file or directory
[2024-10-08T06:10:43.548] error: NOTE: Trying backup state save file. Information may be lost!
[2024-10-08T06:10:43.548] No node state file (/var/spool/slurmctld/node_state.old) to recover
[2024-10-08T06:10:43.549] error: Could not open job state file /var/spool/slurmctld/job_state: No such file or directory
[2024-10-08T06:10:43.549] error: NOTE: Trying backup state save file. Jobs may be lost!
[2024-10-08T06:10:43.549] No job state file (/var/spool/slurmctld/job_state.old) to recover
[2024-10-08T06:10:43.549] select/cons_tres: select_p_node_init: select/cons_tres SelectTypeParameters not specified, using default value: CR_Core_Memory
[2024-10-08T06:10:43.549] error: Could not open reservation state file /var/spool/slurmctld/resv_state: No such file or directory
[2024-10-08T06:10:43.549] error: NOTE: Trying backup state save file. Reservations may be lost
[2024-10-08T06:10:43.549] No reservation state file (/var/spool/slurmctld/resv_state.old) to recover
[2024-10-08T06:10:43.549] error: Could not open trigger state file /var/spool/slurmctld/trigger_state: No such file or directory
[2024-10-08T06:10:43.549] error: NOTE: Trying backup state save file. Triggers may be lost!
[2024-10-08T06:10:43.549] No trigger state file (/var/spool/slurmctld/trigger_state.old) to recover
[2024-10-08T06:10:43.549] read_slurm_conf: backup_controller not specified
[2024-10-08T06:10:43.549] Reinitializing job accounting state
[2024-10-08T06:10:43.549] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-10-08T06:10:43.549] Running as primary controller
[2024-10-08T06:10:43.549] No parameter for mcs plugin, default values set
[2024-10-08T06:10:43.549] mcs: MCSParameters = (null). ondemand set.
[2024-10-08T06:10:44.543] Processing Reconfiguration Request
[2024-10-08T06:10:44.544] No memory enforcing mechanism configured.
[2024-10-08T06:10:44.544] error: read_slurm_conf: default partition not set.
[2024-10-08T06:10:44.544] restoring original state of nodes
[2024-10-08T06:10:44.545] error: Could not open job state file /var/spool/slurmctld/job_state: No such file or directory
[2024-10-08T06:10:44.545] error: NOTE: Trying backup state save file. Jobs may be lost!
[2024-10-08T06:10:44.545] select/cons_tres: select_p_node_init: select/cons_tres SelectTypeParameters not specified, using default value: CR_Core_Memory
[2024-10-08T06:10:44.545] read_slurm_conf: backup_controller not specified
[2024-10-08T06:10:44.545] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-10-08T06:10:44.545] No parameter for mcs plugin, default values set
[2024-10-08T06:10:44.545] mcs: MCSParameters = (null). ondemand set.
[2024-10-08T06:10:44.547] error:  mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:197: pmi/pmix: can not load PMIx library
[2024-10-08T06:10:44.547] error: Couldn't load specified plugin name for mpi/pmix_v4: Plugin init() callback failed
[2024-10-08T06:10:44.547] error: MPI: Cannot create context for mpi/pmix_v4
[2024-10-08T06:10:44.547] reconfigure_slurm: completed usec=3269
[2024-10-08T06:10:44.547] error: Could not open job state file /var/spool/slurmctld/job_state: No such file or directory
[2024-10-08T06:10:44.547] error: NOTE: Trying backup state save file. Jobs may be lost!
[2024-10-08T06:10:44.547] No job state file (/var/spool/slurmctld/job_state.old) found
[2024-10-08T06:10:46.554] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2024-10-08T06:11:15.401] Terminate signal (SIGINT or SIGTERM) received
[2024-10-08T06:11:15.493] Saving all slurm state
[2024-10-08T06:11:15.551] error: Configured MailProg is invalid
[2024-10-08T06:11:15.553] slurmctld version 23.02.7 started on cluster osd-cluster
[2024-10-08T06:11:15.554] error:  mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:197: pmi/pmix: can not load PMIx library
[2024-10-08T06:11:15.554] error: Couldn't load specified plugin name for mpi/pmix_v4: Plugin init() callback failed
[2024-10-08T06:11:15.554] error: MPI: Cannot create context for mpi/pmix_v4
[2024-10-08T06:11:15.557] No memory enforcing mechanism configured.
[2024-10-08T06:11:15.559] error: read_slurm_conf: default partition not set.
[2024-10-08T06:11:15.559] Recovered state of 0 nodes
[2024-10-08T06:11:15.560] Recovered information about 0 jobs
[2024-10-08T06:11:15.560] select/cons_tres: select_p_node_init: select/cons_tres SelectTypeParameters not specified, using default value: CR_Core_Memory
[2024-10-08T06:11:15.560] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:15.560] Recovered state of 0 reservations
[2024-10-08T06:11:15.560] read_slurm_conf: backup_controller not specified
[2024-10-08T06:11:15.560] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-10-08T06:11:15.560] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:15.560] Running as primary controller
[2024-10-08T06:11:15.560] No parameter for mcs plugin, default values set
[2024-10-08T06:11:15.560] mcs: MCSParameters = (null). ondemand set.
[2024-10-08T06:11:16.556] Processing Reconfiguration Request
[2024-10-08T06:11:16.557] No memory enforcing mechanism configured.
[2024-10-08T06:11:16.557] error: read_slurm_conf: default partition not set.
[2024-10-08T06:11:16.557] restoring original state of nodes
[2024-10-08T06:11:16.558] select/cons_tres: select_p_node_init: select/cons_tres SelectTypeParameters not specified, using default value: CR_Core_Memory
[2024-10-08T06:11:16.558] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:16.558] read_slurm_conf: backup_controller not specified
[2024-10-08T06:11:16.558] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-10-08T06:11:16.558] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:16.558] No parameter for mcs plugin, default values set
[2024-10-08T06:11:16.558] mcs: MCSParameters = (null). ondemand set.
[2024-10-08T06:11:16.558] error:  mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:197: pmi/pmix: can not load PMIx library
[2024-10-08T06:11:16.559] error: Couldn't load specified plugin name for mpi/pmix_v4: Plugin init() callback failed
[2024-10-08T06:11:16.559] error: MPI: Cannot create context for mpi/pmix_v4
[2024-10-08T06:11:16.559] reconfigure_slurm: completed usec=2800
[2024-10-08T06:11:16.611] Terminate signal (SIGINT or SIGTERM) received
[2024-10-08T06:11:16.662] Saving all slurm state
[2024-10-08T06:11:16.712] error: Configured MailProg is invalid
[2024-10-08T06:11:16.713] slurmctld version 23.02.7 started on cluster osd-cluster
[2024-10-08T06:11:16.715] error:  mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:197: pmi/pmix: can not load PMIx library
[2024-10-08T06:11:16.715] error: Couldn't load specified plugin name for mpi/pmix_v4: Plugin init() callback failed
[2024-10-08T06:11:16.715] error: MPI: Cannot create context for mpi/pmix_v4
[2024-10-08T06:11:16.717] No memory enforcing mechanism configured.
[2024-10-08T06:11:16.720] error: read_slurm_conf: default partition not set.
[2024-10-08T06:11:16.720] Recovered state of 1 nodes
[2024-10-08T06:11:16.720] Down nodes: juju-1cc933-0
[2024-10-08T06:11:16.720] Recovered information about 0 jobs
[2024-10-08T06:11:16.720] select/cons_tres: select_p_node_init: select/cons_tres SelectTypeParameters not specified, using default value: CR_Core_Memory
[2024-10-08T06:11:16.721] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:16.721] Recovered state of 0 reservations
[2024-10-08T06:11:16.721] read_slurm_conf: backup_controller not specified
[2024-10-08T06:11:16.721] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-10-08T06:11:16.721] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:16.721] Running as primary controller
[2024-10-08T06:11:16.721] No parameter for mcs plugin, default values set
[2024-10-08T06:11:16.721] mcs: MCSParameters = (null). ondemand set.
[2024-10-08T06:11:17.716] Processing Reconfiguration Request
[2024-10-08T06:11:17.717] No memory enforcing mechanism configured.
[2024-10-08T06:11:17.717] error: read_slurm_conf: default partition not set.
[2024-10-08T06:11:17.717] restoring original state of nodes
[2024-10-08T06:11:17.718] select/cons_tres: select_p_node_init: select/cons_tres SelectTypeParameters not specified, using default value: CR_Core_Memory
[2024-10-08T06:11:17.718] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:17.718] read_slurm_conf: backup_controller not specified
[2024-10-08T06:11:17.718] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-10-08T06:11:17.718] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:17.718] No parameter for mcs plugin, default values set
[2024-10-08T06:11:17.718] mcs: MCSParameters = (null). ondemand set.
[2024-10-08T06:11:17.719] error:  mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:197: pmi/pmix: can not load PMIx library
[2024-10-08T06:11:17.719] error: Couldn't load specified plugin name for mpi/pmix_v4: Plugin init() callback failed
[2024-10-08T06:11:17.719] error: MPI: Cannot create context for mpi/pmix_v4
[2024-10-08T06:11:17.720] reconfigure_slurm: completed usec=3315
[2024-10-08T06:11:18.184] Terminate signal (SIGINT or SIGTERM) received
[2024-10-08T06:11:18.222] Saving all slurm state
[2024-10-08T06:11:18.275] error: Configured MailProg is invalid
[2024-10-08T06:11:18.276] slurmctld version 23.02.7 started on cluster osd-cluster
[2024-10-08T06:11:18.277] error:  mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:197: pmi/pmix: can not load PMIx library
[2024-10-08T06:11:18.277] error: Couldn't load specified plugin name for mpi/pmix_v4: Plugin init() callback failed
[2024-10-08T06:11:18.277] error: MPI: Cannot create context for mpi/pmix_v4
[2024-10-08T06:11:18.280] No memory enforcing mechanism configured.
[2024-10-08T06:11:18.282] error: read_slurm_conf: default partition not set.
[2024-10-08T06:11:18.282] Recovered state of 1 nodes
[2024-10-08T06:11:18.282] Down nodes: juju-1cc933-0
[2024-10-08T06:11:18.283] Recovered information about 0 jobs
[2024-10-08T06:11:18.283] select/cons_tres: select_p_node_init: select/cons_tres SelectTypeParameters not specified, using default value: CR_Core_Memory
[2024-10-08T06:11:18.283] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:18.283] Recovered state of 0 reservations
[2024-10-08T06:11:18.283] read_slurm_conf: backup_controller not specified
[2024-10-08T06:11:18.283] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-10-08T06:11:18.283] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:18.283] Running as primary controller
[2024-10-08T06:11:18.283] No parameter for mcs plugin, default values set
[2024-10-08T06:11:18.283] mcs: MCSParameters = (null). ondemand set.
[2024-10-08T06:11:19.279] Processing Reconfiguration Request
[2024-10-08T06:11:19.279] No memory enforcing mechanism configured.
[2024-10-08T06:11:19.280] error: read_slurm_conf: default partition not set.
[2024-10-08T06:11:19.280] restoring original state of nodes
[2024-10-08T06:11:19.280] select/cons_tres: select_p_node_init: select/cons_tres SelectTypeParameters not specified, using default value: CR_Core_Memory
[2024-10-08T06:11:19.280] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:19.280] read_slurm_conf: backup_controller not specified
[2024-10-08T06:11:19.280] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-10-08T06:11:19.280] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:19.280] No parameter for mcs plugin, default values set
[2024-10-08T06:11:19.280] mcs: MCSParameters = (null). ondemand set.
[2024-10-08T06:11:19.282] error:  mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:197: pmi/pmix: can not load PMIx library
[2024-10-08T06:11:19.282] error: Couldn't load specified plugin name for mpi/pmix_v4: Plugin init() callback failed
[2024-10-08T06:11:19.282] error: MPI: Cannot create context for mpi/pmix_v4
[2024-10-08T06:11:19.282] reconfigure_slurm: completed usec=3260
[2024-10-08T06:11:19.760] Terminate signal (SIGINT or SIGTERM) received
[2024-10-08T06:11:19.786] Saving all slurm state
[2024-10-08T06:11:19.839] error: Configured MailProg is invalid
[2024-10-08T06:11:19.840] slurmctld version 23.02.7 started on cluster osd-cluster
[2024-10-08T06:11:19.841] error:  mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:197: pmi/pmix: can not load PMIx library
[2024-10-08T06:11:19.841] error: Couldn't load specified plugin name for mpi/pmix_v4: Plugin init() callback failed
[2024-10-08T06:11:19.841] error: MPI: Cannot create context for mpi/pmix_v4
[2024-10-08T06:11:19.844] No memory enforcing mechanism configured.
[2024-10-08T06:11:19.854] error: read_slurm_conf: default partition not set.
[2024-10-08T06:11:19.854] Recovered state of 1 nodes
[2024-10-08T06:11:19.854] Down nodes: juju-1cc933-0
[2024-10-08T06:11:19.855] Recovered information about 0 jobs
[2024-10-08T06:11:19.855] select/cons_tres: select_p_node_init: select/cons_tres SelectTypeParameters not specified, using default value: CR_Core_Memory
[2024-10-08T06:11:19.855] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:19.855] Recovered state of 0 reservations
[2024-10-08T06:11:19.856] read_slurm_conf: backup_controller not specified
[2024-10-08T06:11:19.856] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-10-08T06:11:19.856] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:19.856] Running as primary controller
[2024-10-08T06:11:19.856] No parameter for mcs plugin, default values set
[2024-10-08T06:11:19.856] mcs: MCSParameters = (null). ondemand set.
[2024-10-08T06:11:20.843] Processing Reconfiguration Request
[2024-10-08T06:11:20.844] No memory enforcing mechanism configured.
[2024-10-08T06:11:20.844] error: read_slurm_conf: default partition not set.
[2024-10-08T06:11:20.844] restoring original state of nodes
[2024-10-08T06:11:20.845] select/cons_tres: select_p_node_init: select/cons_tres SelectTypeParameters not specified, using default value: CR_Core_Memory
[2024-10-08T06:11:20.845] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:20.846] read_slurm_conf: backup_controller not specified
[2024-10-08T06:11:20.846] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-10-08T06:11:20.846] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:20.846] No parameter for mcs plugin, default values set
[2024-10-08T06:11:20.846] mcs: MCSParameters = (null). ondemand set.
[2024-10-08T06:11:20.847] error:  mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:197: pmi/pmix: can not load PMIx library
[2024-10-08T06:11:20.847] error: Couldn't load specified plugin name for mpi/pmix_v4: Plugin init() callback failed
[2024-10-08T06:11:20.847] error: MPI: Cannot create context for mpi/pmix_v4
[2024-10-08T06:11:20.848] reconfigure_slurm: completed usec=4984
[2024-10-08T06:11:20.893] Terminate signal (SIGINT or SIGTERM) received
[2024-10-08T06:11:20.957] Saving all slurm state
[2024-10-08T06:11:21.016] error: Configured MailProg is invalid
[2024-10-08T06:11:21.017] slurmctld version 23.02.7 started on cluster osd-cluster
[2024-10-08T06:11:21.019] error:  mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:197: pmi/pmix: can not load PMIx library
[2024-10-08T06:11:21.019] error: Couldn't load specified plugin name for mpi/pmix_v4: Plugin init() callback failed
[2024-10-08T06:11:21.019] error: MPI: Cannot create context for mpi/pmix_v4
[2024-10-08T06:11:21.021] No memory enforcing mechanism configured.
[2024-10-08T06:11:21.024] error: read_slurm_conf: default partition not set.
[2024-10-08T06:11:21.024] Recovered state of 1 nodes
[2024-10-08T06:11:21.024] Down nodes: juju-1cc933-0
[2024-10-08T06:11:21.024] Recovered information about 0 jobs
[2024-10-08T06:11:21.024] select/cons_tres: select_p_node_init: select/cons_tres SelectTypeParameters not specified, using default value: CR_Core_Memory
[2024-10-08T06:11:21.024] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:21.024] Recovered state of 0 reservations
[2024-10-08T06:11:21.024] read_slurm_conf: backup_controller not specified
[2024-10-08T06:11:21.024] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-10-08T06:11:21.024] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:21.025] Running as primary controller
[2024-10-08T06:11:21.025] No parameter for mcs plugin, default values set
[2024-10-08T06:11:21.025] mcs: MCSParameters = (null). ondemand set.
[2024-10-08T06:11:22.019] Processing Reconfiguration Request
[2024-10-08T06:11:22.020] No memory enforcing mechanism configured.
[2024-10-08T06:11:22.020] error: read_slurm_conf: default partition not set.
[2024-10-08T06:11:22.020] restoring original state of nodes
[2024-10-08T06:11:22.020] select/cons_tres: select_p_node_init: select/cons_tres SelectTypeParameters not specified, using default value: CR_Core_Memory
[2024-10-08T06:11:22.020] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:22.020] read_slurm_conf: backup_controller not specified
[2024-10-08T06:11:22.020] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-10-08T06:11:22.020] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:22.020] No parameter for mcs plugin, default values set
[2024-10-08T06:11:22.020] mcs: MCSParameters = (null). ondemand set.
[2024-10-08T06:11:22.021] error:  mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:197: pmi/pmix: can not load PMIx library
[2024-10-08T06:11:22.021] error: Couldn't load specified plugin name for mpi/pmix_v4: Plugin init() callback failed
[2024-10-08T06:11:22.021] error: MPI: Cannot create context for mpi/pmix_v4
[2024-10-08T06:11:22.022] reconfigure_slurm: completed usec=2659
[2024-10-08T06:11:24.029] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2

Additional context

No response

NucciTheBoss commented 1 week ago

Hmm... I thought that we were installing libpmix-dev as part of the common packages, no? https://github.com/charmed-hpc/slurm-charms/commit/232d4c0507cc22dbf980057bccfc7fbb30b0a522. I remember seeing this commit go through while I was at OpenInfra Asia.

Either way, we can upstream ensuring that the common packages are installed on slurmctld and slurmd nodes into slurm_ops. Would just be something like the following:

if self._service_name in ["slurmctld", "slurmd"]:
    apt.add_package(["libpmix-dev", "openmpi-bin"])

jamesbeedy commented 1 week ago

yeah ...

slurmd: [openmpi-bin, libpmix-dev]

slurmctld: [mailutils, libpmix-dev]
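The per-service split listed above could be expressed as a small lookup table instead of a single shared list. This is a minimal sketch under stated assumptions: `EXTRA_PACKAGES` and `packages_for` are hypothetical names, not identifiers from slurm_ops, and the package lists are copied from this thread.

```python
from typing import Dict, List

# Hypothetical mapping, taken from the package lists in this thread.
EXTRA_PACKAGES: Dict[str, List[str]] = {
    "slurmd": ["openmpi-bin", "libpmix-dev"],
    "slurmctld": ["mailutils", "libpmix-dev"],
}


def packages_for(service_name: str) -> List[str]:
    """Return the extra packages for a service, or an empty list.

    A copy is returned so callers cannot mutate the shared table.
    """
    return list(EXTRA_PACKAGES.get(service_name, []))


if __name__ == "__main__":
    for service in ("slurmctld", "slurmd", "slurmdbd"):
        print(service, packages_for(service))
```

The returned list could then be handed to `apt.add_package(...)` as in the snippet earlier in the thread; services without an entry simply get nothing extra installed.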

NucciTheBoss commented 1 week ago

Btw @jamesbeedy which branch are you working off of here? Is this main or experimental? Either way I'll ensure that slurm_ops installs the correct packages.