aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

3.11.1 slurmctld core dumps with error message: double free or corruption (!prev) #6529

Open gwolski opened 4 days ago

gwolski commented 4 days ago

Attempting to move to ParallelCluster 3.11.1. I have been using 3.9.1 (with its known bug since May) without this issue. The 3.9.1 setup is a custom Rocky 8.9 image with a ParallelCluster 3.9.1 overlay.

The new setup is a custom Rocky 8.10 AMI on which I have overlaid ParallelCluster 3.11.1 with pcluster build.

It deployed just fine and ran for a week with a small number of jobs submitted. I then started banging on it a bit harder, with hundreds of jobs per day. About every 24 hours, slurmctld dumps core, with this final error message in /var/log/messages:

Oct 30 11:04:30 ip-10-6-11-248 slurmctld[1374]: double free or corruption (!prev)
Oct 30 11:04:30 ip-10-6-11-248 systemd[1]: Created slice system-systemd\x2dcoredump.slice.
Oct 30 11:04:30 ip-10-6-11-248 systemd[1]: Started Process Core Dump (PID 2977/UID 0).
Oct 30 11:04:30 ip-10-6-11-248 systemd-coredump[2978]: Process 1374 (slurmctld) of user 401 dumped core.#012#012Stack trace of thread 2531:#012#0

I rebooted the machine, and 24 hours later:

Oct 31 10:48:04 ip-10-6-11-248 slurmctld[1383]: corrupted double-linked list
Oct 31 10:48:04 ip-10-6-11-248 systemd[1]: Created slice system-systemd\x2dcoredump.slice.
Oct 31 10:48:04 ip-10-6-11-248 systemd[1]: Started Process Core Dump (PID 728511/UID 0).
Oct 31 10:48:04 ip-10-6-11-248 systemd-coredump[728512]: Process 1383 (slurmctld) of user 401 dumped core.#012#012Stack trace of thread 728510:#012#0

The error messages right before this are of the form:

Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-21:6818) failed: Name or service not known
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-21"
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: unable to split forward hostlist
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: _thread_per_group_rpc: no ret_list given
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-25:6818) failed: Name or service not known
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-25"
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: unable to split forward hostlist
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: _thread_per_group_rpc: no ret_list given
[duplicate messages not included, just different node names]

Oct 30 11:04:30 ip-10-6-11-248 slurmctld[1374]: slurmctld: agent/is_node_resp: node:sp-m7a-l-dy-sp-8-gb-2-cores-4 RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication conne
[duplicate messages not included, just different node names]

Error messages in slurmctld.log from the same time as the second crash above:

[2024-10-31T10:47:04.008] error: unable to split forward hostlist
[2024-10-31T10:47:04.008] error: _thread_per_group_rpc: no ret_list given
[2024-10-31T10:47:05.134] error: slurm_receive_msg [10.6.2.57:50156]: Zero Bytes were transmitted or received
[2024-10-31T10:47:18.718] error: slurm_receive_msg [10.6.9.248:41134]: Zero Bytes were transmitted or received
[2024-10-31T10:47:20.862] error: slurm_receive_msg [10.6.14.229:53816]: Zero Bytes were transmitted or received
[2024-10-31T10:47:22.137] error: slurm_receive_msg [10.6.2.57:50996]: Zero Bytes were transmitted or received
[2024-10-31T10:48:04.000] cleanup_completing: JobId=4094 completion process took 134 seconds
[2024-10-31T10:48:04.000] error: Nodes sp-r7a-m-dy-sp-8-gb-1-cores-37 not responding, setting DOWN
[2024-10-31T10:48:04.003] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-10:6818) failed: Name or service not known
[2024-10-31T10:48:04.003] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-10"
[2024-10-31T10:48:04.005] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-11:6818) failed: Name or service not known
[2024-10-31T10:48:04.005] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-11"
[2024-10-31T10:48:04.007] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-33:6818) failed: Name or service not known
[2024-10-31T10:48:04.007] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-33"
[2024-10-31T10:48:04.009] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-36:6818) failed: Name or service not known
[2024-10-31T10:48:04.009] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-36"
[2024-10-31T10:48:04.010] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores
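For reference, forward resolution of these dynamic node names can be spot-checked from the head node with something like the commands below. The node name is just one taken from the log above; substitute a node that is currently running.

# Forward lookup via the system resolver (nsswitch), the same path slurmctld's getaddrinfo uses
getent hosts sp-r7a-m-dy-sp-8-gb-1-cores-21 || echo "forward lookup failed"
# slurmctld contacts slurmd on port 6818; if nc is installed, confirm the resolved address is reachable
nc -zv sp-r7a-m-dy-sp-8-gb-1-cores-21 6818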

Has anyone seen this? I'm going back to 3.10.1 and will attempt to deploy that version.

gwolski commented 1 day ago

I have not seen this issue again since the last restart (four days ago). There was something wrong with my reverse DNS lookup that I fixed before the restart. I don't want to believe that is it, but isn't it always DNS? Nonetheless, I am increasing my testing, and I've also modified the systemd service file for slurmctld so that it can restart itself on failure.

# /etc/systemd/system/slurmctld.service
[Unit]
Description=Slurm controller daemon
After=network-online.target munge.service remote-fs.target
Wants=network-online.target
ConditionPathExists=/opt/slurm/etc/slurm.conf
StartLimitIntervalSec=30
StartLimitBurst=2

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmctld
ExecStart=/opt/slurm/sbin/slurmctld -D -s $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=562930
LimitMEMLOCK=infinity
LimitSTACK=infinity
Restart=on-failure
RestartSec=10s

[Install]
WantedBy=multi-user.target

Four new lines were added. In the [Unit] section:

StartLimitIntervalSec=30
StartLimitBurst=2

and in the [Service] section:

Restart=on-failure
RestartSec=10s
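To pick up the change, something like the following should work (assuming the edited unit file is /etc/systemd/system/slurmctld.service as shown above):

# Reload unit definitions so systemd sees the edited service file, then restart slurmctld
sudo systemctl daemon-reload
sudo systemctl restart slurmctld
# Confirm the restart policy is active (systemd reports RestartSec as RestartUSec)
systemctl show slurmctld -p Restart -p RestartUSec -p StartLimitBurst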

Maybe this is something you might want to consider adding to the standard distribution?

himani2411 commented 19 hours ago

Hi @gwolski

From the previous error logs it does look like a networking issue. Are you still facing the issue after the DNS changes you made?

Thank you for the suggestion on self-restarting slurmctld, will track this in our backlog.

gwolski commented 16 hours ago

My DNS server did not have a reverse lookup zone, so slurmctld could not find my submission hosts. I don't know why this would be a requirement, but once I added my submission hosts to the reverse lookup zone, things quieted down. So I suspect this was the cause.
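For reference, the missing PTR record is easy to spot-check. The commands below are illustrative rather than an exact transcript, and the IP is one of the submission-host addresses from the "Zero Bytes were transmitted or received" log lines above:

# Reverse (PTR) lookup of a submission host; this returned nothing until the reverse zone was populated
dig +short -x 10.6.2.57
# Equivalent check through the system resolver
getent hosts 10.6.2.57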

I am now building a new 3.11.1 pcluster image with my latest custom AMI changes and will deploy in a new 3.11.1 cluster. Will run and then update this ticket.

I will file a separate feature request for the self-restarting slurmctld so it is tracked outside of this ticket.

demartinofra commented 5 hours ago

Hi @gwolski,

Thank you for reporting the extra details and keeping us posted about the progress.

In order for us to better investigate what has caused Slurm to crash, would you mind sharing the following if they are still available:

  1. The ParallelCluster cluster configuration file.
  2. The full slurmctld logs.
  3. The processed core dump file. Could you please run the following gdb command and share the output?
    gdb -ex 't a a bt' -batch /opt/slurm/sbin/slurmctld /var/spool/slurm.state/core.NNN > /tmp/core.out
  4. The full Slurm configuration. You can easily dump it with the command scontrol show config.
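For convenience, items 2 through 4 can be gathered on the head node with something like the following; the slurmctld log path is an assumption and core.NNN is a placeholder to adjust for your setup:

# 2. slurmctld log (path assumed; adjust if your log lives elsewhere)
cp /var/log/slurmctld.log /tmp/slurmctld.log
# 3. Backtrace from the core dump (replace core.NNN with the actual core file name)
gdb -ex 't a a bt' -batch /opt/slurm/sbin/slurmctld /var/spool/slurm.state/core.NNN > /tmp/core.out
# 4. Full Slurm configuration
scontrol show config > /tmp/slurm_config.out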

Thank you, Francesco