NearNodeFlash / NearNodeFlash.github.io


LVMLOCKD deadlocks caused workflow to stall at PreRun #141

Open ajfloeder opened 3 months ago

ajfloeder commented 3 months ago

Scenario:

  1. Flux create XFS workflow
  2. walk that workflow through the states to Teardown
  3. delete the workflow
  4. goto 1.

Flux command:

flux run -vvv --wait-event=clean -q pdebug --nodes=1 --ntasks=1 --setattr=system.dw="#DW jobdw capacity=600GiB type=xfs name=jobxfs4" hostname
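
For reference, the loop driving these iterations looks roughly like the sketch below. This is not the actual test harness (only the flux command and the per-iteration log line are quoted in this issue); the iteration count and per-iteration job naming are inferred from the test output shown later.

#!/bin/bash
# Hypothetical reproduction loop for steps 1-4 above; each flux run creates the
# workflow, waits for it to run and clean up (--wait-event=clean), then repeats.
for i in $(seq 1 100000); do
    echo "$(date): Iteration # ${i}/100000"
    flux run -vvv --wait-event=clean -q pdebug --nodes=1 --ntasks=1 \
        --setattr=system.dw="#DW jobdw capacity=600GiB type=xfs name=jobxfs${i}" \
        hostname || break
done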

On the 5th execution of the test, the workflow stalled trying to complete the PreRun state where the filesystem is mounted on the compute node.

That iteration started: Thu Mar 7 08:30:15 PST 2024: Iteration # 5/100000

The last command attempted by the Rabbit was:

2024-03-07T08:30:29.073-0800    DEBUG   controllers.NnfNodeStorage  Command Run {"NnfNodeStorage": {"name":"default-fluxjob-796660387691889664-0-xfs-0","namespace":"rabbit-name"}, "index": 0, "command": "vgchange --lock-start c6087d61-c97c-46fa-a56b-296b700a080f_0_0"}

In the console log on the rabbit it looks like lvmlockd is blocked:

2024-03-07 08:32:52 [71513.215982]       Tainted: P        W  OE  X --------- -  - 4.18.0-513.11.1.1toss.t4.x86_64 #1
2024-03-07 08:32:52 [71513.221600] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2024-03-07 08:32:52 [71513.226798] task:lvmlockd        state:D stack:0     pid:595555 ppid:1      flags:0x00004080
2024-03-07 08:32:52 [71513.232493] Call Trace:
2024-03-07 08:32:52 [71513.234438]  __schedule+0x3e3/0x8d0
2024-03-07 08:32:52 [71513.236675]  ? wait_for_completion+0x7e/0x130
2024-03-07 08:32:52 [71513.239413]  schedule+0x72/0x120
2024-03-07 08:32:52 [71513.241537]  schedule_timeout+0x296/0x340
2024-03-07 08:32:52 [71513.244339]  ? srso_alias_return_thunk+0x5/0xfcdf2
2024-03-07 08:32:52 [71513.247438]  ? printk+0x58/0x73
2024-03-07 08:32:52 [71513.249333]  ? srso_alias_return_thunk+0x5/0xfcdf2
2024-03-07 08:32:52 [71513.252331]  ? wait_for_completion+0x7e/0x130
2024-03-07 08:32:52 [71513.255197]  wait_for_completion+0xc0/0x130
2024-03-07 08:32:52 [71513.259695]  new_lockspace+0x9fa/0xad0 [dlm]
2024-03-07 08:32:52 [71513.263377]  ? finish_wait+0x90/0x90
2024-03-07 08:32:52 [71513.266194]  dlm_new_lockspace+0x9f/0x130 [dlm]
2024-03-07 08:32:52 [71513.269957]  device_write+0x3d7/0x740 [dlm]
2024-03-07 08:32:52 [71513.273510]  ? srso_alias_return_thunk+0x5/0xfcdf2
2024-03-07 08:32:52 [71513.277662]  vfs_write+0xb7/0x1c0
2024-03-07 08:32:52 [71513.279783]  ksys_write+0x4f/0xb0
2024-03-07 08:32:52 [71513.282400]  do_syscall_64+0x5b/0x1b0
2024-03-07 08:32:52 [71513.285200]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
2024-03-07 08:32:52 [71513.289315] RIP: 0033:0x7ffff69b3a17
2024-03-07 08:32:52 [71513.292033] Code: Unable to access opcode bytes at RIP 0x7ffff69b39ed.
2024-03-07 08:32:52 [71513.296480] RSP: 002b:00007ffff1f9ea30 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
2024-03-07 08:32:52 [71513.302368] RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 00007ffff69b3a17
2024-03-07 08:32:52 [71513.307482] RDX: 0000000000000082 RSI: 00007ffff1fa1b20 RDI: 0000000000000009
2024-03-07 08:32:52 [71513.313109] RBP: 00007ffff1fa1b20 R08: 0000000000000000 R09: 00007ffff6410fc0
2024-03-07 08:32:52 [71513.318569] R10: 00007fffec01b9e0 R11: 0000000000000293 R12: 0000000000000082
2024-03-07 08:32:52 [71513.323250] R13: 000000000000000a R14: 0000000000000006 R15: 00007fffec01b9d0

Refer to directory /usr/workspace/rabbits/hangs/start_time_2024-03-07_08-26-04 for the log files.

roehrich-hpe commented 3 months ago

That flux command line does not have --requires, so it's going to a different rabbit on each iteration, right? Or is it coming back to the same rabbit, maybe sometimes or all the time?

jameshcorbett commented 3 months ago

It depends on the scheduler policy, but I'm guessing you're using the "first" match policy, so if that's the only job in the system, it should be going to the same compute node and rabbit every time.

mcfadden8 commented 3 months ago

I'm not sure where the flux command shown above came from. The test.log file shows the following command:

Thu Mar  7 08:30:15 PST 2024:     flux run -vvv --wait-event=clean --requires host:tioga37 -q pdebug --nodes=1 --ntasks=1 --setattr=system.dw="#DW jobdw capacity=600GiB type=xfs name=jobxfs5" hostname

ajfloeder commented 3 months ago

I omitted the --requires to avoid exposing the compute node's name, but I guess that has just happened above. Sorry for the confusion.

ajfloeder commented 3 months ago

A new instance of this deadlock occurred on the compute side, where the vgchange --lock-start operation hangs. On the compute node, this operation is issued by nnf-clientmountd and is not killed if it takes too long.

From the lvmlockd(8) man page:

  A shared VG can be started after all the following are true:
  · lvmlockd is running
  · the lock manager is running
  · the VG's devices are visible on the system

The last condition, "the VG's devices are visible on the system", is the key one here.
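
A start sequence that respects those prerequisites might check each one explicitly before attempting the lock-start. This is only an illustrative sketch of what the man page implies, not code from this repository; the dlm.service unit name and the use of vgs --readonly (which reads on-disk metadata without taking lvmlockd locks) are assumptions.

#!/bin/bash
# Sketch: check the three lvmlockd(8) prerequisites before starting a shared VG.
VG="c6087d61-c97c-46fa-a56b-296b700a080f_0_0"   # VG name taken from the Rabbit log above

pidof lvmlockd >/dev/null || { echo "lvmlockd is not running" >&2; exit 1; }
systemctl is-active --quiet dlm || { echo "dlm lock manager is not running" >&2; exit 1; }
vgs --readonly --noheadings "$VG" >/dev/null 2>&1 || { echo "VG devices are not visible yet" >&2; exit 1; }

vgchange --lock-start "$VG"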

In the latest instance of the deadlock on the compute node, we see that clientmountd attempts the vgchange --lock-start command before the VG's devices have been discovered. This directly violates the last requirement from lvmlockd(8).

Mar 12 19:25:16 tioga39 clientmountd[413541]: 2024-03-12T19:25:16-07:00 DEBUG controllers.ClientMount Command Run {"ClientMount": {"name":"default-fluxjob-804506591016518656-0-computes","namespace":"tioga39"}, "index": 0, "command": "vgchange --lock-start a29de411-35dd-46d6-97fa-00534dc9ab3b_0_0"}
Mar 12 19:25:16 tioga39 kernel: dlm: Using TCP for communications
Mar 12 19:25:16 tioga39 kernel: dlm: lvm_global: joining the lockspace group...
Mar 12 19:25:16 tioga39 kernel: dlm: lvm_global: dlm_recover 1
Mar 12 19:25:16 tioga39 kernel: dlm: lvm_global: group event done 0 0
Mar 12 19:25:16 tioga39 kernel: dlm: lvm_global: add member 102
Mar 12 19:25:16 tioga39 kernel: dlm: connecting to 102
Mar 12 19:25:16 tioga39 kernel: dlm: got connection from 31
Mar 12 19:25:16 tioga39 kernel: dlm: lvm_global: add member 41
Mar 12 19:25:16 tioga39 kernel: dlm: connecting to 41
Mar 12 19:25:16 tioga39 kernel: dlm: got connection from 28
Mar 12 19:25:16 tioga39 kernel: dlm: lvm_global: add member 40
Mar 12 19:25:16 tioga39 kernel: dlm: connecting to 40
Mar 12 19:25:16 tioga39 kernel: dlm: got connection from 34
Mar 12 19:25:16 tioga39 kernel: dlm: lvm_global: add member 39
Mar 12 19:25:16 tioga39 kernel: dlm: lvm_global: add member 38
Mar 12 19:25:16 tioga39 kernel: dlm: connecting to 38
Mar 12 19:25:16 tioga39 kernel: dlm: got connection from 35
Mar 12 19:25:16 tioga39 kernel: dlm: lvm_global: add member 36
Mar 12 19:25:16 tioga39 kernel: dlm: connecting to 36
Mar 12 19:25:16 tioga39 kernel: dlm: got connection from 40
Mar 12 19:25:16 tioga39 kernel: dlm: lvm_global: add member 35
Mar 12 19:25:16 tioga39 kernel: dlm: got connection from 27
Mar 12 19:25:16 tioga39 kernel: dlm: lvm_global: add member 34
Mar 12 19:25:16 tioga39 kernel: dlm: got connection from 29
Mar 12 19:25:16 tioga39 kernel: dlm: lvm_global: add member 33
Mar 12 19:25:16 tioga39 kernel: dlm: connecting to 33
Mar 12 19:25:16 tioga39 kernel: dlm: got connection from 36
Mar 12 19:25:16 tioga39 kernel: dlm: lvm_global: add member 31
Mar 12 19:25:16 tioga39 kernel: nvme nvme11: rescanning namespaces.   !!!! Devices start showing up here

The reason for the hang appears to be that vgchange --lock-start takes the lvm_global lock but then fails to find the VG. Since this doesn't fail every time, there seems to be a timing window in which the acquisition of the global lock ends up stuck rather than being released.
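
Given that the namespaces do eventually show up (the nvme rescan a moment later), one mitigation sketch is for the caller to wait until the VG is actually discoverable before issuing the lock-start. This is an illustrative wrapper, not what nnf-clientmountd currently does; the polling method (vgs --readonly) and the deadline are assumptions.

#!/bin/bash
# Sketch: don't issue "vgchange --lock-start" until the VG's devices are actually
# discoverable, since the lock-start does not retry once it has taken lvm_global
# (and killing a hung vgchange leaves that lock wedged, as described below).
VG="a29de411-35dd-46d6-97fa-00534dc9ab3b_0_0"   # VG from the compute-node log above
DEADLINE=$((SECONDS + 60))                      # illustrative bound, not a tuned value

until vgs --readonly --noheadings "$VG" >/dev/null 2>&1; do
    (( SECONDS < DEADLINE )) || { echo "timed out waiting for devices of $VG" >&2; exit 1; }
    sleep 1
done

vgchange --lock-start "$VG"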

Some additional debugging that @behlendorf did on this issue:

After digging into this a bit more I think you're on the right track. If I'm reading this right, the last line of the lvmlockctl -i output indicates that the vgchange --lock-start command took the lvm_global lock:

LK VG un ver 0
LK GL un ver 0
LW GL sh ver 0 pid 578704 (vgchange)

It was then unable to take the VG lock for the a29de411-35dd-46d6-97fa-00534dc9ab3b_0_0 namespace because the devices were not yet available. The NVMe namespaces were eventually discovered shortly after this, but apparently it doesn't retry in this case. This results in the command hanging. Subsequently, we killed the hung vgchange --lock-start, but there doesn't appear to be any kind of on-close file descriptor handler, or other mechanism, so while the process exits the dlm still believes it's holding the lvm_global lock as a writer. This is why any other vg* commands issued end up hanging.
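
For anyone hitting this state, the inspection side looks roughly like the following. lvmlockctl -i is what produced the output quoted above; the dlm_tool invocations are standard dlm utilities for listing lockspaces and dumping lock state, and their exact output is not shown in this issue.

# Show lvmlockd's view of its lockspaces and the held/waiting locks
lvmlockctl -i

# List the dlm lockspaces on this node (lvm_global plus one per started shared VG)
dlm_tool ls

# Dump lock state for the global lockspace from the dlm's point of view
dlm_tool lockdebug lvm_global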

Unfortunately, I wasn't able to manually release the lvm_global lock, so I power cycled the compute node. I suspect it may be possible, but I didn't have any luck. After the node was rebooted, I was able to cancel the job and the workflow progressed through Teardown and cleaned everything up.

I think there are a couple takeaways: