Open ajfloeder opened 3 months ago
That flux command line does not have `--requires`, so it's going to a different rabbit on each iteration, right? Or is it coming back to the same rabbit, maybe sometimes or all the time?
It depends on the scheduler policy, but I'm guessing you're using the `first` policy, so if that's the only job in the system, it should be going to the same compute node and rabbit every time.
I'm not sure where the flux command shown above is coming from. The test.log file shows the following command:
Thu Mar 7 08:30:15 PST 2024: flux run -vvv --wait-event=clean --requires host:tioga37 -q pdebug --nodes=1 --ntasks=1 --setattr=system.dw="#DW jobdw capacity=600GiB type=xfs name=jobxfs5" hostname
I omitted the `--requires` to avoid exposing the compute node's name, but I guess that has just happened above. Sorry for the confusion.
A new instance of this deadlock occurred on the compute side, where the `vgchange --lock-start` operation hangs. On the compute node, this operation is started by `nnf-clientmountd` and is not killed if it takes too long.
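As a minimal sketch (not the actual nnf-clientmountd implementation; the function name, package, and timeout are illustrative, assuming the controller shells out via Go's os/exec), one way such a command could be bounded with a deadline so a hung invocation doesn't block forever:

```go
package lvm

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// startVGLock runs `vgchange --lock-start <vg>` and kills the process if it
// exceeds the timeout. Sketch only; the timeout value is a placeholder.
func startVGLock(vg string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	out, err := exec.CommandContext(ctx, "vgchange", "--lock-start", vg).CombinedOutput()
	if ctx.Err() == context.DeadlineExceeded {
		// Caveat from later in this thread: killing a hung vgchange can leave
		// the lvm_global lock held in dlm, so a timeout here should surface as
		// a node-level fault rather than being silently retried.
		return fmt.Errorf("vgchange --lock-start %s timed out: %s", vg, out)
	}
	return err
}
```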
From the manpage for lvmlockd(8):

> A shared VG can be started after all the following are true:
> - lvmlockd is running
> - the lock manager is running
> - the VG's devices are visible on the system

The last condition there, "the VG's devices are visible on the system", is key.
In the latest instance of the deadlock on the compute node, we see that `clientmountd` attempts the `vgchange --lock-start` command before the VG's devices have been discovered. This directly violates the last stipulation in lvmlockd(8).
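One way that precondition could be honored before issuing the lock-start is to poll until LVM can actually see the VG. A sketch, continuing the illustrative code above (same package and imports); it assumes `vgs <vgname>` exits non-zero until the VG's devices are visible, and the one-second poll interval is arbitrary:

```go
// waitForVG blocks until the VG's devices are visible, per the lvmlockd(8)
// precondition, or the context expires. Sketch only, not clientmountd code.
func waitForVG(ctx context.Context, vg string) error {
	for {
		// `vgs <vg>` fails while LVM cannot see the VG's devices.
		if err := exec.CommandContext(ctx, "vgs", vg).Run(); err == nil {
			return nil // the VG and its PVs are now visible
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("VG %s never became visible: %w", vg, ctx.Err())
		case <-time.After(time.Second):
			// Poll again; an NVMe namespace rescan (as in the log below) can
			// make the devices appear at any moment.
		}
	}
}
```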
Mar 12 19:25:16 tioga39 clientmountd[413541]: 2024-03-12T19:25:16-07:00 DEBUG controllers.ClientMount Command Run {"ClientMount": {"name":"default-fluxjob-804506591016518656-0-computes","namespace":"tioga39"}, "index": 0, "command": "vgchange --lock-start a29de411-35dd-46d6-97fa-00534dc9ab3b_0_0"}
Mar 12 19:25:16 tioga39 kernel: dlm: Using TCP for communications
Mar 12 19:25:16 tioga39 kernel: dlm: lvm_global: joining the lockspace group...
Mar 12 19:25:16 tioga39 kernel: dlm: lvm_global: dlm_recover 1
Mar 12 19:25:16 tioga39 kernel: dlm: lvm_global: group event done 0 0
Mar 12 19:25:16 tioga39 kernel: dlm: lvm_global: add member 102
Mar 12 19:25:16 tioga39 kernel: dlm: connecting to 102
Mar 12 19:25:16 tioga39 kernel: dlm: got connection from 31
Mar 12 19:25:16 tioga39 kernel: dlm: lvm_global: add member 41
Mar 12 19:25:16 tioga39 kernel: dlm: connecting to 41
Mar 12 19:25:16 tioga39 kernel: dlm: got connection from 28
Mar 12 19:25:16 tioga39 kernel: dlm: lvm_global: add member 40
Mar 12 19:25:16 tioga39 kernel: dlm: connecting to 40
Mar 12 19:25:16 tioga39 kernel: dlm: got connection from 34
Mar 12 19:25:16 tioga39 kernel: dlm: lvm_global: add member 39
Mar 12 19:25:16 tioga39 kernel: dlm: lvm_global: add member 38
Mar 12 19:25:16 tioga39 kernel: dlm: connecting to 38
Mar 12 19:25:16 tioga39 kernel: dlm: got connection from 35
Mar 12 19:25:16 tioga39 kernel: dlm: lvm_global: add member 36
Mar 12 19:25:16 tioga39 kernel: dlm: connecting to 36
Mar 12 19:25:16 tioga39 kernel: dlm: got connection from 40
Mar 12 19:25:16 tioga39 kernel: dlm: lvm_global: add member 35
Mar 12 19:25:16 tioga39 kernel: dlm: got connection from 27
Mar 12 19:25:16 tioga39 kernel: dlm: lvm_global: add member 34
Mar 12 19:25:16 tioga39 kernel: dlm: got connection from 29
Mar 12 19:25:16 tioga39 kernel: dlm: lvm_global: add member 33
Mar 12 19:25:16 tioga39 kernel: dlm: connecting to 33
Mar 12 19:25:16 tioga39 kernel: dlm: got connection from 36
Mar 12 19:25:16 tioga39 kernel: dlm: lvm_global: add member 31
Mar 12 19:25:16 tioga39 kernel: nvme nvme11: rescanning namespaces. !!!! Devices start showing up here
The reason for the hang appears to be that `vgchange --lock-start` takes the `lvm_global` lock but fails to find the VG. Since this doesn't fail every time, there seems to be a timing window in which the acquisition of the global lock is stickier than at other times.
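If that timing window is real, one mitigation would be to order the operations so the lock-start is only attempted once the VG is visible. An illustrative usage tying the two sketches above together (the timeout values are arbitrary placeholders):

```go
// startShared closes the suspected timing window by confirming visibility
// first, then bounding the lock-start. Sketch only.
func startShared(vg string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	if err := waitForVG(ctx, vg); err != nil {
		return err
	}
	return startVGLock(vg, 30*time.Second)
}
```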
Some additional debugging that @behlendorf did on this issue:
After digging into this a bit more, I think you're on the right track. If I'm reading this right, the last line of the `lvmlockctl -i` output indicates that the `vgchange --lock-start` command took the `lvm_global` lock.
LK VG un ver 0
LK GL un ver 0
LW GL sh ver 0 pid 578704 (vgchange)
It was then unable to take the VG lock for the a29de411-35dd-46d6-97fa-00534dc9ab3b_0_0 namespace because the devices were not yet available. The NVMe namespaces were eventually discovered shortly after this, but apparently it doesn't retry in this case, so the command hangs. Subsequently, we killed the hung `vgchange --lock-start`, but there doesn't appear to be any kind of on-close file descriptor handler, or other mechanism, so while the process exits, `dlm` still believes it's holding the `lvm_global` lock as a writer. This is why any other `vg*` command issued ends up hanging.
Unfortunately, I wasn't able to manually release the `lvm_global` lock, so I power cycled the compute node. I suspect it may be possible, but I didn't have any luck. After the node was rebooted, I was able to cancel the job, and the workflow progressed through Teardown and cleaned everything up.
I think there are a couple of takeaways:
- `vgchange --lock-*` can be dangerous and should be avoided.
- Starting the lock before the VG's devices are visible seems like a good way to reproduce the issue in your environment.
Scenario:
Flux command:
On the 5th execution of the test, the workflow stalled trying to complete the PreRun state where the filesystem is mounted on the compute node.
That iteration started:
Thu Mar 7 08:30:15 PST 2024: Iteration # 5/100000
The last command attempted by the Rabbit was:
In the console log on the rabbit, it looks like lvmlockd is blocked:
Refer to the directory /usr/workspace/rabbits/hangs/start_time_2024-03-07_08-26-04 for the log files.