NearNodeFlash / NearNodeFlash.github.io


Workflows stuck - namespaces not appearing on node; ns-rescan recovers #35

Open behlendorf opened 1 year ago

behlendorf commented 1 year ago

Submitting 50 batch jobs with Flux, each of which requests an xfs filesystem and a copy_in, resulted in two of the workflows getting stuck in teardown. The other 48 workflows do appear to have run correctly. The stuck workflows are named fluxjob-167223879540933632 and fluxjob-167224799486018560, and contained the following #DW directives (as did all of the jobs).

#DW jobdw capacity=1TiB type=xfs name=test24
#DW copy_in source=/p/lslide/behlendo/test/copy_in-bw/ destination=$DW_JOB_test24/    

We see the following error logged for both workflows. However, it looks like this error is usually transient since it's also logged for several of the other workflows which did complete.

2023-02-13T22:29:25.707Z        INFO    controllers.NnfWorkflow Start   {"Workflow": "default/fluxjob-167223879540933632", "state": "PreRun", "index": 0, "directive": "#DW jobdw capacity=1TiB type=xfs name=test24"}
2023-02-13T22:29:25.708Z        INFO    controllers.NnfWorkflow Start done      {"Workflow": "default/fluxjob-167223879540933632", "state": "PreRun", "index": 0}
2023-02-13T22:29:25.708Z        INFO    controllers.NnfWorkflow Finish  {"Workflow": "default/fluxjob-167223879540933632", "state": "PreRun", "index": 0, "directive": "#DW jobdw capacity=1TiB type=xfs name=test24"}
2023-02-13T22:29:25.708Z        INFO    controllers.NnfWorkflow Finish error    {"Workflow": "default/fluxjob-167223879540933632", "state": "PreRun", "index": 0, "Message": "Could not access file system on nodes: Error on NnfAccess default/fluxjob-167223879540933632-0-computes: Mount failed: Could not find VG/LV pair default-fluxjob-167223879540933632-0-xfs-0-0_168a260a-4b09-40e9-89ad-678f740e8d43/lv:   lv default-fluxjob-167223798171436032-0-xfs-0-0_e2c61343-c601-46a5-b06e-53ed24ec109e -wi-----p- <1023.94g        \n"}
2023-02-13T22:29:25.708Z        ERROR   Reconciler error        {"controller": "workflow", "controllerGroup": "dws.cray.hpe.com", "controllerKind": "Workflow", "Workflow": {"name":"fluxjob-167223879540933632","namespace":"default"}, "namespace": "default", "name": "fluxjob-167223879540933632", "reconcileID": "7c09a781-f4df-4fa5-ba71-01ffa6e2fc8e", "error": "Could not access file system on nodes: Error on NnfAccess default/fluxjob-167223879540933632-0-computes: Mount failed: Could not find VG/LV pair default-fluxjob-167223879540933632-0-xfs-0-0_168a260a-4b09-40e9-89ad-678f740e8d43/lv:   lv default-fluxjob-167223798171436032-0-xfs-0-0_e2c61343-c601-46a5-b06e-53ed24ec109e -wi-----p- <1023.94g        \n"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:326
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234

Both logical volumes do exist on the rabbit-p but are not mounted.

# lvs
  LV   VG                                                                                Attr       LSize     Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lv   default-fluxjob-167223879540933632-0-xfs-0-0_168a260a-4b09-40e9-89ad-678f740e8d43 -wi-a----- <1023.94g                                                    
  lv   default-fluxjob-167224799486018560-0-xfs-0-0_f841185a-8f25-4a30-9596-a6adc25f1708 -wi-a----- <1023.94g   

Inspecting the node manager logs also suggests there may be some other issue.

2023-02-14T01:39:17.472-0800    INFO    controllers.NnfClientMount      Unmounted file system   {"ClientMount": "hetchy44/default-fluxjob-167223879540933632-0-servers", "Mount path": "/mnt/nnf/168a260a-4b09-40e9-89ad-678f740e8d43-0/0"}
W0214 07:38:11.186214  550988 reflector.go:347] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: watch of *v1alpha1.NnfNode ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0214 07:38:11.186403  550988 reflector.go:347] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: watch of *v1alpha1.Workflow ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0214 07:38:11.186305  550988 reflector.go:347] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: watch of *v1alpha1.NnfNodeStorage ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0214 07:38:11.186501  550988 reflector.go:347] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: watch of *v1alpha1.SystemConfiguration ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0214 07:38:11.186462  550988 reflector.go:347] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: watch of *v1.Namespace ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0214 07:38:11.186547  550988 reflector.go:347] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: watch of *v1alpha1.NnfNodeECData ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0214 07:38:11.186566  550988 reflector.go:347] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: watch of *v1alpha1.NnfNodeECData ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0214 07:38:11.186504  550988 reflector.go:347] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: watch of *v1alpha1.NnfNode ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0214 07:38:11.186581  550988 reflector.go:347] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: watch of *v1alpha1.NnfNodeStorage ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0214 07:38:11.186687  550988 reflector.go:347] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: watch of *v1alpha1.SystemConfiguration ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0214 07:38:11.186686  550988 reflector.go:347] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: watch of *v1alpha1.NnfStorageProfile ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0214 07:38:11.186723  550988 reflector.go:347] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: watch of *v1alpha1.Storage ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0214 07:38:11.186839  550988 reflector.go:347] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: watch of *v1alpha1.ClientMount ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0214 07:38:11.187162  550988 reflector.go:347] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: watch of *v1alpha1.ClientMount ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
behlendorf commented 1 year ago

Digging a bit deeper, clientmountd logged this error on the compute node. Additionally, the logical volumes are in fact not reported by lvs on the compute node, but there are NVMe namespaces still attached.

Feb 14 14:13:26 hetchy29 clientmountd[32052]: 1.6764128060150645e+09        INFO        controllers.ClientMount        Unmounting all file systems due to resource deletion        {"ClientMount": "hetchy29/default-fluxjob-167223879540933632-0-computes"}
Feb 14 14:13:26 hetchy29 clientmountd[32052]: 1.6764128062876759e+09        INFO        controllers.ClientMount        Could not find VG/LV pair default-fluxjob-167223879540933632-0-xfs-0-0_168a260a-4b09-40e9-89ad-678f740e8d43/lv:
Feb 14 14:13:26 hetchy29 clientmountd[32052]: 1.6764128062877223e+09        ERROR        controllers.ClientMount        Could not deactivate LVM volume        {"ClientMount": "hetchy29/default-fluxjob-167223879540933632-0-computes", "mount path": "/mnt/nnf/168a260a-4b09-40e9-89ad-678f740e8d43-0", "error": "Could not find VG/LV pair default-fluxjob-167223879540933632-0-xfs-0-0_168a260a-4b09-40e9-89ad-678f740e8d43/lv: "}
Feb 14 14:13:26 hetchy29 clientmountd[32052]: github.com/HewlettPackard/dws/mount-daemon/controllers.(*ClientMountReconciler).unmountAll
Feb 14 14:13:26 hetchy29 clientmountd[32052]:         /builddir/build/BUILD/dws-clientmount-1.0~beta4/mount-daemon/controllers/clientmount_controller.go:154
Feb 14 14:13:26 hetchy29 clientmountd[32052]: github.com/HewlettPackard/dws/mount-daemon/controllers.(*ClientMountReconciler).Reconcile
Feb 14 14:13:26 hetchy29 clientmountd[32052]:         /builddir/build/BUILD/dws-clientmount-1.0~beta4/mount-daemon/controllers/clientmount_controller.go:86
Feb 14 14:13:26 hetchy29 clientmountd[32052]: sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
Feb 14 14:13:26 hetchy29 clientmountd[32052]:         /builddir/build/BUILD/dws-clientmount-1.0~beta4/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:121
Feb 14 14:13:26 hetchy29 clientmountd[32052]: sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
Feb 14 14:13:26 hetchy29 clientmountd[32052]:         /builddir/build/BUILD/dws-clientmount-1.0~beta4/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:320
Feb 14 14:13:26 hetchy29 clientmountd[32052]: sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
Feb 14 14:13:26 hetchy29 clientmountd[32052]:         /builddir/build/BUILD/dws-clientmount-1.0~beta4/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273
Feb 14 14:13:26 hetchy29 clientmountd[32052]: sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
Feb 14 14:13:26 hetchy29 clientmountd[32052]:         /builddir/build/BUILD/dws-clientmount-1.0~beta4/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234
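
For reference, here is a minimal Go sketch of the kind of VG/LV lookup behind the "Could not find VG/LV pair" message, shelling out to lvs much like the daemon appears to. The helper name and the exact lvs flags are assumptions for illustration, not the actual clientmountd code:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// vgLvExists reports whether the given VG/LV pair is visible to LVM on this
// node. It shells out to lvs; the flag choice here is an assumption.
func vgLvExists(vg, lv string) (bool, error) {
	out, err := exec.Command("lvs", "--noheadings", "-o", "vg_name,lv_name").CombinedOutput()
	if err != nil {
		return false, fmt.Errorf("lvs failed: %w: %s", err, out)
	}
	for _, line := range strings.Split(string(out), "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == vg && fields[1] == lv {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// VG/LV names taken from the log output above.
	found, err := vgLvExists("default-fluxjob-167223879540933632-0-xfs-0-0_168a260a-4b09-40e9-89ad-678f740e8d43", "lv")
	if err != nil {
		panic(err)
	}
	fmt.Println("VG/LV pair present:", found)
}
```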
behlendorf commented 1 year ago

> but there are NVMe namespaces still attached.

After debugging this a bit further, that statement is almost true. I believe we've only seen this teardown issue after canceling a workflow that was never mounted on the compute node. The client mount daemon cannot complete because, while the NVMe namespaces are visible, some or all of them report a size of zero bytes. Manually issuing an nvme ns-rescan does appear to allow the node to correctly detect all of the namespaces and then mount. For example:

Node                   Namespace Usage                      Format           FW Rev  
---------------------  --------- -------------------------- ---------------- --------
/dev/nvme0n1           1         973.40  MB /   1.92  TB    512   B +  0 B   GDC7102Q         
/dev/nvme1n1           1          33.79  MB /  68.72  GB      4 KiB +  0 B   S103              
/dev/nvme10n1          1          33.76  MB /  68.72  GB      4 KiB +  0 B   S103              
/dev/nvme11n1          1           0.00   B /   0.00   B      1   B +  0 B   S103               <<<<<<
/dev/nvme12n1          1          33.76  MB /  68.72  GB      4 KiB +  0 B   S103              
/dev/nvme13n1          1          33.76  MB /  68.72  GB      4 KiB +  0 B   S103              
/dev/nvme14n1          1          33.76  MB /  68.72  GB      4 KiB +  0 B   S103              
/dev/nvme15n1          1          33.76  MB /  68.72  GB      4 KiB +  0 B   S103              
/dev/nvme16n1          1          33.79  MB /  68.72  GB      4 KiB +  0 B   S103              
/dev/nvme17n1          1         965.21  MB /   1.92  TB    512   B +  0 B   GDC7102Q         
/dev/nvme2n1           2          33.78  MB /  68.72  GB      4 KiB +  0 B   S103              
/dev/nvme3n1           1          33.79  MB /  68.72  GB      4 KiB +  0 B   S103              
/dev/nvme4n1           1          33.79  MB /  68.72  GB      4 KiB +  0 B   S103              
/dev/nvme5n1           1          33.76  MB /  68.72  GB      4 KiB +  0 B   S103              
/dev/nvme6n1           2          33.79  MB /  68.72  GB      4 KiB +  0 B   S103              
/dev/nvme7n1           1          33.76  MB /  68.72  GB      4 KiB +  0 B   S103              
/dev/nvme8n1           1          33.79  MB /  68.72  GB      4 KiB +  0 B   S103              
/dev/nvme9n1           1          33.73  MB /  68.72  GB      4 KiB +  0 B   S103  

Then, after an nvme ns-rescan /dev/nvme11, the namespace is updated with the correct size and the mount proceeds correctly.

/dev/nvme11n2          2          33.76  MB /  68.72  GB      4 KiB +  0 B   S103                                                                 

Thus far we have only seen this behavior on one of the two Bard Peak blades. The other appears to reliably detect changes to the namespaces.
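
The manual workaround above could be automated: look for namespaces that report a size of zero sectors in sysfs and issue nvme ns-rescan against their parent controllers. A minimal Go sketch of that idea follows; it is an illustration only, not clientmountd behavior:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"regexp"
	"strings"
)

// nsPattern matches NVMe namespace block devices (e.g. nvme11n1) and captures
// the parent controller name (nvme11).
var nsPattern = regexp.MustCompile(`^(nvme\d+)n\d+$`)

// rescanZeroSizedNamespaces scans /sys/block for NVMe namespaces that report a
// size of zero sectors and issues "nvme ns-rescan" against their controllers,
// mirroring the manual workaround described above.
func rescanZeroSizedNamespaces() error {
	entries, err := os.ReadDir("/sys/block")
	if err != nil {
		return err
	}
	controllers := map[string]bool{}
	for _, e := range entries {
		m := nsPattern.FindStringSubmatch(e.Name())
		if m == nil {
			continue
		}
		raw, err := os.ReadFile(filepath.Join("/sys/block", e.Name(), "size"))
		if err != nil {
			continue
		}
		if strings.TrimSpace(string(raw)) == "0" {
			controllers[m[1]] = true
		}
	}
	for ctrl := range controllers {
		dev := "/dev/" + ctrl
		fmt.Println("rescanning", dev)
		if out, err := exec.Command("nvme", "ns-rescan", dev).CombinedOutput(); err != nil {
			return fmt.Errorf("nvme ns-rescan %s: %w: %s", dev, err, out)
		}
	}
	return nil
}

func main() {
	if err := rescanZeroSizedNamespaces(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```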

ajfloeder commented 1 year ago

Latest investigation on the EAS3 system: on a subset of the Bard Peak blades, the namespaces don't appear on the node after the Rabbit attaches them. Once a BP node fails to discover attached namespaces, subsequent workflows also fail. Note that not all BP nodes show this behavior.

Failure Scenario

  1. The workflow progresses to the PreRun state; the Rabbit creates the storage group for the BP node, which attaches the namespaces to that node.
  2. clientmountd on the Bard Peak learns from the clientmount k8s resource that it should have a set of namespaces attached, so it searches for /dev/nvme<#>n<#> devices but can't find them; they haven't yet been created on the Bard Peak (see the sketch after this list).
  3. At this point the workflow fails because the filesystem cannot be mounted on the Bard Peak:
     35.164s: job.exception type=exception severity=0 DWS/Rabbit interactions failed: workflow in 'Error' state too long: DW Directive 0: Could not access file system on nodes
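
As an illustration of the device lookup in step 2, here is a minimal Go sketch of listing the NVMe namespace block devices visible on a node. It is not the actual clientmountd code and the helper name is hypothetical:

```go
package main

import (
	"fmt"
	"path/filepath"
	"regexp"
)

// listNvmeNamespaces returns the NVMe namespace block devices currently
// visible on the node (e.g. /dev/nvme11n2). Partition nodes such as
// /dev/nvme0n1p1 are excluded.
func listNvmeNamespaces() ([]string, error) {
	nsOnly := regexp.MustCompile(`^/dev/nvme\d+n\d+$`)
	candidates, err := filepath.Glob("/dev/nvme*n*")
	if err != nil {
		return nil, err
	}
	var namespaces []string
	for _, dev := range candidates {
		if nsOnly.MatchString(dev) {
			namespaces = append(namespaces, dev)
		}
	}
	return namespaces, nil
}

func main() {
	namespaces, err := listNvmeNamespaces()
	if err != nil {
		panic(err)
	}
	// On a Bard Peak node hitting this issue, the list would be missing the
	// namespaces the Rabbit just attached.
	fmt.Println(namespaces)
}
```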

Digging deeper: on the BP node, if nvme ns-rescan is issued after the Rabbit has created the storage group, the namespaces usually appear and clientmountd can proceed. In one case yesterday, however, the nvme ns-rescan did not cause the namespaces to be discovered and created on the BP node.

Scenario beginning at the point where the namespaces failed to appear

  1. nvme ns-rescan fails to discover and create /dev/nvme<#>n<#> devices corresponding to the attached namespaces Rabbit assigned.
  2. unload/reload nvme kmods (no effect)
  3. ~10 minutes passed
  4. stop clientmountd to prevent it from immediately deleting the namespaces should they appear. Recall that the workflow is now in a failure state, so clientmountd is expected to perform teardown actions.
  5. nvme ns-rescan this time discovers and creates the /dev/nvme<#>n<#> devices
  6. restart clientmountd
  7. clientmountd finds the namespaces and assembles and mounts the filesystem

As an experiment, the following scenario successfully mounted the filesystem:

  1. stop clientmountd
  2. submit workflow from flux
  3. wait for PreRun stage
  4. manually issue nvme ns-rescan on the BP node
  5. restart clientmountd
  6. everything mounts correctly on the BP
  7. workflow completed the PreRun stage successfully

Questions:

  1. Why didn’t the namespaces show up in the first place (I think we know we need to force a ns-rescan for this to be reliable)?
  2. Why didn’t the namespaces show up in the first ns-rescan?
  3. What happened between the first ns-rescan and the second that allowed the second one to work?

It seems clear that clientmountd must issue an nvme ns-rescan operation, at least in the cases where the namespaces should be present but aren't. It is not clear what the downside of issuing the ns-rescan is, if any. A sketch of that kind of fallback follows.
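
Here is a minimal Go sketch of the fallback being proposed: if expected namespace devices are missing, issue nvme ns-rescan against every controller and re-check. The expected-device list, retry count, and delay are illustrative assumptions, not the actual clientmountd change:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"regexp"
	"time"
)

// ctrlPattern matches NVMe controller character devices (e.g. /dev/nvme11),
// excluding namespace block devices like /dev/nvme11n2.
var ctrlPattern = regexp.MustCompile(`^/dev/nvme\d+$`)

// waitForNamespaces checks for the expected namespace block devices and, if
// any are missing, issues "nvme ns-rescan" on every controller before checking
// again. Retry policy here is arbitrary.
func waitForNamespaces(expected []string, retries int, delay time.Duration) error {
	for attempt := 0; ; attempt++ {
		var missing []string
		for _, dev := range expected {
			if _, err := os.Stat(dev); err != nil {
				missing = append(missing, dev)
			}
		}
		if len(missing) == 0 {
			return nil
		}
		if attempt >= retries {
			return fmt.Errorf("namespaces still missing after %d rescans: %v", retries, missing)
		}
		// Rescan every controller; a rescan targeted at only the affected
		// controllers would also work if that mapping is known.
		devs, _ := filepath.Glob("/dev/nvme*")
		for _, dev := range devs {
			if ctrlPattern.MatchString(dev) {
				_ = exec.Command("nvme", "ns-rescan", dev).Run() // ignore per-controller errors
			}
		}
		time.Sleep(delay)
	}
}

func main() {
	// Hypothetical expected device; in clientmountd this would come from the
	// ClientMount resource.
	if err := waitForNamespaces([]string{"/dev/nvme11n2"}, 3, 5*time.Second); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```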

matthew-richerson commented 1 year ago

I modified clientmountd to run nvme ns-rescan when it can't find the devices.