behlendorf opened this issue 1 year ago
Digging a bit deeper, clientmountd logged this error on the compute node. Additionally, the logical volumes are in fact not reported by `lvs` on the compute, but there are NVMe namespaces still attached.
Feb 14 14:13:26 hetchy29 clientmountd[32052]: 1.6764128060150645e+09 INFO controllers.ClientMount Unmounting all file systems due to resource deletion {"ClientMount": "hetchy29/default-fluxjob-167223879540933632-0-computes"}
Feb 14 14:13:26 hetchy29 clientmountd[32052]: 1.6764128062876759e+09 INFO controllers.ClientMount Could not find VG/LV pair default-fluxjob-167223879540933632-0-xfs-0-0_168a260a-4b09-40e9-89ad-678f740e8d43/lv:
Feb 14 14:13:26 hetchy29 clientmountd[32052]: 1.6764128062877223e+09 ERROR controllers.ClientMount Could not deactivate LVM volume {"ClientMount": "hetchy29/default-fluxjob-167223879540933632-0-computes", "mount path": "/mnt/nnf/168a260a-4b09-40e9-89ad-678f740e8d43-0", "error": "Could not find VG/LV pair default-fluxjob-167223879540933632-0-xfs-0-0_168a260a-4b09-40e9-89ad-678f740e8d43/lv: "}
Feb 14 14:13:26 hetchy29 clientmountd[32052]: github.com/HewlettPackard/dws/mount-daemon/controllers.(*ClientMountReconciler).unmountAll
Feb 14 14:13:26 hetchy29 clientmountd[32052]: /builddir/build/BUILD/dws-clientmount-1.0~beta4/mount-daemon/controllers/clientmount_controller.go:154
Feb 14 14:13:26 hetchy29 clientmountd[32052]: github.com/HewlettPackard/dws/mount-daemon/controllers.(*ClientMountReconciler).Reconcile
Feb 14 14:13:26 hetchy29 clientmountd[32052]: /builddir/build/BUILD/dws-clientmount-1.0~beta4/mount-daemon/controllers/clientmount_controller.go:86
Feb 14 14:13:26 hetchy29 clientmountd[32052]: sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
Feb 14 14:13:26 hetchy29 clientmountd[32052]: /builddir/build/BUILD/dws-clientmount-1.0~beta4/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:121
Feb 14 14:13:26 hetchy29 clientmountd[32052]: sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
Feb 14 14:13:26 hetchy29 clientmountd[32052]: /builddir/build/BUILD/dws-clientmount-1.0~beta4/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:320
Feb 14 14:13:26 hetchy29 clientmountd[32052]: sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
Feb 14 14:13:26 hetchy29 clientmountd[32052]: /builddir/build/BUILD/dws-clientmount-1.0~beta4/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273
Feb 14 14:13:26 hetchy29 clientmountd[32052]: sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
Feb 14 14:13:26 hetchy29 clientmountd[32052]: /builddir/build/BUILD/dws-clientmount-1.0~beta4/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234
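For reference, here is a minimal sketch of the kind of VG/LV existence check that is failing above, assuming a plain `lvs --noheadings -o vg_name,lv_name` listing; the helper name and invocation are illustrative only, not the actual dws/clientmountd code:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// vgLvExists is a hypothetical helper (not the dws code): it lists the VG/LV
// pairs that lvs reports and checks whether the requested pair is among them,
// i.e. the check that produces "Could not find VG/LV pair" when the logical
// volume is not visible on the compute node.
func vgLvExists(vg, lv string) (bool, error) {
	out, err := exec.Command("lvs", "--noheadings", "-o", "vg_name,lv_name").Output()
	if err != nil {
		return false, fmt.Errorf("lvs failed: %w", err)
	}
	for _, line := range strings.Split(string(out), "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == vg && fields[1] == lv {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	found, err := vgLvExists("default-fluxjob-167223879540933632-0-xfs-0-0_168a260a-4b09-40e9-89ad-678f740e8d43", "lv")
	fmt.Println(found, err)
}
```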
> but there are NVMe namespaces still attached.
After debugging this a bit further, this is almost true. I believe we've only seen this teardown issue after canceling a workflow which was never mounted on the compute. The client mount daemon cannot complete because, while the NVMe namespaces are visible, some or all of them will report a size of zero bytes. Manually issuing an `nvme ns-rescan` does appear to allow the node to correctly detect all of the namespaces and then mount. For example:
Node Namespace Usage Format FW Rev
--------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 1 973.40 MB / 1.92 TB 512 B + 0 B GDC7102Q
/dev/nvme1n1 1 33.79 MB / 68.72 GB 4 KiB + 0 B S103
/dev/nvme10n1 1 33.76 MB / 68.72 GB 4 KiB + 0 B S103
/dev/nvme11n1 1 0.00 B / 0.00 B 1 B + 0 B S103 <<<<<<
/dev/nvme12n1 1 33.76 MB / 68.72 GB 4 KiB + 0 B S103
/dev/nvme13n1 1 33.76 MB / 68.72 GB 4 KiB + 0 B S103
/dev/nvme14n1 1 33.76 MB / 68.72 GB 4 KiB + 0 B S103
/dev/nvme15n1 1 33.76 MB / 68.72 GB 4 KiB + 0 B S103
/dev/nvme16n1 1 33.79 MB / 68.72 GB 4 KiB + 0 B S103
/dev/nvme17n1 1 965.21 MB / 1.92 TB 512 B + 0 B GDC7102Q
/dev/nvme2n1 2 33.78 MB / 68.72 GB 4 KiB + 0 B S103
/dev/nvme3n1 1 33.79 MB / 68.72 GB 4 KiB + 0 B S103
/dev/nvme4n1 1 33.79 MB / 68.72 GB 4 KiB + 0 B S103
/dev/nvme5n1 1 33.76 MB / 68.72 GB 4 KiB + 0 B S103
/dev/nvme6n1 2 33.79 MB / 68.72 GB 4 KiB + 0 B S103
/dev/nvme7n1 1 33.76 MB / 68.72 GB 4 KiB + 0 B S103
/dev/nvme8n1 1 33.79 MB / 68.72 GB 4 KiB + 0 B S103
/dev/nvme9n1 1 33.73 MB / 68.72 GB 4 KiB + 0 B S103
Then, after an `nvme ns-rescan /dev/nvme11`, the namespace is updated with the correct size and the mount proceeds correctly:
/dev/nvme11n2 2 33.76 MB / 68.72 GB 4 KiB + 0 B S103
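A minimal sketch of how that zero-size condition could be detected and repaired from the client side. It assumes the stuck namespace reports a zero `size` in sysfs and that the controller device can be derived by stripping the namespace suffix; both are assumptions, and NVMe multipath configurations may behave differently:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"regexp"
	"strings"
)

// rescanZeroSizedNamespaces scans /sys/block for NVMe namespaces that report
// a size of zero sectors and issues `nvme ns-rescan` against their controller.
// This is a sketch of the manual workaround described above, not clientmountd code.
func rescanZeroSizedNamespaces() error {
	nsRe := regexp.MustCompile(`^(nvme\d+)n\d+$`)
	paths, err := filepath.Glob("/sys/block/nvme*")
	if err != nil {
		return err
	}
	for _, path := range paths {
		name := filepath.Base(path)
		m := nsRe.FindStringSubmatch(name)
		if m == nil {
			continue
		}
		raw, err := os.ReadFile(filepath.Join(path, "size"))
		if err != nil {
			continue
		}
		if strings.TrimSpace(string(raw)) == "0" {
			// Assumption: the controller char device shares the nvme<#> prefix.
			ctrl := "/dev/" + m[1]
			fmt.Printf("namespace %s reports zero size, rescanning %s\n", name, ctrl)
			if err := exec.Command("nvme", "ns-rescan", ctrl).Run(); err != nil {
				return fmt.Errorf("nvme ns-rescan %s: %w", ctrl, err)
			}
		}
	}
	return nil
}

func main() {
	if err := rescanZeroSizedNamespaces(); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```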
Thus far we have only seen this behavior on one of the two Bard Peak blades. The other appears to reliably detect changes to the namespaces.
Latest investigation on EAS3 system. On a subset of the Bard Peak blades, the namespaces don't appear on the node after the Rabbit attaches them. Once a BP node fails to discover attached namespaces, subsequent workflows also fail. Note: not all BP nodes show this behavior, however.
Failure Scenario
1. In the PreRun state, Rabbit creates the storage group for the BP node, which attaches the namespaces to that node.
2. clientmountd on the BP node looks for the `/dev/nvme<#>n<#>` devices, but it can't find them. They haven't yet been created on Bard Peak.
3. The job fails with: `35.164s: job.exception type=exception severity=0 DWS/Rabbit interactions failed: workflow in 'Error' state too long: DW Directive 0: Could not access file system on nodes`
Digging deeper
On the BP node, if `nvme ns-rescan` is issued after the point where the Rabbit has created the storage group, 'usually' the namespaces appear and clientmountd can proceed. In a case yesterday, however, the `nvme ns-rescan` did not cause the namespaces to be discovered and created on the BP node.
Scenario beginning at the point where the namespaces failed to appear:
1. `nvme ns-rescan` fails to discover and create the `/dev/nvme<#>n<#>` devices corresponding to the attached namespaces Rabbit assigned.
2. A subsequent `nvme ns-rescan` this time discovers and creates the `/dev/nvme<#>n<#>` devices.

As an experiment, the following successful scenario correctly mounted the filesystem:
1. `nvme ns-rescan` on the BP node

Questions:
It seems clear that clientmountd must issue an `nvme ns-rescan` operation, at least in the cases where the namespaces should be present but aren't. It is not clear what the downside of issuing the ns-rescan is, if any.
I modified clientmountd to run `nvme ns-rescan` when it can't find the devices.
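A minimal sketch of that kind of fallback, using a hypothetical `waitForDevice` helper rather than the actual clientmountd change:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"regexp"
	"time"
)

// waitForDevice is a hypothetical sketch of the change described above (not
// the actual patch): if the expected namespace device has not appeared yet,
// issue `nvme ns-rescan` against every NVMe controller and check again until
// the timeout expires.
func waitForDevice(devPath string, timeout time.Duration) error {
	ctrlRe := regexp.MustCompile(`^/dev/nvme\d+$`)
	deadline := time.Now().Add(timeout)
	for {
		if _, err := os.Stat(devPath); err == nil {
			return nil // device node exists, mounting can proceed
		}
		// Rescan each controller so newly attached namespaces are discovered.
		ctrls, _ := filepath.Glob("/dev/nvme*")
		for _, ctrl := range ctrls {
			if ctrlRe.MatchString(ctrl) {
				_ = exec.Command("nvme", "ns-rescan", ctrl).Run()
			}
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("device %s did not appear within %s", devPath, timeout)
		}
		time.Sleep(time.Second)
	}
}

func main() {
	if err := waitForDevice("/dev/nvme11n1", 30*time.Second); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

This sketch rescans unconditionally before each retry, which ties back to the open question above about whether repeated `ns-rescan` operations have any downside.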
Submitting 50 batch jobs with Flux, each of which requests an `xfs` filesystem and a `copy_in`, resulted in two of the workflows getting stuck in teardown. The other 48 workflows do appear to have run correctly. The stuck workflows contained the following `#DW` directives (as did all of them) and are named `fluxjob-167223879540933632` and `fluxjob-167224799486018560`.

We see the following error logged for both workflows. However, it looks like this error is usually transient, since it's also logged for several of the other workflows which did complete.
Both logical volumes do exist on the rabbit-p but are not mounted.
Inspecting the node manager logs also seems to show there may be some other issue.