When an XFS workflow does not become ready in the PreRun state within a reasonable time because the clientmount resource cannot mount the filesystem, Flux moves the workflow to the Teardown state. In one observed case, lvmlockd was having problems, so the clientmount for the compute node never mounted the filesystem. When the workflow entered the Teardown state, the clientmount resource failed to delete, even though the filesystem had never been successfully mounted.
Removing the finalizer in this case allowed the resource to delete successfully.
Here is a snippet of the clientmount resource that failed to tear down:
<snip>
Spec:
  Desired State: mounted
  Mounts:
    Device:
      Device Reference:
        Object Reference:
          Kind: NnfNodeStorage
          Name: default-fluxjob-432679157810333696-0-xfs-0
          Namespace: <removed-computenode>
      Lvm:
        Device Type: nvme
      Type: lvm
    Mount Path: /mnt/nnf/7b688d96-7893-4a78-a88e-c966603dea97-0
    Options:
    Set Permissions: false
    Target Type: directory
    Type: xfs
  Node: <removed-computenode>
Status:
  Error:
    Debug Message: unable to unmount file system: could not deactivate block device after unmount /mnt/nnf/7b688d96-7893-4a78-a88e-c966603dea97-0: command: vgchange --lock-stop c6681c07-d650-4d9d-aae6-8376a03b3e53_0_0 - stderr: WARNING: lvmlockd process is not running.
Reading VG c6681c07-d650-4d9d-aae6-8376a03b3e53_0_0 without a lock.
Command failed with status code 5.
- stdout: - error: exit status 5
    Severity: Major
    Type: Internal
  Mounts:
    Ready: false
    State: mounted
Events: <none>
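For anyone hitting the same state, the finalizer workaround described above can be sketched roughly as follows. This is only an illustration, not the project's recommended procedure: it assumes the resource is reachable through `kubectl` under the name `clientmount`, and the function names and placeholder arguments are hypothetical. Clearing finalizers bypasses the controller's own cleanup, so it is only reasonable when the filesystem was never actually mounted, as in this report.

```python
import json
import subprocess


def build_finalizer_patch_cmd(name, namespace, resource="clientmount"):
    """Build a kubectl argv that clears metadata.finalizers on one resource.

    Assumption: the custom resource can be addressed as `clientmount`;
    adjust `resource` if the cluster exposes it under a different name.
    """
    # In a JSON merge patch, setting a field to null removes that field.
    patch = json.dumps({"metadata": {"finalizers": None}})
    return [
        "kubectl", "patch", resource, name,
        "-n", namespace,
        "--type=merge",
        "-p", patch,
    ]


def remove_finalizers(name, namespace, dry_run=True):
    """Print the command, and only execute it when dry_run is False."""
    cmd = build_finalizer_patch_cmd(name, namespace)
    print(" ".join(cmd))
    if not dry_run:
        # check=True raises CalledProcessError if kubectl exits non-zero.
        subprocess.run(cmd, check=True)
    return cmd
```

Calling `remove_finalizers("<clientmount-name>", "<compute-node-namespace>")` with the default `dry_run=True` only prints the command, so it can be reviewed before anything is changed on the cluster.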