NearNodeFlash / NearNodeFlash.github.io

View this document https://nearnodeflash.github.io/

Clientmount fails to Teardown after mount failure #134

Open ajfloeder opened 4 months ago

ajfloeder commented 4 months ago

When an XFS workflow does not become ready within a suitable time during the PreRun state because the clientmount resource cannot mount the filesystem, Flux puts the workflow into the Teardown state. In one observed case, lvmlockd was having issues, so the clientmount for the compute node never mounted the filesystem. When the workflow entered the Teardown state, the clientmount resource failed to delete, even though the filesystem had never been successfully mounted.

Here is a snippet of the clientmount resource that failed to tear down. Removing the finalizer in this case allowed the resource to delete successfully.

<snip>
Spec:
  Desired State:  mounted
  Mounts:
    Device:
      Device Reference:
        Object Reference:
          Kind:       NnfNodeStorage
          Name:       default-fluxjob-432679157810333696-0-xfs-0
          Namespace:  <removed-computenode>
      Lvm:
        Device Type:  nvme
      Type:           lvm
    Mount Path:       /mnt/nnf/7b688d96-7893-4a78-a88e-c966603dea97-0
    Options:
    Set Permissions:  false
    Target Type:      directory
    Type:             xfs
  Node:               <removed-computenode>
Status:
  Error:
    Debug Message:  unable to unmount file system: could not deactivate block device after unmount /mnt/nnf/7b688d96-7893-4a78-a88e-c966603dea97-0: command: vgchange --lock-stop c6681c07-d650-4d9d-aae6-8376a03b3e53_0_0 - stderr:   WARNING: lvmlockd process is not running.
  Reading VG c6681c07-d650-4d9d-aae6-8376a03b3e53_0_0 without a lock.
  Command failed with status code 5.
 - stdout:  - error: exit status 5
    Severity:  Major
    Type:      Internal
  Mounts:
    Ready:  false
    State:  mounted
Events:     <none>
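
To make the expected behavior concrete, here is a minimal Python sketch of a teardown path that tolerates a failed block-device deactivation when the filesystem was never mounted. This is not the NNF implementation: `teardown_mount`, `TeardownResult`, and the injected `is_mounted`, `unmount`, and `deactivate` callables are hypothetical names used only for illustration, and whether it is actually safe to ignore a failed `vgchange --lock-stop` in this situation is an assumption the maintainers would need to confirm.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TeardownResult:
    unmounted: bool      # True if an unmount was actually performed
    deactivated: bool    # True if the block device was deactivated cleanly
    warning: str = ""    # Non-fatal error text, if any


def teardown_mount(
    path: str,
    is_mounted: Callable[[str], bool],
    unmount: Callable[[str], None],
    deactivate: Callable[[], None],
) -> TeardownResult:
    """Hypothetical teardown: only treat deactivation errors as fatal
    when the filesystem had really been mounted."""
    was_mounted = is_mounted(path)
    if was_mounted:
        # A real unmount failure still propagates to the caller.
        unmount(path)
    try:
        deactivate()
    except RuntimeError as err:
        if was_mounted:
            # Something was in use, so the failure is significant.
            raise
        # Assumption: nothing was ever mounted, so record the error as a
        # warning and let the resource finish deleting.
        return TeardownResult(unmounted=False, deactivated=False, warning=str(err))
    return TeardownResult(unmounted=was_mounted, deactivated=True)


def failing_deactivate() -> None:
    """Stub that mimics the lvmlockd failure reported in this issue."""
    raise RuntimeError("WARNING: lvmlockd process is not running.")


if __name__ == "__main__":
    result = teardown_mount(
        "/mnt/nnf/example",        # placeholder path, not a real mount
        is_mounted=lambda p: False,
        unmount=lambda p: None,
        deactivate=failing_deactivate,
    )
    print(result)
```

With a policy like this, the never-mounted case from this issue would finish teardown with a warning instead of leaving the finalizer in place, while a deactivation failure after a real unmount would still surface as an error.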