NearNodeFlash / NearNodeFlash.github.io


Race between `persistentdw` and `destroy_persistent` #169

Open bdevcich opened 1 week ago

bdevcich commented 1 week ago

With system-test running workflows in parallel (J>1), we can hit a case where the persistent storage usage tests race with the destroy case. The workflow using the persistent storage can't finish PreRun because the destroy workflow beats it, and then both workflows are stuck until the usage workflow is removed.

So we end up with:

$ kubectl get workflows
NAME                         STATE      READY   STATUS       JOBID         AGE
fluxjob-172781307426766848   PreRun     false   DriverWait   fQGBxxiWiv3   30m
fluxjob-172781996030820352   Teardown   false   Error        fQGCH3r6VR9   29m

NnfAccess for the usage workflow says:

status:
  error:
    debugMessage: 'unable to create ClientMount resources: ClientMount.dataworkflowservices.github.io
      "default-fluxjob-172781307426766848-0-computes" is invalid: spec.mounts: Invalid
      value: 0: spec.mounts in body should have at least 1 items'
    severity: Minor
    type: Internal
    userMessage: unable to mount file system on client nodes

There's no ClientMount yet.

The destroy workflow says:

  message: 'DW Directive 0: User error: persistent storage cannot be deleted while
    in use'
  ready: false
  readyChange: "2024-06-20T20:09:44.290609Z"
  state: Teardown
  status: Error

Could the destroy workflow check the directivebreakdowns for any use of the directive name before it is allowed to leave Proposal? That way, as long as any usage workflow has moved past Proposal, there will be a directivebreakdown that contains the persistent name:

  directive: '#DW persistentdw name=persistent-xfs-7c8c30f2'

Then the destroy workflow can't leave Proposal until there are no directivebreakdowns left that contain that persistent name.
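
A minimal sketch of that check (in Go, using an unstructured client so it doesn't depend on the generated DWS types): the dataworkflowservices.github.io group name comes from the error above, but the v1alpha2 version and the spec.directive field path are assumptions and would need to match the deployed API.

package controllers

import (
    "context"
    "strings"

    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// persistentStorageInUse reports whether any DirectiveBreakdown still references
// the named persistent storage through a "#DW persistentdw name=<name>" directive.
// The destroy workflow would stay in Proposal while this returns true.
func persistentStorageInUse(ctx context.Context, c client.Client, name string) (bool, error) {
    list := &unstructured.UnstructuredList{}
    list.SetGroupVersionKind(schema.GroupVersionKind{
        Group:   "dataworkflowservices.github.io",
        Version: "v1alpha2", // assumption: use whatever version is deployed
        Kind:    "DirectiveBreakdownList",
    })
    if err := c.List(ctx, list); err != nil {
        return false, err
    }

    needle := "persistentdw name=" + name
    for _, db := range list.Items {
        // assumption: the raw #DW string lives at spec.directive
        directive, found, err := unstructured.NestedString(db.Object, "spec", "directive")
        if err != nil || !found {
            continue
        }
        if strings.Contains(directive, needle) {
            return true, nil
        }
    }
    return false, nil
}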

bdevcich commented 1 week ago

fluxjob-172781307426766848.zip fluxjob-172781996030820352.zip

bdevcich commented 5 days ago

We determined that this is most likely a flux issue and there is no race condition here.

What is happening is that flux is picking different computes and those computes are not tied to the same rabbit that has the persistent filesystem.

So if a persistent gfs2 filesystem is created on rabbit-0 and then a compute attached to rabbit-1 tries to use it, the filesystem cannot be mounted on that compute.

This does not appear to be an issue with Lustre since it can be mounted from anywhere.
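
For illustration, a hypothetical helper (in Go) that captures that constraint, assuming a compute-to-rabbit mapping is available from something like the SystemConfiguration resource; the names here are made up.

package controllers

// computesWithoutAccess returns the selected computes whose rabbit is not the one
// backing the persistent file system. For rabbit-local file systems such as xfs or
// gfs2, those computes cannot mount the storage; Lustre is exempt because it can be
// mounted over the network from anywhere.
func computesWithoutAccess(selected []string, computeToRabbit map[string]string, storageRabbit string) []string {
    var unreachable []string
    for _, compute := range selected {
        if computeToRabbit[compute] != storageRabbit {
            unreachable = append(unreachable, compute)
        }
    }
    return unreachable
}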