NearNodeFlash / NearNodeFlash.github.io

Workflow stuck in `PreRun` #145

Open ajfloeder opened 3 months ago

ajfloeder commented 3 months ago

Scenario:

  1. Create a workflow in flux.
  2. Flux progresses the workflow to the PreRun state, where it never becomes ready.

Rabbit: tioga102 Compute: tioga39

$ grep tioga39 /etc/coral2/xhost_mapping 
tioga39 x1000c1s6b1n0

Looking at the nnf-system_nnf-node-manager log for the time in question, we see the filesystem being created successfully, but there is never an attempt to attach the namespaces to tioga39. The ClientMount resource for tioga39 is created, however, which would indicate that the nnfnodeblockstorage controller believed it had attached the namespaces to the compute node. There are no failures in the log to indicate that it attempted to map anything and failed.
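
For reference, this is roughly how the log and the block storage resource were inspected; the pod name below is a placeholder, and the resource plural and namespace are assumptions based on standard CRD naming, not values captured from this system.

```
# Filter the node-manager log for the compute node in question
# (pod name is a placeholder):
kubectl logs -n nnf-system <nnf-node-manager-pod> | grep -i tioga39

# Cross-check the NnfNodeBlockStorage resource the controller acts on;
# the per-compute attach state should be visible under .status:
kubectl get nnfnodeblockstorages -A
kubectl get nnfnodeblockstorage <name> -n <namespace> -o yaml
```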

Digging into the configuration a bit deeper, we see that the PCIe status of the link to tioga39 was actually offline while this workflow was stuck.

# /admin/scripts/nnf/switch.sh status
Execute switch status on /dev/switchtec0
DEVICE: /dev/switchtec0 PAX_ID: 1

Switch Connection           Status
=========================== ======
Interswitch Link            UP
Drive Slot 4                UP
Drive Slot 5                UP
Drive Slot 6                UP
Drive Slot 2                UP
Drive Slot 1                DOWN
Drive Slot 9                UP
Drive Slot 10               UP
Drive Slot 11               UP
Drive Slot 3                UP
Rabbit,       x9000c?j7b0   UP
Compute 8,    x9000c?s4b0n0 DOWN
Compute 9,    x9000c?s4b1n0 DOWN
Compute 10,   x9000c?s5b0n0 DOWN
Compute 11,   x9000c?s5b1n0 DOWN
Compute 12,   x9000c?s6b0n0 UP
Compute 13,   x9000c?s6b1n0 DOWN   <<<< tioga39
Compute 14,   x9000c?s7b0n0 DOWN
Compute 15,   x9000c?s7b1n0 DOWN

Execute switch status on /dev/switchtec1
DEVICE: /dev/switchtec1 PAX_ID: 0

Switch Connection           Status
=========================== ======
Interswitch Link            UP
Drive Slot 8                UP
Drive Slot 7                UP
Drive Slot 15               UP
Drive Slot 16               UP
Drive Slot 17               UP
Drive Slot 18               UP
Drive Slot 14               UP
Drive Slot 13               DOWN
Drive Slot 12               UP
Rabbit,       x9000c?j7b0   UP
Compute 0,    x9000c?s0b0n0 DOWN
Compute 1,    x9000c?s0b1n0 UP
Compute 2,    x9000c?s1b0n0 UP
Compute 3,    x9000c?s1b1n0 UP
Compute 4,    x9000c?s2b0n0 UP
Compute 5,    x9000c?s2b1n0 UP
Compute 6,    x9000c?s3b0n0 UP
Compute 7,    x9000c?s3b1n0 UP
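
(As an aside, if the switchtec-user CLI is available on the rabbit, the per-port link state can presumably also be read directly; the wrapper script above likely builds on it. This is an assumption about the tooling, not a command captured from this system.)

```
# Assumes the switchtec-user tools are installed on the rabbit node
switchtec status /dev/switchtec0
switchtec status /dev/switchtec1
```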

$ kubectl get storages.dataworkflowservices.github.io tioga102 -o yaml
apiVersion: dataworkflowservices.github.io/v1alpha2
kind: Storage
metadata:
  creationTimestamp: "2024-03-12T23:56:54Z"
  generation: 1
  labels:
    dataworkflowservices.github.io/storage: Rabbit
  name: tioga102
  namespace: default
  resourceVersion: "63456188"
  uid: 4552b44e-1e68-416a-9576-d8597d20f4d0
spec:
  state: Enabled
status:
  access:
    computes:
    - name: tioga26
      status: Offline
    - name: tioga27
      status: Ready
    - name: tioga28
      status: Ready
    - name: tioga29
      status: Ready
    - name: tioga30
      status: Ready
    - name: tioga31
      status: Ready
    - name: tioga32
      status: Ready
    - name: tioga33
      status: Ready
    - name: tioga34
      status: Offline
    - name: tioga35
      status: Offline
    - name: tioga36
      status: Offline
    - name: tioga37
      status: Offline
    - name: tioga38
      status: Ready
    - name: tioga39     <<<<<<<<<<<<<<< tioga39
      status: Offline   <<<<<<<<<<<<<<< Offline
    - name: tioga40
      status: Offline
    - name: tioga41
      status: Offline
    protocol: PCIe
    servers:
    - name: tioga102
      status: Ready
  capacity: 2965239273881
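
A quicker way to pull just the field we cared about, instead of reading the whole resource, is a jsonpath query against the same Storage resource (a minimal sketch using the node names from above):

```
kubectl get storages.dataworkflowservices.github.io tioga102 \
  -o jsonpath='{.status.access.computes[?(@.name=="tioga39")].status}'
# prints: Offline
```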

We confirmed with `lspci -PP | grep KIO` that tioga39 had no PCIe connections to the drives.
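
(The check was run on tioga39 itself; with healthy links this lists the PCIe paths to the drive functions exported by the rabbit, presumably KIOXIA devices, hence the grep pattern. Here it returned nothing.)

```
# Run on the compute node (tioga39); empty output means no visible drive functions
lspci -PP | grep KIO
```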

We decided to reboot tioga39 to see if the PCIe connections could be restored. Sure enough, after tioga39 rebooted, the link status was restored. Good news!

However, the workflow still stayed in PreRun with Ready==false. Looking at the storages.dataworkflowservices.github.io tioga102 resource, tioga39's status had not changed. We restarted the nnf-node-manager pod on tioga102, which caused the storages.dataworkflowservices.github.io tioga102 resource to be updated. Once we did that, the workflow successfully completed the PreRun state and proceeded all the way through Teardown.
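
Restarting the pod was done along these lines; the namespace comes from the log name above, but the pod selection is shown with placeholders rather than the real pod name.

```
# Find the nnf-node-manager pod running on rabbit tioga102, then delete it so
# its owning controller recreates it and the Storage status gets refreshed:
kubectl get pods -n nnf-system -o wide | grep nnf-node-manager | grep tioga102
kubectl delete pod -n nnf-system <nnf-node-manager-pod-on-tioga102>
```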

Issues:

  1. Why didn't the nnf-node-manager log show anything when it either attempted to attach the namespaces to tioga39 and failed, or skipped tioga39 entirely because it was offline?
  2. Why wasn't the storages.dataworkflowservices.github.io tioga102 resource updated when tioga39 rebooted and its PCIe link was restored? (See the sketch after this list.)
  3. Why did flux allow a job to run on tioga39 when that compute resource was offline?
  4. Should the workflow sit there waiting for the compute node, or should it fail if it is assigned a compute node that is offline when it reaches the PreRun stage?
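
For issue 2, one low-effort way to see whether the Storage status ever catches up on its own after a compute reboot (rather than only after nnf-node-manager is restarted) is to poll the same jsonpath field as above. This is a debugging sketch, not something we ran during this incident.

```
# Poll tioga39's access status in the Storage resource every 30 seconds
while true; do
  kubectl get storages.dataworkflowservices.github.io tioga102 \
    -o jsonpath='{.status.access.computes[?(@.name=="tioga39")].status}'
  echo
  sleep 30
done
```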