NearNodeFlash / NearNodeFlash.github.io


Flux picks computes that are attached to missing rabbits #159

Open bdevcich opened 1 month ago

bdevcich commented 1 month ago

Attempting to run a workflow on el cap:

[devcich1@elcap1:system-test]$ N=1 Q=iotesting make sanity
bats -j 1 --filter-tags tag:sanity .
./system-test.bats
 ✗ XFS
   tags: tag:sanity tag:simple tag:xfs
   (in test file ./system-test.bats, line 49)
     `#DW jobdw type=xfs name=xfs capacity=50GB" \' failed
   16.026s: job.exception type=dws-setup severity=0 DWS workflow interactions failed: 'XXX'
   16.079s: job.exception type=prolog severity=0 prolog killed by signal 15 (timeout or job canceled)

1 test, 1 failure

In this case, I'm using system test to create a simple xfs workflow, but this behavior is the same for any filesystem type. It's effectively running:

flux run -l -N${N} --wait-event=clean -q iotesting --setattr=dw="#DW jobdw type=xfs name=xfs capacity=50GB"

The workflow is going from Proposal directly to Teardown. I do not believe the workflow itself throws an error when watching changes to the workflow with:

kubectl get workflows -w -A -oyaml | grep -i error

No errors appear in the output.

Tracing the compute node to find its rabbit node shows that the rabbit node is not in the cluster:

[devcich1@elcap1:~]$ kubectl get node XXX
Error from server (NotFound): nodes "XXX" not found

[devcich1@elcap1:~]$ kubectl get nnfnodes -n elcapXXX
No resources found in elcapXXX namespace.
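The two checks above can be wrapped in a small helper for repeated use. This is only a sketch: the function name `check_rabbit` is made up for illustration, and it assumes the NnfNode resources live in a namespace named after the rabbit node, as the `-n elcapXXX` command above suggests. It uses only `kubectl get` with `-n` and `-o name`.

```shell
# Hypothetical helper: report whether a rabbit is known to the cluster.
# Assumption: the nnfnodes namespace has the same name as the rabbit node.
check_rabbit() {
    rabbit="$1"
    rc=0
    # `kubectl get node` exits non-zero when the node is NotFound.
    if ! kubectl get node "$rabbit" >/dev/null 2>&1; then
        echo "node $rabbit: not found"
        rc=1
    fi
    # "No resources found" still exits 0, so test for empty output instead.
    if [ -z "$(kubectl get nnfnodes -n "$rabbit" -o name 2>/dev/null)" ]; then
        echo "nnfnodes in namespace $rabbit: none"
        rc=1
    fi
    return $rc
}
```

Calling `check_rabbit <rabbit-name>` prints one line per failed check and returns non-zero if either check fails, so it can be used in a loop over the rabbits that serve a set of compute nodes.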

jameshcorbett commented 1 month ago

There's already a fix for this (I'm fairly sure), but the rest of the flux team doesn't want me to put it in place yet on elcap, because they're working on sorting out some other issues. One thing you can do to avoid it (inconvenient, I know, sorry) is to force flux to choose specific compute nodes with flux run --requires=hosts:elcap[12-15] or similar.
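As a rough sketch of that workaround applied to the command from the original report, the helper below prints (it does not run) a `flux run` line with the `--requires=hosts:` constraint added. The function name `build_pinned_flux_cmd` and the trailing `hostname` command are placeholders invented for illustration; the report does not say which command the job runs.

```shell
# Hypothetical helper: print the flux command from the report with a
# --requires=hosts: constraint so flux only picks the named computes.
# The final `hostname` is a placeholder job command, not from the report.
build_pinned_flux_cmd() {
    hosts="$1"   # hostlist of computes whose rabbits are known to be present
    nodes="$2"   # node count, the N from the system test
    queue="$3"   # queue name, the Q from the system test
    printf 'flux run -l -N%s --wait-event=clean -q %s --requires=hosts:%s --setattr=dw="#DW jobdw type=xfs name=xfs capacity=50GB" hostname\n' \
        "$nodes" "$queue" "$hosts"
}

build_pinned_flux_cmd 'elcap[12-15]' 1 iotesting
```

When running the printed command by hand, quoting the hostlist (for example `--requires='hosts:elcap[12-15]'`) keeps the shell from treating the brackets as a glob pattern.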