NearNodeFlash / NearNodeFlash.github.io

View this document https://nearnodeflash.github.io/

Abnormal workflow termination can orphan NVMe namespaces #148

Open ajfloeder opened 2 months ago

ajfloeder commented 2 months ago

Abnormally terminated workflows can fail to clean up their NVMe namespaces, leaving them orphaned on the Rabbit.

Documenting the symptom here. Not yet sure of the root cause(s).

Cleanup method

If all of your workflows have completed, you can check whether a particular Rabbit has orphaned NVMe namespaces by running:

~/tools/nvme.sh list

Because no workflows are running, any namespaces listed there are orphaned.

The easy way to delete these namespaces is:

  1. delete the nnfnodeecdata resource for the Rabbit in question
  2. delete the nnf-node-manager pod for the Rabbit in question

The nnf-node-manager pod will restart automatically. Because its nnfnodeecdata resource has been removed, it will clean up all existing namespaces during initialization.
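
The two steps above could be scripted roughly as follows. This is only a sketch and not a tested procedure: it assumes the nnfnodeecdata resource lives in a Kubernetes namespace named after the Rabbit and that the nnf-node-manager pods run in a namespace called `nnf-system`. Neither detail is stated in this issue, so verify both on your system. The function name `cleanup_commands` and the example node and pod names are hypothetical. The function only prints the `kubectl` commands so they can be reviewed before being run. One way to find the pod name for a given Rabbit is `kubectl get pods -n <namespace> -o wide`.

```shell
# Print the kubectl commands for the two cleanup steps on one Rabbit.
#   $1: Rabbit node name (hypothetical example: rabbit-node-1)
#   $2: name of the nnf-node-manager pod running on that Rabbit
#   $3: namespace of the nnf-node-manager pods (ASSUMED default: nnf-system)
cleanup_commands() {
    local rabbit="$1"
    local pod="$2"
    local ns="${3:-nnf-system}"

    # Step 1: delete the nnfnodeecdata resource for this Rabbit.
    # ASSUMPTION: the resource lives in a namespace named after the Rabbit.
    echo "kubectl delete nnfnodeecdata --all -n ${rabbit}"

    # Step 2: delete the nnf-node-manager pod; it restarts automatically.
    echo "kubectl delete pod ${pod} -n ${ns}"
}

# Example (prints the commands without running them):
#   cleanup_commands rabbit-node-1 nnf-node-manager-abcde
```

Printing the commands instead of executing them is deliberate: deleting the nnfnodeecdata resource causes every existing namespace on that Rabbit to be removed at the next initialization, so it is worth confirming first that no workflows are still using the Rabbit.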