You referenced this file: https://github.com/NearNodeFlash/nnf-container-example/blob/master/nnf-container-example.yaml when I asked what Marty was doing with his "kubectl delete". If you run "kubectl delete" on that file, you'll rip out the NnfContainerProfile from a workflow that is actively using it. That would be bad. And you'll do a straight-up delete of the Workflow resource without first putting it into Teardown state; that would be bad as well. That's why you had to do so much manual cleanup last week.
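For reference, a safer sequence is to request Teardown and let the drivers finish before deleting anything; a minimal sketch, assuming a hypothetical workflow name, that the Workflow reports its state in `.status.state`, and kubectl >= 1.23 for the jsonpath wait:

```
# Ask the drivers to tear everything down first (workflow name is hypothetical).
kubectl patch workflow my-workflow --type=merge \
  -p '{"spec":{"desiredState":"Teardown"}}'

# Wait for teardown to complete before removing the resource.
kubectl wait workflow my-workflow \
  --for=jsonpath='{.status.state}'=Teardown --timeout=120s

# Only now is it safe to delete the Workflow (and, after that, the profile).
kubectl delete workflow my-workflow
```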
I agree with that. But in that case, it's container-related, whereas this is referring to an undeploy of nnf-dm. I'll do some more testing, but I think if NnfDataMovements are present, the undeploy of nnf-dm will get stuck until those NnfDataMovements are removed.
Those would not be present if the workflow had been through teardown, and had been deleted, right?
No. Even with properly torn down and removed workflows, the undeployment of nnf-dm will still hang if NnfDataMovements are present. These DMs are created via the Copy Offload API, so the workflow doesn't clean them up on teardown. It's up to the user of the Copy Offload API to clean up and remove the NnfDataMovements. The namespace then gets stuck terminating, since those NnfDataMovements live in that namespace:
```
➜  ~ kubectl get workflows -A
No resources found

➜  ~ kubectl get nnfdatamovements -A
NAMESPACE       NAME           STATE      STATUS    AGE
nnf-dm-system   nnf-dm-pcg5h   Finished   Success   6d20h
nnf-dm-system   nnf-dm-f2vq7   Finished   Success   6d20h
nnf-dm-system   nnf-dm-6zq8f   Finished   Success   6d20h

➜  ~ kubectl get ns | grep nnf
nnf-lustre-fs-system      Active        27d
nnf-system                Active        13d
nnf-system-needs-triage   Active        6d23h
nnf-dm-system             Terminating   6d20h
```
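For what it's worth, a minimal sketch of the manual cleanup before undeploying nnf-dm, assuming the Finished DMs above are no longer needed by the Copy Offload user:

```
# Remove the completed NnfDataMovements so the namespace can finish terminating.
# Only safe once the Copy Offload user is done with them (all show
# Finished/Success here).
kubectl delete nnfdatamovements -n nnf-dm-system --all
```

If the nnf-dm controller has already been undeployed, the finalizers will block this delete and have to be cleared by hand (see the sketch further down).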
Looks like the worker pods get stuck too. I'm wondering if it's because of the NnfDataMovements:
```
$ k get all -n nnf-dm-system
NAME                      READY   STATUS        RESTARTS   AGE
pod/nnf-dm-worker-lh7xb   0/2     Terminating   0          128m
pod/nnf-dm-worker-cc8pj   0/2     Terminating   0          128m
```
What does the metadata section look like for those?
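One way to pull that out, using one of the pod names from the listing above (the metadata block at the top of the output shows the finalizers, ownerReferences, and deletionTimestamp):

```
# Inspect the metadata section of one of the stuck worker pods.
kubectl get pod nnf-dm-worker-lh7xb -n nnf-dm-system -o yaml
```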
NnfDataMovements can block the undeployment of nnf-dm due to the finalizers on these resources. Consider removing the finalizers when the data movement operation is complete. That way, I think an undeploy of nnf-dm will delete these rather than get stuck.
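In the meantime, if the controller is already gone, a manual escape hatch is to clear the finalizers by hand; a minimal sketch, using one of the DM names above (this bypasses the controller's cleanup, so only use it on DMs that are already Finished):

```
# Clear the finalizer list on a stuck NnfDataMovement so Kubernetes can
# garbage-collect it and let the nnf-dm-system namespace finish terminating.
kubectl patch nnfdatamovement nnf-dm-pcg5h -n nnf-dm-system \
  --type=merge -p '{"metadata":{"finalizers":[]}}'
```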
@roehrich-hpe - thoughts?