NearNodeFlash / NearNodeFlash.github.io

View this document https://nearnodeflash.github.io/

Consider removing nnfdatamovement finalizers after completion. #75

Closed bdevcich closed 6 months ago

bdevcich commented 1 year ago

nnfdatamovements can block the undeployment of nnf-dm due to the finalizers on these resources. Consider removing the finalizers when the data movement operation is complete.

That way, I think an undeploy of nnf-dm will delete these rather than get stuck.
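
For reference, the manual equivalent of what this would automate is clearing the finalizers on a finished NnfDataMovement by hand (sketch only; the resource name here is just an example, and doing this is only safe once the data movement has actually finished):

# JSON merge patch that drops all finalizers from a completed NnfDataMovement,
# roughly what the controller would do automatically on completion
kubectl patch nnfdatamovement nnf-dm-pcg5h -n nnf-dm-system \
    --type=merge -p '{"metadata":{"finalizers":null}}'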

@roehrich-hpe - thoughts?

roehrich-hpe commented 1 year ago

You referenced this file: https://github.com/NearNodeFlash/nnf-container-example/blob/master/nnf-container-example.yaml when I asked what Marty was doing with his "kubectl delete". If you run "kubectl delete" on that file, you'll rip out the NnfContainerProfile from a workflow that is actively using it. That would be bad. You'll also do a straight-up delete of the Workflow resource without first putting it into Teardown state; that would be bad as well. That's why you had to do so much manual cleanup last week.

bdevcich commented 1 year ago

I agree with that. But that case is container related, whereas this is referring to an undeploy of nnf-dm. I'll do some more testing, but I think if NnfDataMovements are present, the undeploy of nnf-dm will get stuck until those NnfDataMovements are removed.

roehrich-hpe commented 1 year ago

Those would not be present if the workflow had been through teardown, and had been deleted, right?

bdevcich commented 1 year ago

> Those would not be present if the workflow had been through teardown, and had been deleted, right?

No. Even with properly torn down and removed workflows, the undeployment of nnf-dm will still hang if NnfDataMovements are present. These DMs are created via the Copy Offload API, so the workflow doesn't clean them up on teardown; it's up to the user of the Copy Offload API to clean up and remove the NnfDataMovements. The namespace then gets stuck terminating, since those NnfDataMovements live in that namespace.

➜  ~ kubectl get workflows -A
No resources found
➜  ~ kubectl get nnfdatamovements -A
NAMESPACE       NAME           STATE      STATUS    AGE
nnf-dm-system   nnf-dm-pcg5h   Finished   Success   6d20h
nnf-dm-system   nnf-dm-f2vq7   Finished   Success   6d20h
nnf-dm-system   nnf-dm-6zq8f   Finished   Success   6d20h
➜  ~ kubectl get ns | grep nnf
nnf-lustre-fs-system      Active        27d
nnf-system                Active        13d
nnf-system-needs-triage   Active        6d23h
nnf-dm-system             Terminating   6d20h
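
To confirm it's the leftover NnfDataMovements holding the namespace, something like the following should show what the namespace is waiting on and which finalizers are still attached (sketch; the NamespaceContentRemaining / NamespaceFinalizersRemaining conditions come from Kubernetes namespace deletion in general, not from nnf-dm):

# what the namespace controller says it is waiting on
kubectl get namespace nnf-dm-system -o jsonpath='{.status.conditions}'

# finalizers still attached to the leftover NnfDataMovements
kubectl get nnfdatamovements -n nnf-dm-system \
    -o custom-columns=NAME:.metadata.name,FINALIZERS:.metadata.finalizers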

bdevcich commented 1 year ago

Looks like the worker pods get stuck too; I'm wondering if it's because of the NnfDataMovements:

 $ k get all -n nnf-dm-system
NAME                      READY   STATUS        RESTARTS   AGE
pod/nnf-dm-worker-lh7xb   0/2     Terminating   0          128m
pod/nnf-dm-worker-cc8pj   0/2     Terminating   0          128m

roehrich-hpe commented 1 year ago

> Looks like the worker pods get stuck too; I'm wondering if it's because of the NnfDataMovements:
>
>  $ k get all -n nnf-dm-system
> NAME                      READY   STATUS        RESTARTS   AGE
> pod/nnf-dm-worker-lh7xb   0/2     Terminating   0          128m
> pod/nnf-dm-worker-cc8pj   0/2     Terminating   0          128m

What does the metadata section look like for those?
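
(For example, to pull just the relevant fields for one of the pods above; this is plain kubectl, nothing nnf-specific:)

# finalizers and deletionTimestamp on one of the stuck worker pods
kubectl get pod nnf-dm-worker-lh7xb -n nnf-dm-system \
    -o jsonpath='{.metadata.finalizers}{"\n"}{.metadata.deletionTimestamp}{"\n"}'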

bdevcich commented 6 months ago

Fixed via: