cloudfoundry-incubator / quarks-operator

BOSH releases deployed on Kubernetes
https://www.cloudfoundry.org/project-quarks/
Apache License 2.0

Wait for all drain scripts to finish #1302

Closed · manno closed this 3 years ago

manno commented 3 years ago

Motivation and Context

This adds a loop to wait for all other bpm containers after a container's own drain script has finished (see the sketch below).

#177254980

This draft also adds the shared emptyDir to the init containers, even though they don't need it.

Fixes https://github.com/cloudfoundry-incubator/quarks-operator/issues/1297
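
For context, here is a minimal sketch of that mechanism. This is not the actual operator code; the stamp directory, variable names, and expected count are all assumptions:

```bash
#!/bin/bash
# Illustrative sketch only: paths and variables are assumed, not the operator's.
STAMP_DIR=/mnt/drain-stamps            # shared emptyDir mounted into every bpm container
NAME="${CONTAINER_NAME:?}"             # assumed to be set per container
EXPECTED="${DRAIN_CONTAINER_COUNT:?}"  # assumed number of bpm containers in the pod

# Run this job's own drain script first, then signal completion with a stamp file.
/var/vcap/jobs/"${JOB_NAME:?}"/bin/drain
touch "${STAMP_DIR}/${NAME}"

# Wait until every other bpm container has written its stamp before exiting.
while [ "$(find "${STAMP_DIR}" -mindepth 1 | wc -l)" -lt "${EXPECTED}" ]; do
  sleep 1
done
```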

jandubois commented 3 years ago

Hi @manno!

I see you have merged this PR; are you going to make a quarks release with this change, or are you waiting for it to be tested from a dev build before you commit to a new release?

I'm hoping to see some confirmation from @univ0298 that it works as expected.

univ0298 commented 3 years ago

I was waiting to see a release, but if that's not coming, please let me know, @manno. Thanks!

manno commented 3 years ago

Either way is fine for me :) I'll create a release then.

manno commented 3 years ago

OK, interesting. It seems the Helm chart artifact from CI (from https://github.com/cloudfoundry-incubator/quarks-operator/actions/runs/732318599) is not public.

The only things visible are the Docker images: https://github.com/users/cfcontainerizationbot/packages/container/package/quarks-operator-dev

The release might take until tomorrow, so I'll attach the dev Helm chart here: helm chart.zip
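
(If anyone wants to try the attachment, here is a hypothetical way to install it, assuming the zip contains a standard Helm 3 chart tarball; the release name, tarball name, and namespace are placeholders:)

```bash
# Names and namespace are placeholders; the zip contents are assumed.
unzip "helm chart.zip"
helm install quarks ./quarks-*.tgz --namespace quarks --create-namespace
```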

manno commented 3 years ago

I just released https://github.com/cloudfoundry-incubator/quarks-operator/releases/tag/v7.2.2-0.g20bcb4c

univ0298 commented 3 years ago

@manno I don't think this is working. What I'm seeing is that if I set the terminationGracePeriod high enough to allow the drains to complete, the drains do complete, but the pod keeps running and is only terminated once the grace period is exhausted. So instead of detecting that all drains are complete, we now wait forever (well, until the grace period limit). Is there any way I can try to debug this?
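
(One way to observe this, assuming standard kubectl and using placeholder pod/container names:)

```bash
kubectl delete pod garden-0 --wait=false           # trigger termination (pod name assumed)
kubectl exec garden-0 -c garden -- ls -latR /mnt/  # inspect the shared stamp volume
kubectl get pod garden-0 -w                        # stays Terminating until the grace period expires
```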

Here is what it looks like in the pod after all the drains have ended:

```
/:/var/vcap/jobs/garden# ls -latR /mnt/
/mnt/:
total 8
drwxr-xr-x 1 root root 4096 Apr 19 19:29 .
drwxr-xr-x 1 root root 4096 Apr 19 19:29 ..
drwxrwsrwt 2 root adm    40 Apr 19 19:25 drain-stamps

/mnt/drain-stamps:
total 4
drwxr-xr-x 1 root root 4096 Apr 19 19:29 ..
drwxrwsrwt 2 root adm    40 Apr 19 19:25 .
```

univ0298 commented 3 years ago

I discussed this with @manno yesterday. The most obvious issue is a mix-up in the current code: it writes to /mnt/drain-done, but the shared volume is mounted at /tmp/drain-stamps.

However, there are other issues as well; I'm working through them with @manno.
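
In terms of the sketch earlier in the thread, the mix-up boils down to the writer and the volume mount disagreeing on the path (paths taken from this comment; the code is illustrative only):

```bash
# Writer side: stamps land on a path that is NOT the shared volume.
touch /mnt/drain-done/"${CONTAINER_NAME}"

# Mount side: the shared emptyDir is actually at /tmp/drain-stamps, so the
# stamps never show up there and the wait loop spins until
# terminationGracePeriodSeconds is exhausted.
ls -1 /tmp/drain-stamps | wc -l   # stays 0
```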