cloudfoundry-incubator / quarks-operator

BOSH releases deployed on Kubernetes
https://www.cloudfoundry.org/project-quarks/
Apache License 2.0

BOSH drain does not work due to calling the wrong script location #1283

Closed · univ0298 closed this issue 3 years ago

univ0298 commented 3 years ago

Describe the bug This commit from June 2019 added drain support: https://github.com/cloudfoundry-incubator/quarks-operator/commit/870f9a49343567e4fc1d7a6262fba56bfa52fc7c. But it doesn't work, because it calls the wrong location for drain scripts: it looks for /var/vcap/jobs/<jobName>/bin/drain/*, whereas BOSH drain scripts are not placed in a subdirectory like this; each is a single script named drain at /var/vcap/jobs/<jobName>/bin/drain. As a result this entire drain support doesn't work.

To Reproduce Simple steps: kubectl delete pod <your-pod> on any pod that should have a drain, and you will see the drain doesn't happen. For example, kubectl delete pod diego-cell-0 will instantly terminate the cell pod, and all the app instances will go offline without the proper Diego evacuate/drain mechanism that keeps app instances alive without downtime. In practice this means upgrading KubeCF is completely disruptive to users, since the graceful drain of cells never occurs. Other KubeCF jobs also have drain scripts for graceful, non-disruptive maintenance, but the cell is the most impactful and high-profile case.

Expected behavior When a pod is deleted, the preStop hook should call the drain scripts at their correct location instead of the wrong path.

Environment

Additional context Original discussion and discovery of this problem was in KubeCF in this issue: https://github.com/cloudfoundry-incubator/kubecf/issues/1246

Here is the problem code: https://github.com/cloudfoundry-incubator/quarks-operator/blob/v7.1.3/pkg/bosh/bpmconverter/container_factory.go#L611-L650

This code in particular:

    // Setup the job drain script handler.
    drainGlob := filepath.Join(VolumeJobsDirMountPath, jobName, "bin", "drain", "*")

is building a drain location of /var/vcap/jobs/<jobName>/bin/drain/*, but that is NOT where drain scripts are in BOSH. There is no such drain subdirectory; the drain script is actually at /var/vcap/jobs/<jobName>/bin/drain (the script itself is called drain and there is no subdirectory).
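For illustration, here is a minimal, runnable Go sketch of the difference (the constant stands in for the operator's VolumeJobsDirMountPath, and the job name is just an example):

    package main

    import (
        "fmt"
        "path/filepath"
    )

    func main() {
        // Stand-in for quarks-operator's VolumeJobsDirMountPath.
        const volumeJobsDirMountPath = "/var/vcap/jobs"
        jobName := "rep" // example job

        // What the code builds: a glob inside a "drain" subdirectory that BOSH never creates.
        wrong := filepath.Join(volumeJobsDirMountPath, jobName, "bin", "drain", "*")

        // Where the drain script actually lives: the single file bin/drain under the job directory.
        right := filepath.Join(volumeJobsDirMountPath, jobName, "bin", "drain")

        fmt.Println(wrong) // /var/vcap/jobs/rep/bin/drain/*
        fmt.Println(right) // /var/vcap/jobs/rep/bin/drain
    }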

So it seems that this likely has never worked and never been tested or used by anyone.

cf-gitbot commented 3 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/176962714

The labels on this github issue will be updated when the story is started.

manno commented 3 years ago

Uh, there is a lot of history here.

However what you describe sounds a lot like this: https://github.com/cloudfoundry-incubator/quarks-operator/issues/854#issuecomment-615247705

Would it be sufficient to put the drain scripts into that directory on the KubeCF side? Should Quarks not support multiple scripts? Is anyone using this?

Or is it really necessary to implement drain with a new controller?

univ0298 commented 3 years ago

Thanks @manno. I think there are two items mingled together here, so let me try to separate them.

1) There is the actual location of the drain script. cf-operator is simply doing it wrong by calling /var/vcap/jobs/<jobName>/bin/drain/*, because in BOSH I am pretty much certain that the drain must be at /var/vcap/jobs/<jobName>/bin/drain. The docs aren't great, but it's covered here: https://bosh.io/docs/drain/ and, in my experience and from checks on CF-Deployment, that's where they all are. So I don't think there should be any discussion about supporting multiple scripts or moving them to a different location. I think the path just needs to be updated.

2) How to actually make sure Kubernetes allows the drain to complete. This is all about termination grace, finalizers, etc. See https://github.com/cloudfoundry-incubator/kubecf/issues/1246, where I discussed that already and found that the termination grace period seems to accomplish what we need, so I don't know whether anything more is needed along the lines of what you referenced in https://www.pivotaltracker.com/story/show/175084409 (a minimal sketch of the grace-period field follows below).
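For reference, the termination grace period is just the standard pod-spec field terminationGracePeriodSeconds; a minimal sketch with illustrative names and values (nothing here is taken from the KubeCF charts):

    apiVersion: v1
    kind: Pod
    metadata:
      name: diego-cell-example          # illustrative name
    spec:
      # Kubernetes allows up to this many seconds for preStop hooks (the drain
      # wrapper) to run before force-killing the containers; a drain that takes
      # longer than this is cut short.
      terminationGracePeriodSeconds: 1800
      containers:
      - name: rep
        image: example/rep              # placeholder image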

univ0298 commented 3 years ago

Actually @manno, I see there are some circular links above... let me explain.

I opened this issue as a follow-on from kubecf issue 1246 (https://github.com/cloudfoundry-incubator/kubecf/issues/1246), as mentioned in my comment above and in the original description of this issue. If you follow that link you'll see it was that issue which actually became https://www.pivotaltracker.com/story/show/175084409, which you linked to in your comment above! And in there you can see they mentioned using the termination grace period, which I confirmed works and which covers item 2 (how to make Kubernetes actually wait for the drain).

jandubois commented 3 years ago
  1. cf-operator is simply doing it wrong by calling /var/vcap/jobs/<jobName>/bin/drain/* because in BOSH I am pretty much certain that the drain must be at /var/vcap/jobs/<jobName>/bin/drain .

Yes, it is documented at https://bosh.io/docs/drain/#job-configuration:

To add a drain script to a release job:

  1. Create a script with any name in the templates directory of a release job.
  2. In the templates section of the release job spec file, add the script name and the bin/drain directory as a key value pair.

So I agree that this should be fixed in quarks.

Also it is not kubecf that puts the drain scripts there; it is the upstream releases, e.g. https://github.com/cloudfoundry/diego-release/blob/develop/jobs/rep/spec#L8. It is not really reasonable for kubecf to create pre-rendering patches for all releases to rename the drain script location.
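For illustration, the relevant part of such an upstream job spec looks roughly like this (the template file name drain.erb is assumed here for the example):

    ---
    name: rep
    templates:
      # BOSH renders this template to the single file bin/drain inside the
      # job directory; there is no drain/ subdirectory.
      drain.erb: bin/drain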

I have no opinion on the actual implementation of the drain mechanism, as I haven't looked into it.

manno commented 3 years ago

Also it is not kubecf that puts the drain scripts there; it is the upstream releases, e.g. https://github.com/cloudfoundry/diego-release/blob/develop/jobs/rep/spec#L8. It is not really reasonable for kubecf to create pre-rendering patches for all releases to rename the drain script location.

Ah, thanks. That was the piece I was missing. We'll remove the dir support from quarks and just call the script.

univ0298 commented 3 years ago

Even with the drain script path corrected, there are problems.

With BOSH, the lifecycle is that no job is stopped until all jobs have drained. For example, on a Diego cell the garden job will not be stopped until the rep drain has finished. This is critical, as the rep drain won't work if the garden process is stopped.

But with quarks and KubeCF this is not what happens. The garden job has no explicit drain of its own, so quarks/KubeCF shuts it down immediately, even though rep is only just beginning to drain. The overall drain of the cell therefore fails, because rep errors out as a consequence of garden being stopped prematurely.

univ0298 commented 3 years ago

@manno @jandubois any thoughts on this? I guess we could manually add things to the preStop hooks, for example an "is rep still running?" check in the garden container's preStop. But this would have to be done as non-standard additions on top of the BOSH release content. And in theory it should be done in a more complete way, to match BOSH, where no containers end until all preStop hooks (drains) complete.

manno commented 3 years ago

So terminationGracePeriod is not enough?

univ0298 commented 3 years ago

@manno right. terminationGracePeriod only applies to the individual container, as I explained here: https://github.com/cloudfoundry-incubator/quarks-operator/issues/1283#issuecomment-783320962

To match the BOSH behavior, all containers in the pod should stay running until every container's preStop (drain) has completed.

Is it clear?

manno commented 3 years ago

Thanks, I understand now. I think finalizers (which were the original idea behind #175084409) would also not work, as we need to keep the containers running.

I think https://www.pivotaltracker.com/story/show/168130844 is involved here, but it seems the other way around.

Currently all jobs are wrapped by container-run, but that code is unmaintained. It does, however, look like the right place to make the containers wait before killing off the processes.

univ0298 commented 3 years ago

@manno would it be possible to have a new quarks-operator release with at least the drain script location corrected as a first step here?

manno commented 3 years ago

@univ0298 working on it, hopefully today.

univ0298 commented 3 years ago

@manno thanks. Any revised estimate on when it might be possible?

univ0298 commented 3 years ago

I saw v7.2.0 is available with the initial fix, but it's not in the helm repo yet

jandubois commented 3 years ago

I saw v7.2.0 is available with the initial fix, but it's not in the helm repo yet

I don't know about the helm repo, but we have quarks-operator v7.2.1 in kubecf master, and also in the release-2.7 branch for the kubecf v2.7.13 build/tag, which should be released early next week.

univ0298 commented 3 years ago

The helm repo has 7.2.1, so that's great, thanks.

So at this point the initial issue, that drain wasn't called, is fixed. But the remaining issue is that each container in the pod terminates as soon as its own drain has completed, whereas the equivalent in BOSH is that all containers stay alive until all drains have completed.

Should I open this as a follow on issue, or we keep this one?

Also... so far I haven't found anything that will allow for linking the lifecycle of containers to one another in a pod. The only workaround I have found is completely customizing the drain script in one container to check whether another container is still running using custom code. For example in the garden preStop I can keep querying rep to see if rep is still running, and loop until it's not. But of course that's not a generalized solution.

jandubois commented 3 years ago

Should I open this as a follow on issue, or we keep this one?

I would like to see this as a separate issue, as I expect it will again attract a lengthy thread, so let's keep it separate.

Also... so far I haven't found anything that will allow for linking the lifecycle of containers to one another in a pod. The only workaround I have found is completely customizing the drain script in one container to check whether another container is still running using custom code.

Not sure I understand this one: would this be a workaround until all containers wait for all drains to finish? Or why would this still be necessary once drain works correctly? I would have expected that upstream drain scripts should work as-is once the drain wait logic exists.

univ0298 commented 3 years ago

@jandubois let me try to explain quickly here, and then eventually in the new issue.

In BOSH the way it works is:

  1. Every job gets its drain script run. When an individual job's drain finishes, the job's process stays running.
  2. Only after the drains of all the jobs have finished do the jobs get terminated on that server.

In KubeCF/quarks the way it works is:

  1. Every job/container in the pod gets its drain run (now that the first issue is fixed). BUT as soon as an individual job/container's drain completes, that job/container is terminated, even though other jobs/containers in the pod are still draining.

So for example on a Diego cell you have the rep and garden jobs. Typically the rep drain takes a long time (in our case, typically around 10 minutes), as it tries to evacuate user application instances to other cells (to avoid application downtime) before terminating them. But for rep to do this, the garden job MUST stay running on that cell. And of course in quarks/KubeCF it does not stay running, because garden's own drain completes long before rep is finished draining.

So I can work around this by modifying the garden drain script to insert a looping check that waits for rep to no longer be running, as a manual way to tie the garden drain duration/lifecycle to rep's. But this is a hack, to be honest, and it is only good for this one specific scenario. There are various other drains in Cloud Foundry that rely on the BOSH approach of keeping all jobs running until the last drain completes. So really a generalized solution is needed.
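For concreteness, a rough sketch of that hack, appended to the end of the garden drain script (the process name, the use of pgrep, and the poll interval are assumptions for illustration, not taken from any shipped release):

    #!/bin/bash
    # Hypothetical tail of the garden drain script: block until the rep
    # process has exited, so garden keeps running while rep evacuates.
    while pgrep -x rep > /dev/null; do
      sleep 5
    done
    # BOSH drain contract: print the number of seconds to wait (0 = done).
    echo 0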

jandubois commented 3 years ago

So I can work around this by modifying the garden drain script

Ok, that's what I thought. If we can find a way for containers to only be terminated once all drain scripts in the pod have finished, then you would not need any changes to the drain scripts.

The question is how we do this, and maybe this can be done by container-run, or some other mechanism.

We should discuss the details in a new issue.

univ0298 commented 3 years ago

Follow-on issue created: https://github.com/cloudfoundry-incubator/quarks-operator/issues/1297

manno commented 3 years ago

OK, closing this one, since the drain script is now being called.