gaktive opened this issue 4 years ago
Tagging @univ0298 (this is your bug/question from Slack).
Are there plans to schedule this to be looked at? It's a big issue if cells are not being drained when they are stopped. I think this used to work in SCF.
Looking into local reproduction.
I set sizing.diego_cell.instances: 2 and used stern to capture all the logs in the kubecf namespace.
The test app showed instance guid 308efdd6-9ff4-437c-410b-f3f0 in its web page. I can say that the app was then running as instance 2eb1f09d-295a-476e-7c61-eecc on diego-cell-1. That is all as expected.
Given that the high-level operation (apps are evacuated and moved on loss of a cell) looks to work, how much of a problem is this ticket actually, and how high a priority should it have?
@gaktive @jandubois Thoughts?
@andreas-kupries I think this issue is about apps not being properly drained, and not about the fact that they are not being rescheduled.
The expected process for draining a cell would be:
As soon as an app instance quits, the auctioneer will schedule a new instance on one of the still-active cells.
Now with kubecf we can just kill the diego-cell pod. This should trigger a drain script that tells all app instances to shut down. After 30s Kubernetes will shut down the container (controlled by terminationGracePeriodSeconds).
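To make the mechanics concrete, here is a minimal sketch using standard Kubernetes pod spec fields; the pod name, image, and preStop command are placeholders, not what quarks actually renders:

```yaml
# Minimal sketch, not the actual quarks-rendered pod: terminationGracePeriodSeconds
# bounds how long Kubernetes waits between firing the preStop hook / sending
# SIGTERM and force-killing the container (the default is 30s when unset).
apiVersion: v1
kind: Pod
metadata:
  name: diego-cell-example            # placeholder name
spec:
  terminationGracePeriodSeconds: 600
  containers:
    - name: rep
      image: example/diego-cell:latest   # placeholder image
      lifecycle:
        preStop:
          exec:
            # placeholder command; the real hook would run the BOSH drain scripts
            command: ["/bin/sh", "-c", "echo run drain scripts here"]
```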
I have no idea which part of the draining functionality is implemented.
Maybe @univ0298 or @andrew-edgar can comment on what they expect to see when they kill a diego-cell pod (or are you referring to something else)?
I have no opinion on the priority of this issue until I know what is working and what isn't.
When a cell drains, the app instances are actually started up on a different cell FIRST! So there is no downtime at all.
And the rep will only kill a container once it has heard back from the BBS that the app is started somewhere else.
During this "draining" (which can take a while, up to 10 minutes) we need garden to still be running, so that we stop the apps only after they have been restarted on other cells.
The big issue is that right now garden is stopped along with the cell, so we definitely have app downtime when a cell is stopped.
This is a HUGE priority, as we should never have app downtime when a cell is evacuating and stopping. If we are going to have app downtime, that is a very big shortcoming compared to the BOSH world.
Here are the CF docs about how "evacuation" is supposed to work: https://docs.cloudfoundry.org/devguide/deploy-apps/app-lifecycle.html#evacuation
The key point is that CF recreates the app instances on another VM, waits until they are healthy, and then shuts down the old instances.
So on shutdown the app needs to start on another cell BEFORE it is shut down on the evacuating cell.
Without this working, KubeCF is not viable for production use. Every time maintenance is deployed and a cell has to restart, it would cause major disruption to all apps on all cells. This evacuate/drain behaviour is a core feature of CF.
@univ0298 @andrew-edgar I've filed https://www.pivotaltracker.com/story/show/175084409 as this is a Quarks issue.
We'll try to get to it soon, but help is welcome.
We have merged https://github.com/cloudfoundry-incubator/quarks-operator/pull/1224, which makes it possible to set terminationGracePeriod on instance groups, as in the following example: https://github.com/cloudfoundry-incubator/quarks-operator/blob/master/docs/examples/bosh-deployment/quarks-gora-termination.yaml#L23. It also tests that the container isn't terminated before the preStop hook has completed (within a defined time window).
This should allow containers to terminate within a defined time window, as suggested in https://www.pivotaltracker.com/story/show/175084409, which could potentially solve the issue by setting the timeout for the diego-cell instance group to 10 minutes.
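For reference, a rough sketch of what that setting could look like in the BOSH manifest consumed by quarks-operator; the env.bosh.agent.settings path and the exact key name are assumptions on my side and should be checked against the quarks-gora-termination.yaml example linked above:

```yaml
instance_groups:
  - name: diego-cell
    instances: 2
    env:
      bosh:
        agent:
          settings:
            # assumed key name; verify against the quarks-gora example
            terminationGracePeriodSeconds: 600
```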
@mudler there is a property in Diego at least that needs to be used for setting this timeout. ...
https://github.com/cloudfoundry/diego-release/blob/develop/jobs/rep/spec#L49
This is how long to allow "evacuation" to occur before the cell shuts down. We should not be hard-coding this value or looking at other settings when setting this "timeout".
Well, the first step would be for you to test that the mechanism works as expected, e.g. by setting the agent setting via custom ops.
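A minimal sketch of such a test, assuming the agent setting is exposed under the instance group's env.bosh.agent.settings path (hypothetical key name, to be verified against the quarks-operator example above) and that the ops file is fed to kubecf through its custom operations mechanism:

```yaml
# Hypothetical ops file for a quick manual test; the settings key below is an
# assumption and needs to be checked against the quarks-operator docs/example.
- type: replace
  path: /instance_groups/name=diego-cell/env?/bosh/agent/settings/terminationGracePeriodSeconds
  value: 600
```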
Once that is confirmed, we can add some templating magic to kubecf to set the bosh agent setting automatically from the diego.rep property. We will still have to hard-code the default setting from the spec file in the chart because the helm chart doesn't have access to spec defaults; only to properties overridden in cf-deployment.
See https://github.com/cloudfoundry-incubator/kubecf/blob/master/chart/config/jobs.yaml#L23-L40 for an example where we have to do the same thing for the default CC worker instance numbers.
So we would include the default in a config somewhere:
```yaml
properties:
  diego-cell:
    rep:
      diego:
        rep:
          evacuation_timeout_in_seconds: 600
```
and generate the agent setting from that value. It will then pick up any override from the user automatically (assuming the override happens via the property mechanism and not via custom ops).
OK, so I had time to investigate this in detail, and I found that drain is actually completely broken across the whole of KubeCF due to some significant mistakes in the implementation code. It's not just cells that are affected: the drain of every job/container in every instance group is broken.
That's because the following cf-operator code that sets up the drains is simply pointing at the wrong location for drain scripts: https://github.com/cloudfoundry-incubator/quarks-operator/blob/v7.1.3/pkg/bosh/bpmconverter/container_factory.go#L611-L650
This code in particular:
// Setup the job drain script handler.
drainGlob := filepath.Join(VolumeJobsDirMountPath, jobName, "bin", "drain", "*")
is building a drain location of /var/vcap/jobs/<jobName>/bin/drain/*, but that is NOT where drain scripts are in BOSH. There is no drain subdirectory; instead, the drain script is the file /var/vcap/jobs/<jobName>/bin/drain (the script itself is called drain, and there is no subdirectory).
So it seems that this has likely never worked and has never been tested or used by anyone.
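For illustration only, a hedged sketch of what a correct preStop drain invocation would need to look like once the handler points at the real BOSH layout (the job name rep is just an example; this is not the actual quarks-generated hook):

```yaml
lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - |
          # BOSH renders a single executable at bin/drain; there is no bin/drain/ directory
          drain=/var/vcap/jobs/rep/bin/drain
          if [ -x "$drain" ]; then
            exec "$drain"
          fi
```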
Should I open a separate issue in cf-operator for that part?
With regards to the discussion above on the KubeCF side about the rep evacuation_timeout_in_seconds parameter, and the suggestion to use terminationGracePeriod for now (since the rep property is not supported yet): I can confirm that setting terminationGracePeriod was effective in changing how long the preStop hook is allowed to run (I tested this by manually patching the cf-operator error described above so that it runs a drain that genuinely requires time to execute). That's good, but as @andrew-edgar and @jandubois discussed above, it should really be done properly via property support, as Jan mentioned:
> Once that is confirmed, we can add some templating magic to kubecf to set the bosh agent setting automatically from the diego.rep property. We will still have to hard-code the default setting from the spec file in the chart because the helm chart doesn't have access to spec defaults; only to properties overridden in cf-deployment.
Will this issue be used to address that?
This is the commit that first introduced (or rather, attempted to introduce) drain support, and it seems the issue has been there from the beginning: https://github.com/cloudfoundry-incubator/quarks-operator/commit/870f9a49343567e4fc1d7a6262fba56bfa52fc7c
I went ahead and opened an issue on the cf-operator side, as that seems like a good idea in any case: https://github.com/cloudfoundry-incubator/quarks-operator/issues/1283
Even with the quarks script fixed to correctly call the BOSH drain script, things don't work:
https://github.com/cloudfoundry-incubator/quarks-operator/issues/1283#issuecomment-783320962
Got a bug report from Slack; there's a question as to "whether the evacuate/drain behaviour on cells actually works".
With the latest KubeCF (with cf-deployment 13.9), this is seen in the logs:
{"timestamp":"2020-08-20T09:11:40.171443962Z","level":"error","source":"rep","message":"rep.evacuation-cleanup.delete-container.failed-to-delete-garden-container","data":{"error":"Delete \"http://api/containers/58574e5f-3b3b-4526-51c8-a5b1\": dial unix /var/vcap/data/garden/garden.sock: connect: no such file or directory","guid":"58574e5f-3b3b-4526-51c8-a5b1","session":"13.38"}}
which is caused by garden no longer being available to connect to.