cloudfoundry-incubator / kubecf

Cloud Foundry on Kubernetes
Apache License 2.0

Evacuate/drain not working on cells #1246

Open gaktive opened 4 years ago

gaktive commented 4 years ago

Got a bug report from Slack; there's a question as to "whether the evacuate/drain behaviour on cells actually works".

I see the evacuation on rep begin, but then it complains it can't talk to garden (presumably the garden container has terminated), and so all the draining just gives up and the apps are killed.

With the latest kubeCF and CFD 13.9, this is seen in the logs:

{"timestamp":"2020-08-20T09:11:40.171443962Z","level":"error","source":"rep","message":"rep.evacuation-cleanup.delete-container.failed-to-delete-garden-container","data":{"error":"Delete \"http://api/containers/58574e5f-3b3b-4526-51c8-a5b1\": dial unix /var/vcap/data/garden/garden.sock: connect: no such file or directory","guid":"58574e5f-3b3b-4526-51c8-a5b1","session":"13.38"}}

which is caused by garden no longer being available to connect to.

jandubois commented 4 years ago

Tagging @univ0298 (this is your bug/question from Slack).

andrew-edgar commented 4 years ago

Are there plans to schedule this to be looked at? It's a big issue if cells are not being drained when they are stopped. I think this used to work in SCF.

andreas-kupries commented 4 years ago

Looking into local reproduction.

Given that the high-level operation (apps are evacuated and moved on loss of a cell) appears to work, how much of a problem is this ticket actually, and how high a priority should it have?

@gaktive @jandubois Thoughts ?

jandubois commented 4 years ago

@andreas-kupries I think this issue is about apps not being properly drained, and not about the fact that they are not being rescheduled.

The expected process for draining a cell would be:

As soon as an app instance quits, the auctioneer will schedule a new instance on one of the still active cells.

Now with kubecf we can just kill the diego-cell pod. This should trigger a drain script that will tell all app instances to shut down. After 30s kubernetes will shut down the container (controlled by terminationGracePeriodSeconds).
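
For context, roughly how this looks on the Kubernetes side; this is only a sketch, assuming the drain is wired up as a preStop hook (the real hook command and rendering are generated by the operator, not shown here):

    # Illustrative pod spec fragment for a diego-cell pod (not the actual
    # kubecf/quarks output). Deleting the pod runs the preStop hook, and
    # Kubernetes waits at most terminationGracePeriodSeconds before the
    # container is force-killed.
    spec:
      terminationGracePeriodSeconds: 30        # the 30s default mentioned above
      containers:
        - name: rep
          lifecycle:
            preStop:
              exec:
                # assumed wiring; in BOSH terms this would run the job's drain script
                command: ["/var/vcap/jobs/rep/bin/drain"]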

I have no idea which part of the draining functionality is implemented.

Maybe @univ0298 or @andrew-edgar can comment on what they expect to see when they kill a diego-cell pod (or are you referring to something else)?

I have no opinion on the priority of this issue until I know what is working and what isn't.

andrew-edgar commented 4 years ago

When a cell drains, the apps are actually started up on a different cell FIRST! So there is no downtime at all.

And the rep will only kill the container when it has heard back from the BBS that the app is started somewhere else.

During this "draining" (which can take a while, up to 10 minutes) we need garden to keep running, so that we stop the apps only after they have been restarted on other cells.

The big issue is that garden is stopped along with the cell, so we definitely have app downtime when a cell is stopped.

This is a HUGE priority, as we should never have app downtime when a cell is evacuating and stopping. If we are going to have app downtime, that is a very big shortcoming compared to the BOSH world.

andrew-edgar commented 4 years ago

Here are the CF docs on how "evacuation" is supposed to work: https://docs.cloudfoundry.org/devguide/deploy-apps/app-lifecycle.html#evacuation

The key is CF recreates the app instances on another VM, waits until they are healthy, and then shuts down the old instances.

So on shutdown, the app needs to start on another cell BEFORE it is shut down on the evacuating cell.

univ0298 commented 4 years ago

Without this working, kubeCF is not viable for production use. Every time maintenance is deployed and a cell has to restart, it would cause major disruption to all apps on all cells. This evacuate/drain is a core feature of CF.

viovanov commented 4 years ago

@univ0298 @andrew-edgar I've filed https://www.pivotaltracker.com/story/show/175084409 as this is a Quarks issue.

We'll try to get to it soon, but help is welcome.

mudler commented 3 years ago

We have merged https://github.com/cloudfoundry-incubator/quarks-operator/pull/1224, which enables setting terminationGracePeriod on instance groups, like the following: https://github.com/cloudfoundry-incubator/quarks-operator/blob/master/docs/examples/bosh-deployment/quarks-gora-termination.yaml#L23. It also tests that the container isn't terminated before the preStop hook is completed (within a defined time window).

This should allow containers to terminate within a defined time window, as suggested in https://www.pivotaltracker.com/story/show/175084409, which could potentially solve the issue by setting the timeout to 10 minutes for the diego-cell instance group.
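
If I read the linked example right, the shape is roughly the following; the exact key names under env should be taken from quarks-gora-termination.yaml itself, this is only a sketch from memory:

    # Sketch of a BOSH-style manifest fragment for quarks-operator; the key
    # path under `env` is an assumption, not verified against the example.
    instance_groups:
      - name: diego-cell
        env:
          bosh:
            agent:
              settings:
                terminationGracePeriodSeconds: 600   # allow up to 10 minutes to drain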

andrew-edgar commented 3 years ago

@mudler there is a property in Diego at least that needs to be used for setting this timeout. ...

https://github.com/cloudfoundry/diego-release/blob/develop/jobs/rep/spec#L49

This is how long to allow "evacuation" to occur before a shutdown of the cell. We should not be looking at hard coding or looking at other settings when setting this "timeout" value.

jandubois commented 3 years ago

> we should not be looking at hard coding or looking at other settings when setting this "timeout" value

Well, the first step would be for you to test that the mechanism works as expected, e.g. by setting the agent setting via custom ops.
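
For example, such a test override via custom ops could look something like this (standard BOSH ops-file syntax; the path mirrors the agent-settings shape from the quarks example above and is an assumption, not a verified path):

    # Hypothetical ops file: raise the grace period for the diego-cell
    # instance group only, to leave room for a full evacuation.
    - type: replace
      path: /instance_groups/name=diego-cell/env/bosh/agent/settings/terminationGracePeriodSeconds?
      value: 600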

Once that is confirmed, we can add some templating magic to kubecf to set the bosh agent setting automatically from the diego.rep property. We will still have to hard-code the default setting from the spec file in the chart because the helm chart doesn't have access to spec defaults; only to properties overridden in cf-deployment.

See https://github.com/cloudfoundry-incubator/kubecf/blob/master/chart/config/jobs.yaml#L23-L40 for an example where we have to do the same thing for the default CC worker instance numbers.

So we would include the default in a config somewhere:

    properties:
      diego-cell:
        rep:
          diego:
            rep:
              evacuation_timeout_in_seconds: 600

and generate the agent setting from that value. It will then pick up any override from the user automatically (assuming the override happens via the property mechanism and not via custom ops).
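
In other words, something along these lines in the chart templates (a sketch only; the actual template structure, value paths, and default handling in kubecf will differ):

    # Hypothetical Helm template fragment: derive the pod's grace period from
    # the diego.rep property so a user override flows through automatically,
    # falling back to the spec default hard-coded in the chart.
    {{- $timeout := dig "diego-cell" "rep" "diego" "rep" "evacuation_timeout_in_seconds" 600 .Values.properties }}
    terminationGracePeriodSeconds: {{ $timeout }}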

univ0298 commented 3 years ago

OK, so I had time to investigate this in detail, and I found that drain is actually completely broken across the whole of KubeCF due to some significant mistakes in the implementation code. It's not just cells that are affected; the drain of every job/container in every instance group is broken.

That's because the following cf-operator code that sets up the drains is simply pointing at the wrong location for drain scripts!! https://github.com/cloudfoundry-incubator/quarks-operator/blob/v7.1.3/pkg/bosh/bpmconverter/container_factory.go#L611-L650

This code in particular:

    // Setup the job drain script handler.
    drainGlob := filepath.Join(VolumeJobsDirMountPath, jobName, "bin", "drain", "*")

is building a drain location of /var/vcap/jobs/<jobName>/bin/drain/*, but that is NOT where drain scripts live in BOSH. There is no drain subdirectory; the drain script is actually at /var/vcap/jobs/<jobName>/bin/drain (the script itself is called drain, and there is no subdirectory).

So it seems that this likely has never worked and never been tested or used by anyone.

Should I open a separate issue in cf-operator for that part?

Regarding the discussion above on the KubeCF side about the rep evacuation_timeout_in_seconds parameter, and the suggestion to use terminationGracePeriod for now (since the rep property is not supported yet): I can confirm that setting terminationGracePeriod was effective in changing how long the preStop hook is allowed to run (I tested this by manually patching the cf-operator error I described above so that a drain which really requires time to execute actually runs). That's good, but as @andrew-edgar and @jandubois discussed above, it should really be done properly via property support, as Jan mentioned:

> Once that is confirmed, we can add some templating magic to kubecf to set the bosh agent setting automatically from the diego.rep property. We will still have to hard-code the default setting from the spec file in the chart because the helm chart doesn't have access to spec defaults; only to properties overridden in cf-deployment.

Will this issue be used to address that?

univ0298 commented 3 years ago

This is the commit that first introduced (or rather, attempted to introduce) drain support, and it seems the issue has been there from the beginning: https://github.com/cloudfoundry-incubator/quarks-operator/commit/870f9a49343567e4fc1d7a6262fba56bfa52fc7c

univ0298 commented 3 years ago

I went ahead and opened an issue on the cf-operator side, since that seems inevitable anyway: https://github.com/cloudfoundry-incubator/quarks-operator/issues/1283

univ0298 commented 3 years ago

Even with the quarks script fixed to correctly call the BOSH drain script, things don't work:

https://github.com/cloudfoundry-incubator/quarks-operator/issues/1283#issuecomment-783320962