gaktive opened this issue 4 years ago
Tagging @univ0298 (this is your bug/question from Slack).
Are there plans to schedule this to be looked at? It's a big issue if cells are not being drained when they are stopped. I think this used to work in SCF.
Looking into local reproduction.
I set sizing.diego_cell.instances: 2 and used stern to capture all the logs in the kubecf namespace.
The test app showed instance guid 308efdd6-9ff4-437c-410b-f3f0 in its web page. I can say that the app was then running as instance 2eb1f09d-295a-476e-7c61-eecc on diego-cell-1. That is all as expected.
Given that the high-level operation (apps are evacuated and moved on loss of a cell) looks to work, how much of a problem is this ticket actually, and how high a priority should it have?
@gaktive @jandubois Thoughts?
@andreas-kupries I think this issue is about apps not being properly drained, and not about the fact that they are not being rescheduled.
The expected process for draining a cell would be:
As soon as an app instance quits, the auctioneer will schedule a new instance on one of the still-active cells.
Now with kubecf we can just kill the diego-cell pod. This should trigger a drain script that tells all app instances to shut down. After 30s Kubernetes will shut down the container (controlled by terminationGracePeriodSeconds).
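To make the mechanics concrete, here is a minimal sketch using standard Kubernetes pod spec fields; the pod name, image, and preStop command are placeholders, not what quarks actually renders:

```yaml
# Minimal sketch, not the actual quarks-rendered pod: terminationGracePeriodSeconds
# bounds how long Kubernetes waits between firing the preStop hook / sending
# SIGTERM and force-killing the container (the default is 30s when unset).
apiVersion: v1
kind: Pod
metadata:
  name: diego-cell-example            # placeholder name
spec:
  terminationGracePeriodSeconds: 600
  containers:
    - name: rep
      image: example/diego-cell:latest   # placeholder image
      lifecycle:
        preStop:
          exec:
            # placeholder command; the real hook would run the BOSH drain scripts
            command: ["/bin/sh", "-c", "echo run drain scripts here"]
```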
I have no idea which part of the draining functionality is implemented.
Maybe @univ0298 or @andrew-edgar can comment on what they expect to see when they kill a diego-cell pod (or are you referring to something else)?
I have no opinion on the priority of this issue until I know what is working and what isn't.
When a cell drains, the app instances are actually started up on a different cell FIRST! So there is no downtime at all.
And the rep will only kill a container once it has heard back from the BBS that the app is started somewhere else.
During this "draining" (which can take a while, up to 10 minutes) we need garden to still be running, so that we stop the apps only after they have been restarted on other cells.
The big issue is that right now garden is stopped along with the cell, so we definitely have app downtime when a cell is stopped.
This is a HUGE priority, as we should never have app downtime when a cell is evacuating and stopping. If we are going to have app downtime, that is a very big shortcoming compared to the BOSH world.
Here are the CF docs about how "evacuation" is supposed to work: https://docs.cloudfoundry.org/devguide/deploy-apps/app-lifecycle.html#evacuation
The key point is that CF recreates the app instances on another VM, waits until they are healthy, and then shuts down the old instances.
So on shutdown the app needs to start on another cell BEFORE it is shut down on the evacuating cell.
Without this working, KubeCF is not viable for production use. Every time maintenance is deployed and a cell has to restart, it would cause major disruption to all apps on all cells. This evacuate/drain behaviour is a core feature of CF.
@univ0298 @andrew-edgar I've filed https://www.pivotaltracker.com/story/show/175084409 as this is a Quarks issue.
We'll try to get to it soon, but help is welcome.
We have merged https://github.com/cloudfoundry-incubator/quarks-operator/pull/1224, which makes it possible to set terminationGracePeriod on instance groups, as in the following example: https://github.com/cloudfoundry-incubator/quarks-operator/blob/master/docs/examples/bosh-deployment/quarks-gora-termination.yaml#L23. It also tests that the container isn't terminated before the preStop hook has completed (within a defined time window).
This should allow containers to terminate within a defined time window, as suggested in https://www.pivotaltracker.com/story/show/175084409, which could potentially solve the issue by setting the timeout for the diego-cell instance group to 10 minutes.
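For reference, a rough sketch of what that setting could look like in the BOSH manifest consumed by quarks-operator; the env.bosh.agent.settings path and the exact key name are assumptions on my side and should be checked against the quarks-gora-termination.yaml example linked above:

```yaml
instance_groups:
  - name: diego-cell
    instances: 2
    env:
      bosh:
        agent:
          settings:
            # assumed key name; verify against the quarks-gora example
            terminationGracePeriodSeconds: 600
```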
@mudler there is a property in Diego at least that needs to be used for setting this timeout. ...
https://github.com/cloudfoundry/diego-release/blob/develop/jobs/rep/spec#L49
This is how long to allow "evacuation" to occur before the cell shuts down. We should not be hard-coding this value or looking at other settings when setting this "timeout".
Well, the first step would be for you to test that the mechanism works as expected, e.g. by setting the agent setting via custom ops.
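A minimal sketch of such a test, assuming the agent setting is exposed under the instance group's env.bosh.agent.settings path (hypothetical key name, to be verified against the quarks-operator example above) and that the ops file is fed to kubecf through its custom operations mechanism:

```yaml
# Hypothetical ops file for a quick manual test; the settings key below is an
# assumption and needs to be checked against the quarks-operator docs/example.
- type: replace
  path: /instance_groups/name=diego-cell/env?/bosh/agent/settings/terminationGracePeriodSeconds
  value: 600
```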
Once that is confirmed, we can add some templating magic to kubecf to set the bosh agent setting automatically from the diego.rep property. We will still have to hard-code the default setting from the spec file in the chart because the helm chart doesn't have access to spec defaults; only to properties overridden in cf-deployment.
See https://github.com/cloudfoundry-incubator/kubecf/blob/master/chart/config/jobs.yaml#L23-L40 for an example where we have to do the same thing for the default CC worker instance numbers.
So we would include the default in a config somewhere:
```yaml
properties:
  diego-cell:
    rep:
      diego:
        rep:
          evacuation_timeout_in_seconds: 600
```
and generate the agent setting from that value. It will then pick up any override from the user automatically (assuming the override happens via the property mechanism and not via custom ops).
OK, so I had time to investigate this in detail, and I found that drain is actually completely broken across the whole of KubeCF due to some significant mistakes in the implementation code. It's not just cells that are affected: the drain of every job/container in every instance group is broken.
That's because the following cf-operator code that sets up the drains is simply pointing at the wrong location for drain scripts: https://github.com/cloudfoundry-incubator/quarks-operator/blob/v7.1.3/pkg/bosh/bpmconverter/container_factory.go#L611-L650
This code in particular:
// Setup the job drain script handler.
drainGlob := filepath.Join(VolumeJobsDirMountPath, jobName, "bin", "drain", "*")
is building a drain location of /var/vcap/jobs/<jobName>/bin/drain/*, but that is NOT where drain scripts are in BOSH. There is no drain subdirectory; instead, the drain script is the file /var/vcap/jobs/<jobName>/bin/drain (the script itself is called drain, and there is no subdirectory).
So it seems that this has likely never worked and has never been tested or used by anyone.
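For illustration only, a hedged sketch of what a correct preStop drain invocation would need to look like once the handler points at the real BOSH layout (the job name rep is just an example; this is not the actual quarks-generated hook):

```yaml
lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - |
          # BOSH renders a single executable at bin/drain; there is no bin/drain/ directory
          drain=/var/vcap/jobs/rep/bin/drain
          if [ -x "$drain" ]; then
            exec "$drain"
          fi
```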
Should I open a separate issue in cf-operator for that part?
With regards to the discussion above on the KubeCF side about the rep evacuation_timeout_in_seconds parameter, and the suggestion to use terminationGracePeriod for now (since the rep property is not supported yet): I can confirm that setting terminationGracePeriod was effective in changing how long the preStop hook is allowed to run (I tested this by manually patching the cf-operator error described above so that it runs a drain that genuinely requires time to execute). That's good, but as @andrew-edgar and @jandubois discussed above, it should really be done properly via property support, as Jan mentioned:
> Once that is confirmed, we can add some templating magic to kubecf to set the bosh agent setting automatically from the diego.rep property. We will still have to hard-code the default setting from the spec file in the chart because the helm chart doesn't have access to spec defaults; only to properties overridden in cf-deployment.
Will this issue be used to address that?
This is the commit that first introduced (or rather, attempted to introduce) drain support, and it seems the issue has been there from the beginning: https://github.com/cloudfoundry-incubator/quarks-operator/commit/870f9a49343567e4fc1d7a6262fba56bfa52fc7c
I went ahead and opened an issue on the cf-operator side, as that seems like a good idea in any case: https://github.com/cloudfoundry-incubator/quarks-operator/issues/1283
Even with the quarks script fixed to correctly call the BOSH drain script, things don't work:
https://github.com/cloudfoundry-incubator/quarks-operator/issues/1283#issuecomment-783320962
Got a bug report from Slack; there's a question as to "whether the evacuate/drain behaviour on cells actually works".
With the latest KubeCF (with cf-deployment 13.9), this is seen in the logs:
{"timestamp":"2020-08-20T09:11:40.171443962Z","level":"error","source":"rep","message":"rep.evacuation-cleanup.delete-container.failed-to-delete-garden-container","data":{"error":"Delete \"http://api/containers/58574e5f-3b3b-4526-51c8-a5b1\": dial unix /var/vcap/data/garden/garden.sock: connect: no such file or directory","guid":"58574e5f-3b3b-4526-51c8-a5b1","session":"13.38"}}
which is caused by garden no longer being available to connect to.