cloudfoundry-incubator / quarks-operator

BOSH releases deployed on Kubernetes
https://www.cloudfoundry.org/project-quarks/
Apache License 2.0

Make sure terminated containers are considered as already drained #1308

Closed jandubois closed 3 years ago

jandubois commented 3 years ago

As @univ0298 points out in Slack, the drain script support from #1302 is still incomplete:

When the main process of a container exits, the preStop script will not regain control because there is nothing left to do: the container is already stopped. This is "documented" in "Add e2e to verify blocking behavior of preStop hook" and "PreStop process is not completed when within termination grace period".

PreStop doesn't block a container from exiting or crashing. If the main process exits, that's game over.

My suggestion to work around this issue is to explicitly set the stamps for all terminated containers before counting how many drain scripts have finished. Proof of concept code:

#!/usr/bin/env bash

NS=kubecf
POD=diego-cell-0

# Fetch the pod once and list every container that has already terminated.
JSON=$(kubectl get pods -n "${NS}" "${POD}" -o=json)
CONTAINERS=$(echo "${JSON}" |
                   jq -r '.status.containerStatuses[]
                        | select(.state.terminated).name')

# For each terminated container, scrape the stamp filename out of its
# preStop command and create the stamp on its behalf.
for CONTAINER in ${CONTAINERS}; do
    STAMP=$(echo "${JSON}" |
                  jq -r ".spec.containers[]
                       | select(.name==\"${CONTAINER}\").lifecycle
                       | select(.preStop).preStop.exec.command[]
                       | capture(\"(?<a>/mnt/drain-stamps/.*)\n\").a")
    if [ -n "${STAMP}" ]; then
        touch "${STAMP}"
    fi
done

This generates a list of all containers whose status has a state.terminated field. I then scrape the actual stamp filename out of the preStop commands and make sure the stamp file exists, creating stamps for those containers that didn't get a chance to do it themselves.

I've tested the script by replacing the condition above with state.running and using echo instead of touch:
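Roughly, these are the two lines I changed for the test (a sketch, not an exact diff):

CONTAINERS=$(echo "${JSON}" |
                   jq -r '.status.containerStatuses[]
                        | select(.state.running).name')   # was: .state.terminated
# ... and inside the loop:
        echo "${STAMP}"                                    # was: touch "${STAMP}"

Running the modified script then prints: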

$ ./drain-stamp.sh
/mnt/drain-stamps/garden
/mnt/drain-stamps/loggr-forwarder-agent
/mnt/drain-stamps/loggr-syslog-agent
/mnt/drain-stamps/loggr-udp-forwarder
/mnt/drain-stamps/loggregator_agent
/mnt/drain-stamps/netmon
/mnt/drain-stamps/prom_scraper
/mnt/drain-stamps/rep
/mnt/drain-stamps/route_emitter
/mnt/drain-stamps/silk-daemon
/mnt/drain-stamps/vxlan-policy-agent

I know that we still have other issues with drain scripts, but this should get us one step closer.

@manno What do you think? It feels quite hacky, but on the other hand, it should be pretty straightforward to add; everything else I can think of is much more complicated.

Remaining issues: It looks like we don't currently include kubectl in the images, so we need to figure out how to make that part work.

And I haven't checked if the service account has a role binding that allows get pod access from within the containers.
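If bundling kubectl turns out to be a problem, one possible fallback (just a sketch, assuming curl and jq are available in the image and the RBAC question above gets resolved) would be to call the API server directly, using the service account credentials that Kubernetes mounts into every container:

#!/usr/bin/env bash
# Sketch only: fetch the pod JSON without kubectl. NS and POD are the same
# hypothetical values as in the proof of concept above.
NS=kubecf
POD=diego-cell-0
SA=/var/run/secrets/kubernetes.io/serviceaccount

JSON=$(curl -s --cacert "${SA}/ca.crt" \
            -H "Authorization: Bearer $(cat "${SA}/token")" \
            "https://kubernetes.default.svc/api/v1/namespaces/${NS}/pods/${POD}")
# ${JSON} could then go through the same jq pipeline as in the script above.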

cf-gitbot commented 3 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/178014456

The labels on this github issue will be updated when the story is started.

jandubois commented 3 years ago

set the stamps for all terminated containers explicitly before counting how many drain scripts have finished

What I mean by this is inserting the logic from the script above in the waitExit function at this line just after the sleep 5 statement.
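To make that concrete, here is a rough sketch of where it would slot in. The loop below is only a stand-in for the real waitExit, and the helper names (all_stamps_present, create_stamps_for_terminated_containers) are placeholders, not existing functions:

waitExit() {
    # Hypothetical shape of the existing loop: keep waiting while not all
    # drain stamps are present yet.
    while ! all_stamps_present; do
        sleep 5
        # New step: touch the stamps of containers that have already
        # terminated (the logic from the proof-of-concept script above),
        # so they are counted as drained on the next iteration.
        create_stamps_for_terminated_containers
    done
}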

manno commented 3 years ago

I would prefer if the drain scripts were patched to behave better on Kubernetes. We could put a wrapper around rep to keep it from terminating.
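For illustration, such a wrapper could look roughly like this (a sketch only; how rep's real exit codes and signals should be handled would still need to be worked out):

#!/usr/bin/env bash
# Hypothetical wrapper: start rep, but keep the container alive until we
# are explicitly told to stop, so a crashed rep does not end the container
# before the drain logic has had a chance to run.
rep "$@" &
REP_PID=$!

# On SIGTERM, pass the signal on to rep (if still running) and exit.
trap 'kill -TERM "${REP_PID}" 2>/dev/null; wait "${REP_PID}"; exit 0' TERM

# Wait for rep; if it exits on its own, keep idling until SIGTERM arrives.
wait "${REP_PID}"
while true; do sleep 3600 & wait $!; done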

These problems come to mind when adding cluster inspection to the container 1 drain script:

The drain script is already doing a lot, and it feels wrong to put more logic there. We could implement this in Go to get rid of the dependencies on jq/kubectl, but how would that work? Would we add another sidecar container (with the operator image) and run the new preStop hook there?

manno commented 3 years ago

I don't think we can securely add get/list pods RBAC to the service account used by the pods.

If we need to check cluster state, we will need to add this to the operator somehow (i.e. a pod with more access).

The first thing that comes to my mind is switching to HTTP preStop hooks. We'd need a small API server to implement the checks.

jandubois commented 3 years ago

This approach does not work because Kubernetes never switches a container's status to terminated when the main process in the container exits while the PreStop hook is still running. Instead, the hook is killed and exits with code 137, but the container status remains running until the grace period expires, even though there are no processes left running inside the container.

We'll look into an alternate approach by modifying container-run to not exit until it receives a SIGTERM signal.

univ0298 commented 3 years ago

Yes, that's a good description of why this type of exit will not work, thanks.