cloudfoundry-incubator / kubecf

Cloud Foundry on Kubernetes
Apache License 2.0

Many Cloud Foundry drain scripts do not work in KubeCF due to no monit and no bpm #1728

Open univ0298 opened 3 years ago

univ0298 commented 3 years ago

Describe the bug

After recent fixes were completed in quarks to call drain scripts with the correct lifecycle, it has been discovered that many of these drain scripts do not work because their logic depends on monit or bpm.

For example, some drain scripts look for the PID file of the job they are trying to drain; that PID file is written by monit or bpm in a BOSH environment but is missing in KubeCF. Other drain scripts try to use bpm commands to control the drain/stop of their jobs, but the bpm command is missing.

Some specific examples are:

rep (Diego-cell): https://github.com/cloudfoundry/diego-release/blob/master/jobs/rep/templates/drain.erb#L49-L53

<% if p("bpm.enabled") %>
if ! /var/vcap/jobs/bpm/bin/bpm pid rep >/dev/null 2>&1; then
  echo "$(date +%Y-%m-%dT%H:%M:%S.%sZ): rep process not running"
  exit 0
fi

The above code causes the rep drain script to exit immediately with "rep process not running", because the bpm job is not deployed and therefore /var/vcap/jobs/bpm/bin/bpm does not exist.
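For illustration only, one way this particular check could be patched for KubeCF is to look for the rep process directly instead of asking bpm; the pgrep invocation and the rep binary path below are assumptions, not a verified fix:

# Hypothetical KubeCF-specific patch for the excerpt above: detect the rep
# process without bpm. The binary path is an assumption for illustration.
if ! pgrep -f /var/vcap/packages/rep/bin/rep >/dev/null 2>&1; then
  echo "$(date +%Y-%m-%dT%H:%M:%S.%sZ): rep process not running"
  exit 0
fi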

GoRouter: https://github.com/cloudfoundry/routing-release/blob/develop/jobs/gorouter/templates/drain.erb#L11-L20

if [[ ! -f ${pidfile} ]]; then

  if [ "$log_format" == "deprecated" ]; then
    echo "$(date): pidfile does not exist" >> ${logfile}
  else
    echo "$(date +%Y-%m-%dT%H:%M:%S.%NZ): pidfile does not exist" >> ${logfile}
  fi
  echo 0
  exit 0
fi

The above code causes the gorouter drain script to exit immediately because there is no pidfile at /var/vcap/sys/run/bpm/gorouter/gorouter.pid, so this logic bails out right away. That causes significant disruption because in-flight gorouter requests cannot complete through a graceful drain/quiesce. In this example it might be relatively simple to have a PID file written so that the logic proceeds as it would under BOSH (a possible sketch follows).
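As a sketch only: whatever wrapper starts the process in the container (e.g. container-run) could write the pid file at the location the drain script expects before exec'ing the real binary, so the process keeps that pid. The gorouter binary and config paths below are assumptions:

#!/bin/bash
# Hypothetical start wrapper: write the pid file monit would normally own,
# then exec the real process so it retains the recorded pid.
set -e
pidfile=/var/vcap/sys/run/bpm/gorouter/gorouter.pid
mkdir -p "$(dirname "${pidfile}")"
echo $$ > "${pidfile}"
exec /var/vcap/packages/gorouter/bin/gorouter \
  -c /var/vcap/jobs/gorouter/config/gorouter.yml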

Auctioneer: https://github.com/cloudfoundry/diego-release/blob/develop/jobs/auctioneer/templates/drain.erb#L6

/var/vcap/jobs/bpm/bin/bpm stop auctioneer 1>&2 || true

In this case it's a simple bpm stop, but even that doesn't work because bpm is missing. The bpm stop behavior is to send a SIGTERM, wait 15 seconds (https://github.com/cloudfoundry/bpm-release/blob/master/src/bpm/commands/stop.go#L27), and then KILL the container. Kubernetes container shutdown behaves similarly, but it isn't identical: it also sends SIGTERM, but then waits for terminationGracePeriodSeconds, which is set at the pod scope, not per container, and in many cases it will have to be a lot higher than 15 seconds. Nevertheless, in most practical situations I think a simple bpm stop (like the auctioneer has) will work the same way in KubeCF as long as terminationGracePeriodSeconds has been set to 15 or more seconds on the pod. https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination
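For reference, the pod-level grace period can be checked with kubectl; the pod and namespace names here are just examples of a typical KubeCF layout:

# Inspect the grace period that replaces bpm stop's fixed 15s SIGTERM->KILL window.
# "scheduler-0" (the instance group that runs auctioneer) and the "kubecf"
# namespace are assumptions about the deployment layout.
kubectl get pod scheduler-0 -n kubecf \
  -o jsonpath='{.spec.terminationGracePeriodSeconds}{"\n"}'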

Cloud Controller: https://github.com/cloudfoundry/capi-release/blob/develop/jobs/cloud_controller_ng/templates/drain.sh.erb

for i in {1..<%=p("cc.jobs.local.number_of_workers")%>}; do
  /var/vcap/jobs/bpm/bin/bpm stop cloud_controller_ng -p "local_worker_${i}" 1>&2
done

/var/vcap/jobs/cloud_controller_ng/bin/shutdown_drain 1>&2

In this case it's partially similar to the auctioneer in that it uses bpm stop for the local worker processes. But the standard Kubernetes behavior will not be equivalent this time, because the script stops different processes in a specific sequence that Kubernetes would not reproduce. Specifically, this drain script stops the local worker processes one by one in the for loop. Then, also very importantly, it triggers a custom drain script for the main cloud controller: https://github.com/cloudfoundry/capi-release/blob/develop/jobs/cloud_controller_ng/templates/shutdown_drain.rb.erb and this is critical for the Cloud Controller to gracefully quiesce its work. It seems therefore that a replacement drain script will be needed here that accomplishes those key differences: (i) sequentially stopping the local workers first, and (ii) running the custom drain script shutdown_drain. A rough sketch of such a replacement follows below.
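Purely as an illustration of those two steps (not an agreed fix), a replacement could look something like this; the worker count and pid file locations follow the naming convention listed later in this issue and are assumptions:

#!/bin/bash
# Hypothetical KubeCF replacement drain for cloud_controller_ng:
# (i) stop the local workers one by one, (ii) run the custom shutdown_drain.
set -e

NUMBER_OF_WORKERS=2   # would come from p("cc.jobs.local.number_of_workers")

for i in $(seq 1 "${NUMBER_OF_WORKERS}"); do
  pidfile="/var/vcap/sys/run/bpm/cloud_controller_ng/local_worker_${i}.pid"
  if [[ -f "${pidfile}" ]]; then
    pid="$(cat "${pidfile}")"
    kill -TERM "${pid}" 1>&2 || true
    # Wait up to ~15s (mirroring bpm stop) for the worker to exit.
    for _ in $(seq 1 15); do
      kill -0 "${pid}" 2>/dev/null || break
      sleep 1
    done
  fi
done

# Critical step: let the main cloud controller quiesce its in-flight work.
/var/vcap/jobs/cloud_controller_ng/bin/shutdown_drain 1>&2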

Cloud Controller Worker https://github.com/cloudfoundry/capi-release/blob/develop/jobs/cloud_controller_worker/templates/drain.sh.erb This one is a simpler equivalent of the Cloud Controller one: it only needs the sequential stop behavior, with no custom drain beyond that.

To Reproduce

Deploy KubeCF with terminationGracePeriodSeconds set high enough that drains have time to run. Then delete pods and watch the various drains fail; in-flight workloads see disruption as a result.
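One illustrative way to observe the router disruption (the app hostname and the pod/namespace names are assumptions about the environment):

# Keep issuing requests to an app mapped on this environment while the
# router pod is deleted; without a working drain some requests fail.
while true; do
  curl -s -o /dev/null -w '%{http_code}\n' https://app.example.com/
  sleep 0.2
done &

kubectl delete pod router-0 -n kubecf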

Expected behavior

Drains should behave the same as in BOSH.

Environment

Additional context

Quarks Operator fix that corrected the drain lifecycle: https://github.com/cloudfoundry-incubator/quarks-operator/pull/1302

univ0298 commented 3 years ago

Note that the cloud controller drain referenced above is actually from a newer CF deployment (CAPI, specifically) version than the latest KubeCF has, so it's not yet an issue.

The gorouter drain is already a problem in the current release of KubeCF.

univ0298 commented 3 years ago

I discussed pid and pidfile references in drain scripts with @jandubois in more detail. We have the example from the rep job, where the drain script tries to get the pid using bpm (/var/vcap/jobs/bpm/bin/bpm pid rep), and the gorouter example, where the drain script looks for the pid in a pidfile that monit writes in a BOSH environment, because the monitrc file for that job defines the pidfile location and the drain looks in the matching place: pidfile=/var/vcap/sys/run/bpm/gorouter/gorouter.pid (defined in the monitrc file: https://github.com/cloudfoundry/routing-release/blob/develop/jobs/gorouter/monit#L2).

It would be helpful if container-run could assist by writing such pid files, and if bpm pid could be implemented. That would minimize (perhaps reduce to zero) the drain script patching needed. There is the concern that the pid file location in a monitrc file is not guaranteed (so container-run can't be sure), but any deviations from the standard convention could be handled by drain script patches on an as-needed basis. A rough sketch of the bpm pid idea follows below.
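To make that concrete, purely as a sketch and not an agreed design: a minimal shim installed at /var/vcap/jobs/bpm/bin/bpm could translate bpm pid <job> [-p <process>] into a lookup of the conventional pid file (the convention is listed below); everything about this shim is hypothetical:

#!/bin/bash
# Hypothetical bpm shim for KubeCF: supports only "bpm pid <job> [-p <process>]"
# by reading the conventional pid file (written by container-run or a wrapper).
cmd="$1"; job="$2"; process="${4:-$2}"   # "-p NAME" arrives as $3/$4 when present

if [[ "${cmd}" != "pid" ]]; then
  echo "unsupported bpm command in this shim: ${cmd}" >&2
  exit 1
fi

pidfile="/var/vcap/sys/run/bpm/${job}/${process}.pid"
if [[ -f "${pidfile}" ]] && kill -0 "$(cat "${pidfile}")" 2>/dev/null; then
  cat "${pidfile}"
else
  exit 1
fi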

I also reviewed our BOSH environment for monitrc file pidfile references, and this is what I found. It basically follows a consistent naming format (with bpm added to the path when bpm is used); see the sketch after the list:

/var/vcap/sys/run/bpm/loggregator_agent/loggregator_agent.pid
/var/vcap/sys/run/bpm/loggr-udp-forwarder/loggr-udp-forwarder.pid
/var/vcap/sys/run/bpm/gorouter/gorouter.pid
/var/vcap/sys/run/bosh-dns/bosh-dns.pid
/var/vcap/sys/run/bosh-dns/bosh_dns_resolvconf.pid
/var/vcap/sys/run/bosh-dns/bosh_dns_health.pid
/var/vcap/sys/run/bpm/loggr-syslog-agent/loggr-syslog-agent.pid
/var/vcap/sys/run/bpm/log-cache-nozzle/log-cache-nozzle.pid
/var/vcap/sys/run/bpm/log-cache-gateway/log-cache-gateway.pid
/var/vcap/sys/run/bpm/log-cache-cf-auth-proxy/log-cache-cf-auth-proxy.pid
/var/vcap/sys/run/bpm/cloud_controller_ng/nginx.pid
/var/vcap/sys/run/bpm/cloud_controller_ng/local_worker_1.pid
/var/vcap/sys/run/bpm/cloud_controller_ng/local_worker_2.pid
/var/vcap/sys/run/bpm/route_registrar/route_registrar.pid
/var/vcap/sys/run/bpm/cc_uploader/cc_uploader.pid
/var/vcap/sys/run/bpm/policy-server-internal/policy-server-internal.pid
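In other words the convention appears to be /var/vcap/sys/run/bpm/<job>/<process>.pid when bpm is used, and /var/vcap/sys/run/<job>/<process>.pid otherwise (as with bosh-dns). Something along these lines could compute the expected location, with outliers handled by per-job patches; the helper name and the uses_bpm flag are assumptions for illustration:

# Hypothetical helper reflecting the convention above; whether a job uses bpm
# would have to come from the job's metadata and is assumed here.
pidfile_for() {
  local job="$1" process="$2" uses_bpm="$3"
  if [[ "${uses_bpm}" == "true" ]]; then
    echo "/var/vcap/sys/run/bpm/${job}/${process}.pid"
  else
    echo "/var/vcap/sys/run/${job}/${process}.pid"
  fi
}

pidfile_for gorouter gorouter true   # -> /var/vcap/sys/run/bpm/gorouter/gorouter.pid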