cloudfoundry / cf-for-k8s

The open source deployment manifest for Cloud Foundry on Kubernetes
Apache License 2.0

cf-api-server deployment should restart pods if unhealthy #620

Open braunsonm opened 3 years ago

braunsonm commented 3 years ago

Describe the bug

We had CAPI holding onto an unhealthy Postgres DB connection, which was causing 502s from the CF API server. This deployment should have proper health checking so that troublesome pods are restarted.

To Reproduce

  1. Start up CAPI normally
  2. Create a DB connection interruption (for example by blocking egress to the database; see the sketch below)
  3. Notice that 502s are returned but the pod is not restarted
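One way to simulate the connection interruption for step 2 might be a deny-all egress NetworkPolicy scoped to the CAPI pods. This is a minimal sketch, assuming your CNI enforces NetworkPolicy and that the cf-api-server pods run in the `cf-system` namespace with an `app: cf-api-server` label (both assumptions; adjust to your install):

```yaml
# Hypothetical NetworkPolicy that cuts off all egress from the cf-api-server
# pods, which should sever the connection to an external Postgres database.
# Namespace and pod labels are assumptions, not values taken from cf-for-k8s.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-capi-egress
  namespace: cf-system
spec:
  podSelector:
    matchLabels:
      app: cf-api-server
  policyTypes:
  - Egress
  egress: []   # no egress rules listed, so all outbound traffic is denied
```

Deleting the policy afterwards restores connectivity, which should make it possible to observe whether CAPI recovers on its own or keeps serving errors until the pod is restarted.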

Expected behavior

The Pod should have been restarted when errors started occurring

Additional context

cf-for-k8s SHA

v1.1.0

Cluster information

AKS

cf-gitbot commented 3 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/176848778

The labels on this github issue will be updated when the story is started.

matt-royal commented 3 years ago

Thank you for the issue report, @braunsonm. We attempted to reproduce the problem you described, but didn't see the same behavior. Here's what we tried and saw, in Gherkin:

Given a healthy running cf-for-k8s deployment
When I bring down the postgres server
And curl the `/v3/organizations` endpoint on the cf-api-server
Then I see a 500 response code
And error logs in the cf-api-server container of the cf-api-server deployment

When I bring the postgres server back up
And I curl the same endpoint
Then I see a 200 response code
And observe that the cf-api-server container was never restarted

In scanning the cloud_controller_ng codebase, we only saw a few places where we explicitly return a 502, and they all involve service brokers, uaa calls, or recursive deletes. Were you doing any actions along those lines? Otherwise, it's likely from a timeout and it isn't clear to me exactly what would be timing out.

This issue prompted us to revisit how we use livenessProbes and readinessProbes in the cf-api-server deployment. However, the livenessProbes that make sense for the cf-api-server container don't exercise the database, since endpoints that do so require authentication. Given that, a livenessProbe likely wouldn't address the situation you're describing if the source of the problem truly was the cloud_controller_ng DB connection.

Can you provide more specific reproduction instructions to help us find a solution that will address your issue?

matt-royal commented 3 years ago

This commit was inspired by this issue, though it probably won't resolve it (since the /healthz endpoint doesn't exercise the database): https://github.com/cloudfoundry/capi-k8s-release/commit/7b1c2322af686c9c80bdb4105929dd2ee7243fe8
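For illustration, a probe along these lines could be wired into the cf-api-server container spec. This is a minimal sketch only: the `/healthz` path comes from the linked commit, but the port and timing values are assumptions rather than what capi-k8s-release actually uses, and as noted above such a probe would not catch a stuck database connection:

```yaml
# Hypothetical livenessProbe against the unauthenticated /healthz endpoint.
# Port and thresholds are illustrative assumptions; a repeated probe failure
# would cause the kubelet to restart the cf-api-server container.
livenessProbe:
  httpGet:
    path: /healthz
    port: 9022
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
```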

braunsonm commented 3 years ago

Hi @matt-royal

So I'm not sure how to reproduce CAPI getting into a bad state, but there were some networking issues between our cluster and our database. After those were resolved, CAPI was still returning 500s because of the DB connection and could never recover unless I manually restarted it.

So yes, this issue was to add livenessProbes to CAPI; however, the fact that the probe doesn't exercise the DB is the bigger problem.