Kong / charts

Helm chart for Kong

Review wait-for-db options #428

Closed ludovic-pourrat closed 3 years ago

ludovic-pourrat commented 3 years ago

Hi,

Currently there are two options available for setting up a hook that waits for the database to come up:

  1. A Kong binary one, which is database-type aware (Postgres, Cassandra).
  2. A low-level TCP one, only applicable when Postgres is embedded via a local chart dependency.

A few comments at this stage

Option A

Allow selecting between the low-level TCP option and the Kong binary one. This may require the TCP one to support Cassandra too. Allow setting up a wait hook on the jobs as well.

Option B

Build your own waiting hook, and allow it to be injected into the jobs too.

Thoughts?

rainest commented 3 years ago

I'd recommend just setting waitImage.enabled: false. That's our general recommendation for other scenarios (namely use of a service mesh sidecar injector) where initContainer ordering doesn't play nicely with the wait container.
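For reference, a minimal sketch of that override as a values overlay, assuming everything else is left at the chart defaults:

```yaml
# values.yaml overlay: disable the database wait init container.
# Only the value mentioned above is set; all other values stay at chart defaults.
waitImage:
  enabled: false
```

The same value can be set inline on the Helm command line with `--set waitImage.enabled=false`.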

The TCP check effectively only exists for a better experience when using the built-in Postgres chart, since that starts concurrently with the Kong Deployment and initial migrations Job but takes some time to become ready. It's not useful when you provide your own database (which we expect the majority of installations to do), since that database is managed independently of the Kong chart release: it will usually either already be up when Kong starts (in which case the TCP check is meaningless) or will never come online without external user action (in which case the check will simply block forever).

The Kong check is the more useful of the two if you're going to run any such check. The TCP check is really only useful for initial migrations, since that's the only thing that can run when the database is online but not initialized.
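To illustrate the distinction, here is a minimal sketch of what a low-level TCP wait looks like as a generic Kubernetes initContainer. This is not the chart's actual waitImage implementation; the service name and port are assumptions for an embedded Postgres:

```yaml
# Illustrative sketch only; not the chart's actual wait init container.
# "kong-postgresql" and port 5432 are assumed values for an embedded Postgres.
initContainers:
  - name: wait-for-postgres
    image: busybox:1.36
    command:
      - sh
      - -c
      - until nc -z -w 1 kong-postgresql 5432; do echo waiting for database; sleep 2; done
```

A check like this only confirms that the port accepts connections; it knows nothing about the database type or whether the schema has been initialized.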

Neither check is really required, however. They serve a mostly cosmetic purpose: avoiding Kong initially entering a CrashLoopBackoff state while it waits for migrations to complete. There's no technical issue with entering that state, since migrations will normally complete eventually, the Deployment will then start a Kong container instance, and that restart will come online without any issues. We have the cosmetic wait in place to try not to confuse new users, who may interpret the CrashLoopBackoff as an actual failure; if you know what's going on during startup you can simply ignore it as long as it doesn't persist for more than a minute or so.

Does that all make sense, and does that option work for you? We likely won't add more complexity to the database wait init containers because they're of such limited use to begin with.

ludovic-pourrat commented 3 years ago

Hi,

@rainest this makes sense in most cases, as you described, but we are deploying Postgres / Redis / Influx as part of the Kong deployment via an umbrella chart.

We can afford some restarts from Kong until the database becomes available (that's cosmetic) without a database wait hook, but the issue that remains in that case is that the migration jobs can exceed the backoffLimit, which takes the default value on the cluster. See https://kubernetes.io/docs/concepts/workloads/controllers/job/

We would then need to set the backoffLimit to a higher value on the jobs, to wait for the database and for the PVC allocation to be scheduled.
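For context, backoffLimit is a standard field on the Kubernetes Job spec (it defaults to 6 when omitted). A minimal sketch of a migrations-style Job with a raised limit, using placeholder names and an illustrative command, not the chart's actual template:

```yaml
# Generic Kubernetes Job sketch, not the chart's migrations template.
# The name, image tag, and the limit of 20 are placeholders for illustration.
apiVersion: batch/v1
kind: Job
metadata:
  name: kong-migrations
spec:
  backoffLimit: 20          # default is 6; raise it to tolerate a slow database/PVC
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrations
          image: kong:2.4
          command: ["kong", "migrations", "bootstrap"]
```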

I think that's the only way to avoid adding a wait-for-database option on the jobs outside the embedded use case.

Thoughts?

rainest commented 3 years ago

Did you encounter that typically in practice? In my own testing, when I was looking at removing the waits entirely, Postgres deployed alongside Kong managed to provision storage and come online before the backoff exceeded its limit, though I understand that could vary wildly depending on the characteristics of your storage provider.

Helm itself doesn't provide anything to order dependencies in umbrella charts AFAIK, but would https://github.com/ThalesGroup/helm-spray work to set weights on the PG and Kong charts such that PG starts and becomes ready first?

Otherwise I think adding a configurable backoffLimit to the migrations is probably the way to go rather than tuning exactly how the wait stuff works. That limit is a standard part of the Job spec, so it's something users are more likely to be familiar with from general Kubernetes usage (not something specific to our chart), and it may have some use beyond this if exposed.
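If the chart were to expose it, the values override might look something like the following. The key name `migrations.backoffLimit` is a hypothetical placeholder here, not a confirmed chart value; the exact shape depends on the eventual PR:

```yaml
# Hypothetical values sketch; the key name is illustrative only.
migrations:
  backoffLimit: 20
```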

ludovic-pourrat commented 3 years ago

@rainest, yes, most often in our testing the default backoffLimit is enough, but on a few deployments we encountered the issue where the job was dropped by Kubernetes after the limit; the scheduler was taking longer than usual to provision the PVC. You are right, there is no need to bring more complexity to the chart, and the simple way to proceed is to support this feature enhancement for specifying the backoffLimit, to avoid putting the deployment into failure. If you agree I will provide a PR for this feature request.

Thanks !

rainest commented 3 years ago

#442 will be available in the next release, 2.4.0.