Timeout in liveness / readiness probes leading to outage

rodolfobrunner commented 1 month ago

Is there an existing issue for this?

[X] I have searched the existing issues

Kong version (`$ kong version`)

Kong 3.6.1

Current Behavior

When Kong loses connectivity to the Database, the first couple of requests to /status and /metrics take too long to return. Meanwhile, Kubernetes uses the /status to probe for the readiness / liveness status and since it got a timeout, it assumed that the pod was unhealthy.

Until Kubernets decided to restart the container, Kong was proxing requests without a problem.

After looking at kong.api.routes.health.lua we realized that kong.db:connect()instructing to test connectivity is being executed both synchronously and without passing a timeout leading to latency of 5+ secs of the first requests.

The sequence of events were:

Several "harmless" entries of: [error] 1322#0: *664431 [lua] health.lua:61: fn(): failed to connect to PostgreSQL during /status endpoint check: [PostgreSQL error] timeout, client: 10.145.65.245, server: kong_status, request: "GET /status HTTP/1.1", host: "10.145.43.33:8802" [error] 1322#0: *660572 [kong] exporter.lua:353 prometheus: failed to reach database while processing/metrics endpoint: timeout, client: 10.145.46.50, server: kong_status, request: "GET /metrics HTTP/1.1", host: "10.145.43.33:8802"

Followed by a surprise SIGQUIT:

2024/09/16 07:03:42 [notice] 1#0: exit
2024/09/16 07:03:42 [notice] 1#0: worker process 1322 exited with code 0
2024/09/16 07:03:42 [notice] 1#0: signal 17 (SIGCHLD) received from 1322
2024/09/16 07:03:41 [notice] 1322#0: exit
2024/09/16 07:03:41 [notice] 1322#0: exiting
2024/09/16 07:03:38 [notice] 1#0: signal 29 (SIGIO) received
2024/09/16 07:03:38 [notice] 1#0: worker process 1323 exited with code 0
2024/09/16 07:03:38 [notice] 1#0: signal 17 (SIGCHLD) received from 1323
2024/09/16 07:03:37 [notice] 1323#0: exit
2024/09/16 07:03:37 [notice] 1323#0: exiting
2024/09/16 07:03:37 [notice] 1323#0: gracefully shutting down
2024/09/16 07:03:37 [notice] 1322#0: gracefully shutting down
2024/09/16 07:03:37 [notice] 1#0: signal 3 (SIGQUIT) received from 1990, shutting down

Followed by the apparent shutdown process, we see:

Please note that the pod was not able to die even though it received a SIGQUIT. After the pod regained the DB connectivity, it remained the same. The reason for this is likely a crashloopbackoff while shutting down, as it tried to save some state in a DB it had no connectivity.

Expected Behavior

Kong had no problem to route request without DB connectivity until it is taken out from the cluster due to /status timeout.

As the DB is not in the hot path, we did not expect an outage on a DB restart.

In our opinion, kong.db:connect() to test connectivity should be done either asynchronously or have a timeout

Steps To Reproduce

Setup a Kong in traditional mode
Disable the DB
Make a request to any route (should work)
Make a request to /status (should have a latency spike for the first attempts)

Anything else?

No response

chobits commented 1 month ago

hi,

I just want to briefly mention the timeout issue.

After looking at kong.api.routes.health.lua we realized that kong.db:connect()instructing to test connectivity is being executed both synchronously and without passing a timeout leading to latency of 5+ secs of the first requests.

In our opinion, kong.db:connect() to test connectivity should be done either asynchronously or have a timeout

I think this behaviour is expected, the timeout is configured through the pg_timeout option from the kong.conf file, see the default value as follows:

#pg_timeout = 5000               # Defines the timeout (in ms), for connecting,
                                 # reading and writing.

rodolfobrunner commented 1 month ago

Thank you for the reply @chobits I tried to follow the code and the connect / reconnect looked to be receiving an config that apparently was not being passed in kong.db:connect(). But I now realize that it is as a self as first argument.

In your opinion, do you think the Kong chart liveness / readiness timeout having default value as pg_timeout might be problematic?

I am not sure about touching pg_timeout as idk exactly what Kong is doing under the hood and what are its needs, but it is bounded to 5seconds i can raise the probes timeout.

Kong / kong