Open rodolfobrunner opened 1 month ago
hi,
I just want to briefly mention the timeout issue.
After looking at kong.api.routes.health.lua we realized that kong.db:connect()instructing to test connectivity is being executed both synchronously and without passing a timeout leading to latency of 5+ secs of the first requests.
In our opinion, kong.db:connect() to test connectivity should be done either asynchronously or have a timeout
I think this behaviour is expected, the timeout is configured through the pg_timeout
option from the kong.conf
file, see the default value as follows:
#pg_timeout = 5000 # Defines the timeout (in ms), for connecting,
# reading and writing.
Thank you for the reply @chobits
I tried to follow the code and the connect / reconnect looked to be receiving an config that apparently was not being passed in kong.db:connect()
. But I now realize that it is as a self
as first argument.
In your opinion, do you think the Kong chart liveness / readiness timeout having default value as pg_timeout might be problematic?
I am not sure about touching pg_timeout
as idk exactly what Kong is doing under the hood and what are its needs, but it is bounded to 5seconds i can raise the probes timeout.
Is there an existing issue for this?
Kong version (
$ kong version
)Kong 3.6.1
Current Behavior
When Kong loses connectivity to the Database, the first couple of requests to
/status
and/metrics
take too long to return. Meanwhile, Kubernetes uses the/status
to probe for the readiness / liveness status and since it got a timeout, it assumed that the pod was unhealthy.Until Kubernets decided to restart the container, Kong was proxing requests without a problem.
After looking at
kong.api.routes.health.lua
we realized thatkong.db:connect()
instructing to test connectivity is being executed both synchronously and without passing a timeout leading to latency of 5+ secs of the first requests.The sequence of events were:
Several "harmless" entries of:
[error] 1322#0: *664431 [lua] health.lua:61: fn(): failed to connect to PostgreSQL during /status endpoint check: [PostgreSQL error] timeout, client: 10.145.65.245, server: kong_status, request: "GET /status HTTP/1.1", host: "10.145.43.33:8802"
[error] 1322#0: *660572 [kong] exporter.lua:353 prometheus: failed to reach database while processing/metrics endpoint: timeout, client: 10.145.46.50, server: kong_status, request: "GET /metrics HTTP/1.1", host: "10.145.43.33:8802"
Followed by a surprise SIGQUIT:
Followed by the apparent shutdown process, we see:
Please note that the pod was not able to die even though it received a SIGQUIT. After the pod regained the DB connectivity, it remained the same. The reason for this is likely a
crashloopbackoff
while shutting down, as it tried to save some state in a DB it had no connectivity.Expected Behavior
Kong had no problem to route request without DB connectivity until it is taken out from the cluster due to
/status
timeout.As the DB is not in the hot path, we did not expect an outage on a DB restart.
In our opinion,
kong.db:connect()
to test connectivity should be done either asynchronously or have a timeoutSteps To Reproduce
Anything else?
No response