Kong proxy goes unresponsive with normal traffic - #8152 issue - kubelet restarts the container since liveness probe fails

Discussed in https://github.com/Kong/kong/discussions/11709

^{Originally posted by **kranthirachakonda** October 7, 2023} Looks like we are running into Kong [#8152](https://github.com/Kong/kong/issues/8152) issue with 2.8.3 and also 3.0.1. We had new route/service added recently and there is increase in traffic but not huge. May ~15% increase. Only proxy container inside Kong pods gets restart due to /status timing out. We tried changing the timeout/period intervals but that didnt help much. We dont see our containers going above the cpu/memory requests. Behavior is identical, where kong proxy container goes into unresponsive mode for about 4-5 mins and processes again. We also observed that while I gave memory requests of 4G, proxy container never goes above 512Mi. Not sure if there are default limits within LuaVM or threads etc. Can you please help me in debugging further. db-less kong:2.8.3 ``` Limits: cpu: 2500m memory: 6Gi Requests: cpu: 1 memory: 4Gi Liveness: http-get http://:status/status delay=5s timeout=5s period=10s #success=1 #failure=3 ``` kubernetes-ingress-controller:2.8.2 ``` Limits: cpu: 2 memory: 2Gi Requests: cpu: 200m memory: 1Gi Liveness: http-get http://:10254/healthz delay=5s timeout=5s period=10s #success=1 #failure=3 ``` Events: ``` Type Reason Age From Message ---- ------ ---- ---- ------- Warning Unhealthy 4m7s (x94 over 20h) kubelet Liveness probe failed: Get "http://192.168.10.79:8100/status": context deadline exceeded (Client.Timeout exceeded while awaiting headers) Normal Killing 4m7s (x27 over 20h) kubelet Container proxy failed liveness probe, will be restarted ``` These are few errors I see while proxy containers goes into a condition where status page fails to respond within 5s, causing kubelet to restart when liveness probe fails. ``` [lua] targets.lua:248: could not reschedule DNS resolver timer: process exiting, context: ngx.timer [lua] init.lua:409: [cluster_events] failed to start recurring polling timer: process exiting, context: ngx.timer ``` Admin page ``` { "configuration": { "nginx_kong_conf": "/kong_prefix/nginx-kong.conf", "nginx_kong_stream_conf": "/kong_prefix/nginx-kong-stream.conf", "dns_error_ttl": 1, "kong_env": "/kong_prefix/.kong_env", "dns_not_found_ttl": 30, "dns_stale_ttl": 4, "dns_cache_size": 10000, "dns_order": [ "LAST", "SRV", "A", "CNAME" ], "dns_no_sync": false, "upstream_keepalive_pool_size": 60, "client_ssl_cert_default": "/kong_prefix/ssl/kong-default.crt", "client_ssl_cert_key_default": "/kong_prefix/ssl/kong-default.key", "admin_ssl_cert_default": "/kong_prefix/ssl/admin-kong-default.crt", "admin_ssl_cert_key_default": "/kong_prefix/ssl/admin-kong-default.key", "admin_ssl_cert_default_ecdsa": "/kong_prefix/ssl/admin-kong-default-ecdsa.crt", "ssl_cipher_suite": "intermediate", "admin_ssl_cert_key_default_ecdsa": "/kong_prefix/ssl/admin-kong-default-ecdsa.key", "ssl_ciphers": "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384", "status_ssl_cert_default": "/kong_prefix/ssl/status-kong-default.crt", "status_ssl_cert_key_default": "/kong_prefix/ssl/status-kong-default.key", "nginx_main_worker_rlimit_nofile": "auto", "upstream_keepalive_idle_timeout": 60, "nginx_events_worker_connections": "auto", "nginx_events_multi_accept": "on", "lua_package_cpath": "", "port_maps": [ "443:8443" ], "ssl_cert": [ "/etc/secrets/proxy-tls/tls.crt" ], "proxy_listen": [ "0.0.0.0:8443 http2 ssl" ], "admin_listen": [ "0.0.0.0:8444 ssl" ], "status_listen": [ "0.0.0.0:8100", "0.0.0.0:8543 ssl" ], "stream_listen": [ "off" ], "cluster_listen": [ "off" ], "admin_ssl_cert": [ "/kong_prefix/ssl/admin-kong-default.crt", "/kong_prefix/ssl/admin-kong-default-ecdsa.crt" ], "nginx_admin_client_max_body_size": "10m", "status_ssl_cert": [ "/kong_prefix/ssl/status-kong-default.crt", "/kong_prefix/ssl/status-kong-default-ecdsa.crt" ], "status_ssl_cert_key": [ "/kong_prefix/ssl/status-kong-default.key", "/kong_prefix/ssl/status-kong-default-ecdsa.key" ], "nginx_admin_client_body_buffer_size": "10m", "nginx_user": "kong kong", "nginx_main_user": "kong kong", "nginx_daemon": "off", "nginx_main_daemon": "off", "nginx_worker_processes": "2", "nginx_main_worker_processes": "2", "trusted_ips": {}, "kic": true, "nginx_proxy_real_ip_header": "X-Real-IP", "real_ip_recursive": "off", "nginx_proxy_real_ip_recursive": "off", "client_max_body_size": "0", "nginx_http_client_max_body_size": "0", "client_body_buffer_size": "8k", "nginx_http_client_body_buffer_size": "8k", "pg_port": 50001, "pg_password": "******", "pg_ssl": true, "pg_ssl_verify": false, "pg_max_concurrent_queries": 0, "pg_semaphore_timeout": 60000, "pg_ro_port": 50002, "pg_user": "user1", "pg_ro_password": "******", "pg_ro_ssl": true, "enabled_headers": { "server_tokens": true, "latency_tokens": true, "Via": true, "Server": true, "X-Kong-Proxy-Latency": true, "X-Kong-Response-Latency": true, "X-Kong-Admin-Latency": true, "X-Kong-Upstream-Status": true, "X-Kong-Upstream-Latency": true }, "cassandra_contact_points": [ "127.0.0.1" ], "cassandra_port": 9042, "pg_ro_host": "hawkeye-haproxy-svc.hawkeye-haproxy-ns.svc.cluster.local", "cassandra_ssl": false, "cassandra_ssl_verify": false, "cassandra_write_consistency": "ONE", "pg_ro_user": "user1", "cassandra_read_consistency": "ONE", "cassandra_lb_policy": "RequestRoundRobin", "cassandra_keyspace": "kong", "cassandra_refresh_frequency": 60, "error_default_type": "text/plain", "cassandra_username": "kong", "cassandra_repl_factor": 1, "cassandra_data_centers": [ "dc1:2", "dc2:3" ], "cassandra_schema_consensus_timeout": 10000, "ssl_protocols": "TLSv1.1 TLSv1.2 TLSv1.3", "nginx_http_ssl_protocols": "TLSv1.2 TLSv1.3", "nginx_stream_ssl_protocols": "TLSv1.2 TLSv1.3", "worker_consistency": "strict", "nginx_http_ssl_prefer_server_ciphers": "off", "nginx_stream_ssl_prefer_server_ciphers": "off", "ssl_dhparam": "ffdhe2048", "pluginserver_names": {}, "nginx_stream_ssl_dhparam": "ffdhe2048", "ssl_session_tickets": "on", "nginx_http_ssl_session_tickets": "on", "nginx_stream_ssl_session_tickets": "on", "stream_proxy_ssl_enabled": false, "go_plugins_dir": "off", "prefix": "/kong_prefix", "go_pluginserver_exe": "/usr/local/bin/go-pluginserver", "lua_package_path": "/opt/?.lua;/opt/?/init.lua;;", "db_cache_warmup_entities": [ "services" ], "status_listeners": [ { "ssl": false, "port": 8100, "listener": "0.0.0.0:8100", "ip": "0.0.0.0" }, { "ssl": true, "port": 8543, "listener": "0.0.0.0:8543 ssl", "ip": "0.0.0.0" } ], "proxy_ssl_enabled": true, "pg_timeout": 5000, "role": "traditional", "ssl_cert_key_default": "/kong_prefix/ssl/kong-default.key", "headers": [ "server_tokens", "latency_tokens", "X-Kong-Upstream-Status" ], "proxy_stream_error_log": "logs/error.log", "ssl_cert_default": "/kong_prefix/ssl/kong-default.crt", "nginx_http_lua_shared_dict": "prometheus_metrics 100m", "ssl_cert_key_default_ecdsa": "/kong_prefix/ssl/kong-default-ecdsa.key", "loaded_vaults": {}, "plugins": [ "bundled", "json-threat-protection", "upstream-oauth2", "upstream-basic-auth", "payload-logging", "custom-log", "upstream-jwt", "upstream-xdt-oauth2", "set-target-host", "hmac-url-auth", "inmar-hmac-auth", "oauth-consumer", "signnow-hmac-auth", "jwt-auth", "batch-rate-limiting" ], "admin_error_log": "/dev/stderr", "loaded_plugins": { "response-ratelimiting": true, "syslog": true, "loggly": true, "datadog": true, "ldap-auth": true, "statsd": true, "bot-detection": true, "aws-lambda": true, "request-termination": true, "prometheus": true, "proxy-cache": true, "session": true, "acme": true, "grpc-gateway": true, "grpc-web": true, "pre-function": true, "post-function": true, "azure-functions": true, "zipkin": true, "signnow-hmac-auth": true, "jwt-auth": true, "batch-rate-limiting": true, "upstream-oauth2": true, "upstream-basic-auth": true, "payload-logging": true, "custom-log": true, "upstream-jwt": true, "upstream-xdt-oauth2": true, "set-target-host": true, "hmac-url-auth": true, "inmar-hmac-auth": true, "oauth-consumer": true, "json-threat-protection": true, "jwt": true, "acl": true, "correlation-id": true, "cors": true, "oauth2": true, "tcp-log": true, "udp-log": true, "file-log": true, "http-log": true, "key-auth": true, "hmac-auth": true, "basic-auth": true, "ip-restriction": true, "request-transformer": true, "response-transformer": true, "request-size-limiting": true, "rate-limiting": true }, "proxy_error_log": "/dev/stderr", "host_ports": { "8443": 443, "8443": 443 }, "anonymous_reports": false, "real_ip_header": "X-Real-IP", "database": "postgres", "ssl_prefer_server_ciphers": "on", "status_ssl_cert_key_default_ecdsa": "/kong_prefix/ssl/status-kong-default-ecdsa.key", "db_resurrect_ttl": 30, "ssl_session_timeout": "1d", "nginx_http_ssl_session_timeout": "1d", "nginx_stream_ssl_session_timeout": "1d", "proxy_access_log": "/dev/stdout", "worker_state_update_frequency": 5, "proxy_stream_access_log": "logs/access.log basic", "dns_resolver": {}, "admin_access_log": "/dev/stdout", "dns_hostsfile": "/etc/hosts", "status_access_log": "off", "status_error_log": "logs/status_error.log", "log_level": "error", "cluster_listeners": {}, "status_ssl_enabled": true, "vaults": [ "off" ], "nginx_optimizations": true, "lua_ssl_trusted_certificate": {}, "lua_ssl_verify_depth": 1, "lua_ssl_protocols": "TLSv1.1 TLSv1.2 TLSv1.3", "nginx_http_lua_ssl_protocols": "TLSv1.1 TLSv1.2 TLSv1.3", "nginx_stream_lua_ssl_protocols": "TLSv1.1 TLSv1.2 TLSv1.3", "lua_socket_pool_size": 30, "db_update_frequency": 5, "cluster_control_plane": "127.0.0.1:8005", "cassandra_repl_strategy": "SimpleStrategy", "mem_cache_size": "3000m", "cluster_mtls": "shared", "ssl_cert_key": [ "/etc/secrets/proxy-tls/tls.key" ], "db_cache_ttl": 0, "nginx_main_directives": [ { "value": "off", "name": "daemon" }, { "value": "KONG_REDIS_HOST;env KONG_REDIS_PORT;env KONG_REDIS_PASSWORD", "name": "env" }, { "value": "kong kong", "name": "user" }, { "value": "2", "name": "worker_processes" }, { "value": "auto", "name": "worker_rlimit_nofile" } ], "cluster_data_plane_purge_delay": 1209600, "cluster_ocsp": "off", "cluster_max_payload": 4194304, "untrusted_lua": "sandbox", "nginx_http_directives": [ { "value": "8k", "name": "client_body_buffer_size" }, { "value": "0", "name": "client_max_body_size" }, { "value": "100000", "name": "lua_regex_match_limit" }, { "value": "prometheus_metrics 100m", "name": "lua_shared_dict" }, { "value": "TLSv1.1 TLSv1.2 TLSv1.3", "name": "lua_ssl_protocols" }, { "value": "/kong_prefix/ssl/ffdhe2048.pem", "name": "ssl_dhparam" }, { "value": "off", "name": "ssl_prefer_server_ciphers" }, { "value": "TLSv1.2 TLSv1.3", "name": "ssl_protocols" }, { "value": "on", "name": "ssl_session_tickets" }, { "value": "1d", "name": "ssl_session_timeout" } ], "untrusted_lua_sandbox_requires": {}, "untrusted_lua_sandbox_environment": {}, "client_ssl": false, "upstream_keepalive_max_requests": 100, "nginx_upstream_directives": {}, "pg_ro_database": "hawkeye", "nginx_proxy_directives": [ { "value": "X-Real-IP", "name": "real_ip_header" }, { "value": "off", "name": "real_ip_recursive" } ], "pg_database": "hawkeye", "nginx_status_directives": {}, "admin_ssl_cert_key": [ "/kong_prefix/ssl/admin-kong-default.key", "/kong_prefix/ssl/admin-kong-default-ecdsa.key" ], "nginx_admin_directives": [ { "value": "10m", "name": "client_body_buffer_size" }, { "value": "10m", "name": "client_max_body_size" } ], "nginx_events_directives": [ { "value": "on", "name": "multi_accept" }, { "value": "auto", "name": "worker_connections" } ], "nginx_stream_directives": [ { "value": "stream_prometheus_metrics 5m", "name": "lua_shared_dict" }, { "value": "TLSv1.1 TLSv1.2 TLSv1.3", "name": "lua_ssl_protocols" }, { "value": "/kong_prefix/ssl/ffdhe2048.pem", "name": "ssl_dhparam" }, { "value": "off", "name": "ssl_prefer_server_ciphers" }, { "value": "TLSv1.2 TLSv1.3", "name": "ssl_protocols" }, { "value": "on", "name": "ssl_session_tickets" }, { "value": "1d", "name": "ssl_session_timeout" } ], "pg_host": "hawkeye-haproxy-svc.hawkeye-haproxy-ns.svc.cluster.local", "nginx_supstream_directives": {}, "db_update_propagation": 0, "nginx_sproxy_directives": {}, "proxy_listeners": [ { "ssl": true, "port": 8443, "bind": false, "listener": "0.0.0.0:8443 ssl http2", "deferred": false, "reuseport": false, "backlog=%d+": false, "proxy_protocol": false, "http2": true, "ip": "0.0.0.0" } ], "stream_listeners": {}, "admin_listeners": [ { "ssl": true, "port": 8444, "bind": false, "listener": "0.0.0.0:8444 ssl", "deferred": false, "reuseport": false, "backlog=%d+": false, "proxy_protocol": false, "http2": false, "ip": "0.0.0.0" } ], "admin_ssl_enabled": true, "nginx_http_upstream_directives": {}, "nginx_http_status_directives": {}, "nginx_http_ssl_dhparam": "ffdhe2048", "nginx_pid": "/kong_prefix/pids/nginx.pid", "nginx_http_lua_regex_match_limit": "100000", "status_ssl_cert_default_ecdsa": "/kong_prefix/ssl/status-kong-default-ecdsa.crt", "nginx_err_logs": "/kong_prefix/logs/error.log", "nginx_main_env": "KONG_REDIS_HOST;env KONG_REDIS_PORT;env KONG_REDIS_PASSWORD", "cassandra_timeout": 5000, "nginx_acc_logs": "/kong_prefix/logs/access.log", "pg_ro_ssl_verify": false, "admin_acc_logs": "/kong_prefix/logs/admin_access.log", "ssl_cert_csr_default": "/kong_prefix/ssl/kong-default.csr", "nginx_conf": "/kong_prefix/nginx.conf", "ssl_cert_default_ecdsa": "/kong_prefix/ssl/kong-default-ecdsa.crt" }, "version": "2.8.3", "tagline": "Welcome to kong", "lua_version": "LuaJIT 2.1.0-beta3", "pids": { "workers": [ 1108, 1109 ], "master": 1 }, "hostname": "hawkeye-kong-c75b565dc-gkkkn", "node_id": "4517d36f-f53a-4514-b269-5046b3d13f83", "plugins": { "available_on_server": { "response-ratelimiting": true, "syslog": true, "loggly": true, "datadog": true, "ldap-auth": true, "statsd": true, "bot-detection": true, "aws-lambda": true, "request-termination": true, "prometheus": true, "proxy-cache": true, "session": true, "acme": true, "grpc-gateway": true, "grpc-web": true, "pre-function": true, "post-function": true, "azure-functions": true, "zipkin": true, "signnow-hmac-auth": true, "jwt-auth": true, "batch-rate-limiting": true, "oauth-consumer": true, "inmar-hmac-auth": true, "hmac-url-auth": true, "set-target-host": true, "upstream-xdt-oauth2": true, "upstream-jwt": true, "custom-log": true, "payload-logging": true, "upstream-basic-auth": true, "upstream-oauth2": true, "json-threat-protection": true, "jwt": true, "acl": true, "correlation-id": true, "cors": true, "oauth2": true, "tcp-log": true, "udp-log": true, "file-log": true, "http-log": true, "key-auth": true, "hmac-auth": true, "basic-auth": true, "ip-restriction": true, "request-transformer": true, "response-transformer": true, "request-size-limiting": true, "rate-limiting": true }, "enabled_in_cluster": [ "jwt-auth", "acl", "batch-rate-limiting", "upstream-basic-auth", "request-termination", "json-threat-protection", "request-transformer", "upstream-oauth2", "set-target-host", "jwt", "upstream-jwt", "signnow-hmac-auth", "request-size-limiting", "upstream-xdt-oauth2", "cors", "oauth2", "inmar-hmac-auth", "payload-logging", "custom-log", "basic-auth", "prometheus", "hmac-url-auth" ] }, "timers": { "running": 0, "pending": 8 } } ``` Status page > ``` { "database": { "reachable": true }, "server": { "connections_writing": 2, "total_requests": 10658, "connections_waiting": 10, "connections_handled": 1828, "connections_accepted": 1828, "connections_active": 12, "connections_reading": 0 }, "memory": { "workers_lua_vms": [ { "pid": 1108, "http_allocated_gc": "115.50 MiB" }, { "pid": 1109, "http_allocated_gc": "108.54 MiB" } ], "lua_shared_dicts": { "kong": { "capacity": "5.00 MiB", "allocated_slabs": "0.04 MiB" }, "kong_locks": { "capacity": "8.00 MiB", "allocated_slabs": "0.06 MiB" }, "kong_healthchecks": { "capacity": "5.00 MiB", "allocated_slabs": "0.04 MiB" }, "kong_process_events": { "capacity": "5.00 MiB", "allocated_slabs": "0.05 MiB" }, "kong_cluster_events": { "capacity": "5.00 MiB", "allocated_slabs": "0.04 MiB" }, "kong_rate_limiting_counters": { "capacity": "12.00 MiB", "allocated_slabs": "0.09 MiB" }, "kong_core_db_cache": { "capacity": "3000.00 MiB", "allocated_slabs": "24.14 MiB" }, "kong_core_db_cache_miss": { "capacity": "12.00 MiB", "allocated_slabs": "0.09 MiB" }, "kong_db_cache": { "capacity": "3000.00 MiB", "allocated_slabs": "17.58 MiB" }, "kong_db_cache_miss": { "capacity": "12.00 MiB", "allocated_slabs": "0.08 MiB" }, "prometheus_metrics": { "capacity": "100.00 MiB", "allocated_slabs": "5.20 MiB" } } } } ```

Kong / kong

Kong proxy goes unresponsive with normal traffic - #8152 issue - kubelet restarts the container since liveness probe fails #11710

Discussed in https://github.com/Kong/kong/discussions/11709

Changed ndots from 5 to 2

dns_valid_ttl: 53

dns_valid_ttl: 30 dns_no_sync: on

dns_valid_ttl: 53 dns_no_sync: on

Reverted dns_valid_ttl and dns_no_sync dns_order: LAST,A,CNAME

Reverted dns_order dns_cache_size: 100000

Reverted dns_cache_size dns_stale_ttl: 120

REverted dns_stale_ttl dns_stale_ttl: 127

Reverted all DNS changes

upstream_keepalive_max_requests: 200 upstream_keepalive_pool_size: 120 upstream_keepalive_idle_timeout: 30 lua_socket_pool_size: 60

dns_not_found_ttl: 300

Reverted dns_not_found_ttl dns_error_ttl: 30