Kong / kong

🦍 The Cloud-Native API Gateway and AI Gateway.
https://konghq.com/install/#kong-community
Apache License 2.0

Kong proxy goes unresponsive with normal traffic - #8152 issue - kubelet restarts the container since liveness probe fails #11710

Closed: kranthirachakonda closed this issue 1 year ago

kranthirachakonda commented 1 year ago

Discussed in https://github.com/Kong/kong/discussions/11709

Originally posted by **kranthirachakonda** October 7, 2023

Looks like we are running into Kong issue [#8152](https://github.com/Kong/kong/issues/8152) with 2.8.3 and also with 3.0.1. A new route/service was added recently and traffic has increased, but not hugely, maybe ~15%. Only the proxy container inside the Kong pods gets restarted, because /status times out. We tried changing the probe timeout/period intervals, but that didn't help much (a sketch of that probe tuning follows after the status output below). We don't see our containers going above their CPU/memory requests. The behavior is identical each time: the kong proxy container goes unresponsive for about 4-5 minutes and then processes traffic again. We also observed that while the memory request is 4G, the proxy container never goes above 512Mi; not sure if there are default limits within the Lua VM, threads, etc. Can you please help me debug this further?

db-less kong:2.8.3

```
Limits:
  cpu:     2500m
  memory:  6Gi
Requests:
  cpu:     1
  memory:  4Gi
Liveness:  http-get http://:status/status delay=5s timeout=5s period=10s #success=1 #failure=3
```

kubernetes-ingress-controller:2.8.2

```
Limits:
  cpu:     2
  memory:  2Gi
Requests:
  cpu:     200m
  memory:  1Gi
Liveness:  http-get http://:10254/healthz delay=5s timeout=5s period=10s #success=1 #failure=3
```

Events:

```
Type     Reason     Age                  From     Message
----     ------     ----                 ----     -------
Warning  Unhealthy  4m7s (x94 over 20h)  kubelet  Liveness probe failed: Get "http://192.168.10.79:8100/status": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Normal   Killing    4m7s (x27 over 20h)  kubelet  Container proxy failed liveness probe, will be restarted
```

These are a few errors I see while the proxy container is in the state where the status page fails to respond within 5s, causing kubelet to restart it once the liveness probe fails.

```
[lua] targets.lua:248: could not reschedule DNS resolver timer: process exiting, context: ngx.timer
[lua] init.lua:409: [cluster_events] failed to start recurring polling timer: process exiting, context: ngx.timer
```

Admin page

```
{ "configuration": { "nginx_kong_conf": "/kong_prefix/nginx-kong.conf", "nginx_kong_stream_conf": "/kong_prefix/nginx-kong-stream.conf", "dns_error_ttl": 1, "kong_env": "/kong_prefix/.kong_env", "dns_not_found_ttl": 30, "dns_stale_ttl": 4, "dns_cache_size": 10000, "dns_order": [ "LAST", "SRV", "A", "CNAME" ], "dns_no_sync": false, "upstream_keepalive_pool_size": 60, "client_ssl_cert_default": "/kong_prefix/ssl/kong-default.crt", "client_ssl_cert_key_default": "/kong_prefix/ssl/kong-default.key", "admin_ssl_cert_default": "/kong_prefix/ssl/admin-kong-default.crt", "admin_ssl_cert_key_default": "/kong_prefix/ssl/admin-kong-default.key", "admin_ssl_cert_default_ecdsa": "/kong_prefix/ssl/admin-kong-default-ecdsa.crt", "ssl_cipher_suite": "intermediate", "admin_ssl_cert_key_default_ecdsa": "/kong_prefix/ssl/admin-kong-default-ecdsa.key", "ssl_ciphers": "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384", "status_ssl_cert_default": "/kong_prefix/ssl/status-kong-default.crt", "status_ssl_cert_key_default": "/kong_prefix/ssl/status-kong-default.key", "nginx_main_worker_rlimit_nofile": "auto", "upstream_keepalive_idle_timeout": 60, "nginx_events_worker_connections": "auto", "nginx_events_multi_accept": "on", "lua_package_cpath": "", "port_maps": [ "443:8443" ], "ssl_cert": [ "/etc/secrets/proxy-tls/tls.crt" ], "proxy_listen": [ "0.0.0.0:8443 http2 ssl" ], "admin_listen": [ "0.0.0.0:8444 ssl"
], "status_listen": [ "0.0.0.0:8100", "0.0.0.0:8543 ssl" ], "stream_listen": [ "off" ], "cluster_listen": [ "off" ], "admin_ssl_cert": [ "/kong_prefix/ssl/admin-kong-default.crt", "/kong_prefix/ssl/admin-kong-default-ecdsa.crt" ], "nginx_admin_client_max_body_size": "10m", "status_ssl_cert": [ "/kong_prefix/ssl/status-kong-default.crt", "/kong_prefix/ssl/status-kong-default-ecdsa.crt" ], "status_ssl_cert_key": [ "/kong_prefix/ssl/status-kong-default.key", "/kong_prefix/ssl/status-kong-default-ecdsa.key" ], "nginx_admin_client_body_buffer_size": "10m", "nginx_user": "kong kong", "nginx_main_user": "kong kong", "nginx_daemon": "off", "nginx_main_daemon": "off", "nginx_worker_processes": "2", "nginx_main_worker_processes": "2", "trusted_ips": {}, "kic": true, "nginx_proxy_real_ip_header": "X-Real-IP", "real_ip_recursive": "off", "nginx_proxy_real_ip_recursive": "off", "client_max_body_size": "0", "nginx_http_client_max_body_size": "0", "client_body_buffer_size": "8k", "nginx_http_client_body_buffer_size": "8k", "pg_port": 50001, "pg_password": "******", "pg_ssl": true, "pg_ssl_verify": false, "pg_max_concurrent_queries": 0, "pg_semaphore_timeout": 60000, "pg_ro_port": 50002, "pg_user": "user1", "pg_ro_password": "******", "pg_ro_ssl": true, "enabled_headers": { "server_tokens": true, "latency_tokens": true, "Via": true, "Server": true, "X-Kong-Proxy-Latency": true, "X-Kong-Response-Latency": true, "X-Kong-Admin-Latency": true, "X-Kong-Upstream-Status": true, "X-Kong-Upstream-Latency": true }, "cassandra_contact_points": [ "127.0.0.1" ], "cassandra_port": 9042, "pg_ro_host": "hawkeye-haproxy-svc.hawkeye-haproxy-ns.svc.cluster.local", "cassandra_ssl": false, "cassandra_ssl_verify": false, "cassandra_write_consistency": "ONE", "pg_ro_user": "user1", "cassandra_read_consistency": "ONE", "cassandra_lb_policy": "RequestRoundRobin", "cassandra_keyspace": "kong", "cassandra_refresh_frequency": 60, "error_default_type": "text/plain", "cassandra_username": "kong", "cassandra_repl_factor": 1, "cassandra_data_centers": [ "dc1:2", "dc2:3" ], "cassandra_schema_consensus_timeout": 10000, "ssl_protocols": "TLSv1.1 TLSv1.2 TLSv1.3", "nginx_http_ssl_protocols": "TLSv1.2 TLSv1.3", "nginx_stream_ssl_protocols": "TLSv1.2 TLSv1.3", "worker_consistency": "strict", "nginx_http_ssl_prefer_server_ciphers": "off", "nginx_stream_ssl_prefer_server_ciphers": "off", "ssl_dhparam": "ffdhe2048", "pluginserver_names": {}, "nginx_stream_ssl_dhparam": "ffdhe2048", "ssl_session_tickets": "on", "nginx_http_ssl_session_tickets": "on", "nginx_stream_ssl_session_tickets": "on", "stream_proxy_ssl_enabled": false, "go_plugins_dir": "off", "prefix": "/kong_prefix", "go_pluginserver_exe": "/usr/local/bin/go-pluginserver", "lua_package_path": "/opt/?.lua;/opt/?/init.lua;;", "db_cache_warmup_entities": [ "services" ], "status_listeners": [ { "ssl": false, "port": 8100, "listener": "0.0.0.0:8100", "ip": "0.0.0.0" }, { "ssl": true, "port": 8543, "listener": "0.0.0.0:8543 ssl", "ip": "0.0.0.0" } ], "proxy_ssl_enabled": true, "pg_timeout": 5000, "role": "traditional", "ssl_cert_key_default": "/kong_prefix/ssl/kong-default.key", "headers": [ "server_tokens", "latency_tokens", "X-Kong-Upstream-Status" ], "proxy_stream_error_log": "logs/error.log", "ssl_cert_default": "/kong_prefix/ssl/kong-default.crt", "nginx_http_lua_shared_dict": "prometheus_metrics 100m", "ssl_cert_key_default_ecdsa": "/kong_prefix/ssl/kong-default-ecdsa.key", "loaded_vaults": {}, "plugins": [ "bundled", "json-threat-protection", "upstream-oauth2", "upstream-basic-auth", 
"payload-logging", "custom-log", "upstream-jwt", "upstream-xdt-oauth2", "set-target-host", "hmac-url-auth", "inmar-hmac-auth", "oauth-consumer", "signnow-hmac-auth", "jwt-auth", "batch-rate-limiting" ], "admin_error_log": "/dev/stderr", "loaded_plugins": { "response-ratelimiting": true, "syslog": true, "loggly": true, "datadog": true, "ldap-auth": true, "statsd": true, "bot-detection": true, "aws-lambda": true, "request-termination": true, "prometheus": true, "proxy-cache": true, "session": true, "acme": true, "grpc-gateway": true, "grpc-web": true, "pre-function": true, "post-function": true, "azure-functions": true, "zipkin": true, "signnow-hmac-auth": true, "jwt-auth": true, "batch-rate-limiting": true, "upstream-oauth2": true, "upstream-basic-auth": true, "payload-logging": true, "custom-log": true, "upstream-jwt": true, "upstream-xdt-oauth2": true, "set-target-host": true, "hmac-url-auth": true, "inmar-hmac-auth": true, "oauth-consumer": true, "json-threat-protection": true, "jwt": true, "acl": true, "correlation-id": true, "cors": true, "oauth2": true, "tcp-log": true, "udp-log": true, "file-log": true, "http-log": true, "key-auth": true, "hmac-auth": true, "basic-auth": true, "ip-restriction": true, "request-transformer": true, "response-transformer": true, "request-size-limiting": true, "rate-limiting": true }, "proxy_error_log": "/dev/stderr", "host_ports": { "8443": 443, "8443": 443 }, "anonymous_reports": false, "real_ip_header": "X-Real-IP", "database": "postgres", "ssl_prefer_server_ciphers": "on", "status_ssl_cert_key_default_ecdsa": "/kong_prefix/ssl/status-kong-default-ecdsa.key", "db_resurrect_ttl": 30, "ssl_session_timeout": "1d", "nginx_http_ssl_session_timeout": "1d", "nginx_stream_ssl_session_timeout": "1d", "proxy_access_log": "/dev/stdout", "worker_state_update_frequency": 5, "proxy_stream_access_log": "logs/access.log basic", "dns_resolver": {}, "admin_access_log": "/dev/stdout", "dns_hostsfile": "/etc/hosts", "status_access_log": "off", "status_error_log": "logs/status_error.log", "log_level": "error", "cluster_listeners": {}, "status_ssl_enabled": true, "vaults": [ "off" ], "nginx_optimizations": true, "lua_ssl_trusted_certificate": {}, "lua_ssl_verify_depth": 1, "lua_ssl_protocols": "TLSv1.1 TLSv1.2 TLSv1.3", "nginx_http_lua_ssl_protocols": "TLSv1.1 TLSv1.2 TLSv1.3", "nginx_stream_lua_ssl_protocols": "TLSv1.1 TLSv1.2 TLSv1.3", "lua_socket_pool_size": 30, "db_update_frequency": 5, "cluster_control_plane": "127.0.0.1:8005", "cassandra_repl_strategy": "SimpleStrategy", "mem_cache_size": "3000m", "cluster_mtls": "shared", "ssl_cert_key": [ "/etc/secrets/proxy-tls/tls.key" ], "db_cache_ttl": 0, "nginx_main_directives": [ { "value": "off", "name": "daemon" }, { "value": "KONG_REDIS_HOST;env KONG_REDIS_PORT;env KONG_REDIS_PASSWORD", "name": "env" }, { "value": "kong kong", "name": "user" }, { "value": "2", "name": "worker_processes" }, { "value": "auto", "name": "worker_rlimit_nofile" } ], "cluster_data_plane_purge_delay": 1209600, "cluster_ocsp": "off", "cluster_max_payload": 4194304, "untrusted_lua": "sandbox", "nginx_http_directives": [ { "value": "8k", "name": "client_body_buffer_size" }, { "value": "0", "name": "client_max_body_size" }, { "value": "100000", "name": "lua_regex_match_limit" }, { "value": "prometheus_metrics 100m", "name": "lua_shared_dict" }, { "value": "TLSv1.1 TLSv1.2 TLSv1.3", "name": "lua_ssl_protocols" }, { "value": "/kong_prefix/ssl/ffdhe2048.pem", "name": "ssl_dhparam" }, { "value": "off", "name": "ssl_prefer_server_ciphers" }, { "value": 
"TLSv1.2 TLSv1.3", "name": "ssl_protocols" }, { "value": "on", "name": "ssl_session_tickets" }, { "value": "1d", "name": "ssl_session_timeout" } ], "untrusted_lua_sandbox_requires": {}, "untrusted_lua_sandbox_environment": {}, "client_ssl": false, "upstream_keepalive_max_requests": 100, "nginx_upstream_directives": {}, "pg_ro_database": "hawkeye", "nginx_proxy_directives": [ { "value": "X-Real-IP", "name": "real_ip_header" }, { "value": "off", "name": "real_ip_recursive" } ], "pg_database": "hawkeye", "nginx_status_directives": {}, "admin_ssl_cert_key": [ "/kong_prefix/ssl/admin-kong-default.key", "/kong_prefix/ssl/admin-kong-default-ecdsa.key" ], "nginx_admin_directives": [ { "value": "10m", "name": "client_body_buffer_size" }, { "value": "10m", "name": "client_max_body_size" } ], "nginx_events_directives": [ { "value": "on", "name": "multi_accept" }, { "value": "auto", "name": "worker_connections" } ], "nginx_stream_directives": [ { "value": "stream_prometheus_metrics 5m", "name": "lua_shared_dict" }, { "value": "TLSv1.1 TLSv1.2 TLSv1.3", "name": "lua_ssl_protocols" }, { "value": "/kong_prefix/ssl/ffdhe2048.pem", "name": "ssl_dhparam" }, { "value": "off", "name": "ssl_prefer_server_ciphers" }, { "value": "TLSv1.2 TLSv1.3", "name": "ssl_protocols" }, { "value": "on", "name": "ssl_session_tickets" }, { "value": "1d", "name": "ssl_session_timeout" } ], "pg_host": "hawkeye-haproxy-svc.hawkeye-haproxy-ns.svc.cluster.local", "nginx_supstream_directives": {}, "db_update_propagation": 0, "nginx_sproxy_directives": {}, "proxy_listeners": [ { "ssl": true, "port": 8443, "bind": false, "listener": "0.0.0.0:8443 ssl http2", "deferred": false, "reuseport": false, "backlog=%d+": false, "proxy_protocol": false, "http2": true, "ip": "0.0.0.0" } ], "stream_listeners": {}, "admin_listeners": [ { "ssl": true, "port": 8444, "bind": false, "listener": "0.0.0.0:8444 ssl", "deferred": false, "reuseport": false, "backlog=%d+": false, "proxy_protocol": false, "http2": false, "ip": "0.0.0.0" } ], "admin_ssl_enabled": true, "nginx_http_upstream_directives": {}, "nginx_http_status_directives": {}, "nginx_http_ssl_dhparam": "ffdhe2048", "nginx_pid": "/kong_prefix/pids/nginx.pid", "nginx_http_lua_regex_match_limit": "100000", "status_ssl_cert_default_ecdsa": "/kong_prefix/ssl/status-kong-default-ecdsa.crt", "nginx_err_logs": "/kong_prefix/logs/error.log", "nginx_main_env": "KONG_REDIS_HOST;env KONG_REDIS_PORT;env KONG_REDIS_PASSWORD", "cassandra_timeout": 5000, "nginx_acc_logs": "/kong_prefix/logs/access.log", "pg_ro_ssl_verify": false, "admin_acc_logs": "/kong_prefix/logs/admin_access.log", "ssl_cert_csr_default": "/kong_prefix/ssl/kong-default.csr", "nginx_conf": "/kong_prefix/nginx.conf", "ssl_cert_default_ecdsa": "/kong_prefix/ssl/kong-default-ecdsa.crt" }, "version": "2.8.3", "tagline": "Welcome to kong", "lua_version": "LuaJIT 2.1.0-beta3", "pids": { "workers": [ 1108, 1109 ], "master": 1 }, "hostname": "hawkeye-kong-c75b565dc-gkkkn", "node_id": "4517d36f-f53a-4514-b269-5046b3d13f83", "plugins": { "available_on_server": { "response-ratelimiting": true, "syslog": true, "loggly": true, "datadog": true, "ldap-auth": true, "statsd": true, "bot-detection": true, "aws-lambda": true, "request-termination": true, "prometheus": true, "proxy-cache": true, "session": true, "acme": true, "grpc-gateway": true, "grpc-web": true, "pre-function": true, "post-function": true, "azure-functions": true, "zipkin": true, "signnow-hmac-auth": true, "jwt-auth": true, "batch-rate-limiting": true, "oauth-consumer": true, 
"inmar-hmac-auth": true, "hmac-url-auth": true, "set-target-host": true, "upstream-xdt-oauth2": true, "upstream-jwt": true, "custom-log": true, "payload-logging": true, "upstream-basic-auth": true, "upstream-oauth2": true, "json-threat-protection": true, "jwt": true, "acl": true, "correlation-id": true, "cors": true, "oauth2": true, "tcp-log": true, "udp-log": true, "file-log": true, "http-log": true, "key-auth": true, "hmac-auth": true, "basic-auth": true, "ip-restriction": true, "request-transformer": true, "response-transformer": true, "request-size-limiting": true, "rate-limiting": true }, "enabled_in_cluster": [ "jwt-auth", "acl", "batch-rate-limiting", "upstream-basic-auth", "request-termination", "json-threat-protection", "request-transformer", "upstream-oauth2", "set-target-host", "jwt", "upstream-jwt", "signnow-hmac-auth", "request-size-limiting", "upstream-xdt-oauth2", "cors", "oauth2", "inmar-hmac-auth", "payload-logging", "custom-log", "basic-auth", "prometheus", "hmac-url-auth" ] }, "timers": { "running": 0, "pending": 8 } } ``` Status page > ``` { "database": { "reachable": true }, "server": { "connections_writing": 2, "total_requests": 10658, "connections_waiting": 10, "connections_handled": 1828, "connections_accepted": 1828, "connections_active": 12, "connections_reading": 0 }, "memory": { "workers_lua_vms": [ { "pid": 1108, "http_allocated_gc": "115.50 MiB" }, { "pid": 1109, "http_allocated_gc": "108.54 MiB" } ], "lua_shared_dicts": { "kong": { "capacity": "5.00 MiB", "allocated_slabs": "0.04 MiB" }, "kong_locks": { "capacity": "8.00 MiB", "allocated_slabs": "0.06 MiB" }, "kong_healthchecks": { "capacity": "5.00 MiB", "allocated_slabs": "0.04 MiB" }, "kong_process_events": { "capacity": "5.00 MiB", "allocated_slabs": "0.05 MiB" }, "kong_cluster_events": { "capacity": "5.00 MiB", "allocated_slabs": "0.04 MiB" }, "kong_rate_limiting_counters": { "capacity": "12.00 MiB", "allocated_slabs": "0.09 MiB" }, "kong_core_db_cache": { "capacity": "3000.00 MiB", "allocated_slabs": "24.14 MiB" }, "kong_core_db_cache_miss": { "capacity": "12.00 MiB", "allocated_slabs": "0.09 MiB" }, "kong_db_cache": { "capacity": "3000.00 MiB", "allocated_slabs": "17.58 MiB" }, "kong_db_cache_miss": { "capacity": "12.00 MiB", "allocated_slabs": "0.08 MiB" }, "prometheus_metrics": { "capacity": "100.00 MiB", "allocated_slabs": "5.20 MiB" } } } } ```
hanshuebner commented 1 year ago

@kranthirachakonda How large is your configuration (number of routes/services/consumers), roughly?

kranthirachakonda commented 1 year ago

> @kranthirachakonda How large is your configuration (number of routes/services/consumers), roughly?

@hanshuebner Approximately: routes ~900, services ~900, consumers ~100, plugins ~2900.

hanshuebner commented 1 year ago

@kranthirachakonda The number of Kong entities seems to be moderate, so it is not a sizing issue. Are you able to monitor DNS traffic made by the Kong pod?

kranthirachakonda commented 1 year ago

> @kranthirachakonda The number of Kong entities seems to be moderate, so it is not a sizing issue. Are you able to monitor DNS traffic made by the Kong pod?

@hanshuebner @nowNick - Yes, DNS traffic increased. We tried the changes below and none of them stopped the Kong proxy from going unresponsive (these settings are set as KONG_* environment variables on the proxy container; see the sketch after this list).

- Changed `ndots` from 5 to 2
- `dns_valid_ttl: 53`
- `dns_valid_ttl: 30`, `dns_no_sync: on`
- `dns_valid_ttl: 53`, `dns_no_sync: on`
- Reverted `dns_valid_ttl` and `dns_no_sync`; `dns_order: LAST,A,CNAME`
- Reverted `dns_order`; `dns_cache_size: 100000`
- Reverted `dns_cache_size`; `dns_stale_ttl: 120`
- Reverted `dns_stale_ttl`; `dns_stale_ttl: 127`
- Reverted all DNS changes
- `upstream_keepalive_max_requests: 200`, `upstream_keepalive_pool_size: 120`, `upstream_keepalive_idle_timeout: 30`, `lua_socket_pool_size: 60`
- `dns_not_found_ttl: 300`
- Reverted `dns_not_found_ttl`; `dns_error_ttl: 30`
- Reverted `dns_error_ttl`; `lua_socket_pool_size: 127`
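
In our deployment these knobs are applied as environment variables on the proxy container (Kong reads each kong.conf key as an upper-cased KONG_* variable). A sketch of how a few of the values above would be expressed in the standard Kubernetes env block; the grouping here is illustrative, since the combinations were actually tried in separate iterations:

```yaml
# Sketch: DNS-related kong.conf keys expressed as container environment variables.
env:
  - name: KONG_DNS_VALID_TTL      # override the TTL of resolved records
    value: "30"
  - name: KONG_DNS_NO_SYNC        # don't make requests wait on a shared in-flight lookup
    value: "on"
  - name: KONG_DNS_STALE_TTL      # serve stale records while re-resolving in the background
    value: "120"
  - name: KONG_DNS_ORDER
    value: "LAST,A,CNAME"
```
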

hanshuebner commented 1 year ago

@kranthirachakonda Can you provide us with some information regarding the DNS traffic that you see? Are the requests that are sent all different, or are there multiple requests for the same name? How many requests do you see?

kranthirachakonda commented 1 year ago

> @kranthirachakonda Can you provide us with some information regarding the DNS traffic that you see? Are the requests that are sent all different, or are there multiple requests for the same name? How many requests do you see?

We are able to reproduce the issue in our non-prod environment as well, with a couple of service endpoints that use a mix of internal and external hostnames (FQDNs).

I suspect that when one or a few of the problematic routes/services are invoked, the worker processes run out of timers or some other capacity, which makes them unresponsive. In that state the /status page, and every other service, does not respond for ~45 seconds. To test this theory we increased the liveness probe period so that kubelet would not restart the container, and saw that after 4-5 minutes the worker processes recovered on their own. So can you please help me figure out whether I am running into some resource constraint, and what it could be?

Based on our Grafana charts and real-time top I don't see high CPU/memory usage: CPU peaks at about 500m and memory at about 512Mi. I am not sure why usage does not grow beyond those values.
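
From the admin dump above, the proxy runs with nginx_worker_processes = 2, so the container can only ever use roughly two CPUs regardless of the pod limits, and a single busy or blocked worker removes half of that capacity. If the worker count should scale with the CPU request, it can be raised through the container environment; a sketch, with an illustrative value:

```yaml
# Sketch: raising the nginx worker count via the proxy container environment.
# "2" is the current value from the admin dump; "auto" sizes workers to the CPUs
# available to the container.
env:
  - name: KONG_NGINX_WORKER_PROCESSES
    value: "4"    # illustrative; could also be "auto"
```
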

The DNS traffic and the traffic to the Kong proxy look normal, e.g.:

*(screenshot of traffic graphs omitted)*

Kong-Nginx timers - I see a lot of them in the pending state.

*(screenshot of the timers panel omitted)*

hanshuebner commented 1 year ago

> Kong-Nginx timers - I see a lot of them in the pending state.

Where do you see those?

When Kong hangs, do you see a lot of open network connections?

kranthirachakonda commented 1 year ago

> Kong-Nginx timers - I see a lot of them in the pending state.

> Where do you see those?

In the Nginx timers panel of the Grafana dashboard.

> When Kong hangs, do you see a lot of open network connections?

Yes, about 200 connections in TIME_WAIT.

I see HTTP latency of about 100s for a few calls, and sometimes 100% CPU usage for the kong proxy container alone. I am able to reproduce the same issue on version 3.2.2 as well. Any pointers on how I can debug this further?

kranthirachakonda commented 1 year ago

We found the issue: timeouts against one of the external APIs, combined with our custom plugins, caused the worker processes to go to 100% CPU. Updating those fixed our issue.
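
For anyone else hitting this: part of the mitigation is making sure the Service entity that fronts the slow external API has bounded timeouts, so a hung upstream cannot pin a worker for long. A declarative-config sketch, where the service name, URL and timeout values are hypothetical (the same fields can also be set through the Admin API or the ingress controller):

```yaml
# Sketch: bounding upstream timeouts on a Kong Service entity
# (hypothetical name/URL; timeout values are in milliseconds).
_format_version: "2.1"
services:
  - name: slow-external-api
    url: https://partner.example.com/v1
    connect_timeout: 5000
    read_timeout: 15000
    write_timeout: 15000
    routes:
      - name: slow-external-api-route
        paths:
          - /external
```
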

alexandresavicki commented 6 months ago

@kranthirachakonda Can you please share how you identified the root cause? I think I'm facing the same issue here. I'm new to Kong, so it would be great if you could share some tips on how to track it down.