Kong / kong


Prometheus plugin causes much longer serving duration for concurrent HTTP requests #8844

Closed. seh closed this issue 1 year ago.

seh commented 2 years ago

Is there an existing issue for this?

Kong version ($ kong version)

Kong Enterprise 2.8.1.0

Current Behavior

When we activate Kong's Prometheus plugin in "global" mode—by labeling the KongClusterPlugin object with the "global" label set to "true"—and have Prometheus scrape the proxy container at 0.1 Hz (once every ten seconds), Kong will take an extra 1.5 seconds or so to serve many other requests that arrive while it is responding to that scrape request. Note that the Kubernetes kubelet also issues probes to the Kong Admin API's /status HTTP path at the same frequency, but we don't find that those probes cause this problem when we stop Prometheus from scraping the /metrics HTTP path.

When Prometheus is not scraping Kong, and even during the approximately seven-second period between when Kong starts and finishes responding to the scrape request, we can issue thousands of requests against nonexistent routes, and Kong will respond with status code 404 in 80ms or less. For example, we might use the bombardier tool to issue three seconds' worth of requests, at around 10,000 requests per second, and Kong will never take more than 80ms to respond to any of them. However, if those requests overlap with the Prometheus scrape, then some of them—less than 1% of them, as reported by bombardier—will take around 1.6 seconds.

Here is a sample invocation of bombardier, provoking 404 responses deliberately by requesting a virtual hostname for which we have no Kubernetes Ingress object and hence no route within Kong's proxy:

bombardier --header='Host: nonexistent.example.com' \
           --insecure \
           --header='Content-Type: application/json' \
           --body='{"elided": true}' \
           --method=POST \
           --disableKeepAlives \
           --duration=20s  \
           --latencies \
           https://100.118.12.10:8443

We can issue that from the same machine hosting the Kong proxy container in a Kubernetes pod. We use TLS here to mimic a common request pattern that our proxies actually serve daily.

If we patch the KongClusterPlugin for Prometheus and either set the "global" label's value to "false" or remove it entirely, strangely we don't see this problem abate. Even without the Prometheus plugin activated, Kong still serves metrics on the /metrics path; it just doesn't collect them for requests targeting covered routes and services. Only when we stop Prometheus from scraping Kong—or when we decrease its scrape frequency, elongating the periods between scrape requests—does the problem abate.

While running top on the same machine on which Kong's proxy is running, we can see the reported CPU use by each of the nginx processes swell from around 10% to 50% approximately every ten seconds, where it hovers before falling again about three seconds later.

I have studied the source code for the Prometheus plugin, looking for evidence that a long-running /metrics request could block the concurrent adjustment of metrics made while serving other requests. I do see that Prometheus:collect calls Prometheus:metric_data, which calls self._counter.sync and then calls self.dict:get_keys. That last call hits the ngx.shared.DICT.get_keys function, whose documentation mentions the following:

CAUTION Avoid calling this method on dictionaries with a very large number of keys as it may lock the dictionary for significant amount of time and block Nginx worker processes trying to access the dictionary.

Its implementation involves acquiring and holding a mutex on the shared memory pool. Having seen that, I went looking for some call path involved in updating metrics for other HTTP requests that NGINX is serving concurrently.
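
Roughly, the read side of the scrape looks like this. This is my own sketch of the pattern, not the plugin's actual code; the get_keys(0) call is the step the CAUTION above warns about, since it holds the zone's lock while walking every key. The dictionary name prometheus_metrics matches our lua_shared_dict setting; everything else is illustrative.

```lua
-- Sketch only: where a /metrics handler built on ngx.shared.DICT holds
-- the shared-dictionary lock. Not the plugin's actual implementation.
local dict = ngx.shared.prometheus_metrics

local function serve_metrics()
  -- get_keys(0) returns every key and must lock the zone while it walks
  -- the whole key set; with thousands of keys this is the expensive step.
  local keys = dict:get_keys(0)
  table.sort(keys)

  local lines = {}
  for _, key in ipairs(keys) do
    -- Each get() briefly takes the same lock again.
    local value = dict:get(key)
    if value then
      lines[#lines + 1] = key .. " " .. tostring(value)
    end
  end

  ngx.header["Content-Type"] = "text/plain; version=0.0.4"
  ngx.print(table.concat(lines, "\n"), "\n")
end
```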

PrometheusHandler.log calls exporter.log. Within that latter log function, we find many calls to lua-resty-counter's incr function. That function only mutates what looks to be a worker-local table, whose contents it synchronizes only periodically, via an NGINX timer, by calling the sync function. That sync function repeatedly calls the shared dictionary's incr function, whose implementation also requires acquiring and holding a mutex. However, if those calls only arise in the context of an NGINX timer running in the background, presumably not interlocked with any normal HTTP request processing, it's not yet clear to me how the call to Prometheus:collect winds up getting blocked by this per-worker synchronization with the shared memory dictionary. Am I missing some other call path involved in normal HTTP request processing that would also try to acquire this same mutex?
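
For reference, here is the rough shape of that buffered-counter pattern as I understand it (an approximation, not lua-resty-counter's actual code): increments land in a per-worker Lua table, and a repeating timer flushes that table into the shared dictionary, taking the zone's lock once per buffered key.

```lua
-- Approximate shape of the per-worker buffered counter described above;
-- the real lua-resty-counter differs in its details.
local dict = ngx.shared.prometheus_metrics
local buffered = {}  -- worker-local Lua table: no shared-memory lock needed

local function incr(key, delta)
  -- Hot path during request processing: purely worker-local.
  buffered[key] = (buffered[key] or 0) + (delta or 1)
end

local function sync()
  -- Runs from a background timer, once per sync interval.
  local pending = buffered
  buffered = {}
  for key, delta in pairs(pending) do
    -- Each shared-dictionary incr acquires and releases the zone's lock.
    dict:incr(key, delta, 0)
  end
end

local function start(interval)
  -- Schedule sync() to run repeatedly in this worker.
  ngx.timer.every(interval, sync)
end
```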

I also considered the two calls to ngx.sleep—one in prometheus.del, and one in prometheus.reset—as each of those could cause a stall for the hard-coded synchronization interval of one second. However, since the only metric that gets reset frequently is a gauge, we don't wind up calling on ngx.sleep.

Expected Behavior

It would be acceptable if using Kong's Prometheus plugin added a millisecond or two to the request serving duration, to allow for counter and gauge adjustments—ideally as atomic operations. We don't expect to suffer these drastic variations in request serving durations while Prometheus is scraping Kong's proxy.

Steps To Reproduce

  1. Create a KongClusterPlugin with its "plugin" field set to "prometheus," labeled with "global" and a value of "true."
  2. Configure Prometheus to scrape Kong's proxy once every ten seconds.
  3. Inspect the Kong proxy's CPU use, and note the oscillation upward to approximately 50% at the same frequency as Prometheus's scraping.
  4. Get close to Kong's proxy—ideally on the same machine—and issue requests that it can serve quickly, such as those that it rejects easily by responding with HTTP status code 404.
  5. Confirm that Kong can serve these requests within tens of milliseconds (less than one hundred milliseconds, or a tenth of a second).
  6. Issue progressively larger batches of these requests, noting the maximum duration to receive a response for each request.
  7. Observe that this maximum duration spikes upward from tens of milliseconds to as long as two seconds when the requests overlap with Prometheus scraping Kong's proxy.

Anything else?

Operating system: Flatcar Container Linux version 2765.1.0, kernel version 5.10.16-flatcar

Kong configuration:

HTTP GET against Admin port ```json { "lua_version": "LuaJIT 2.1.0-beta3", "tagline": "Welcome to kong", "plugins": { "available_on_server": { "ip-restriction": true, "request-transformer": true, "response-transformer": true, "request-size-limiting": true, "rate-limiting": true, "response-ratelimiting": true, "syslog": true, "loggly": true, "datadog": true, "ldap-auth": true, "statsd": true, "bot-detection": true, "aws-lambda": true, "request-termination": true, "prometheus": true, "proxy-cache": true, "session": true, "acme": true, "grpc-gateway": true, "grpc-web": true, "pre-function": true, "post-function": true, "azure-functions": true, "zipkin": true, "hasher": true, "kafka-upstream": true, "exit-transformer": true, "key-auth-enc": true, "upstream-timeout": true, "mocking": true, "opa": true, "jq": true, "jwt": true, "acl": true, "correlation-id": true, "cors": true, "oauth2": true, "tcp-log": true, "application-registration": true, "oauth2-introspection": true, "proxy-cache-advanced": true, "openid-connect": true, "forward-proxy": true, "canary": true, "request-transformer-advanced": true, "response-transformer-advanced": true, "rate-limiting-advanced": true, "ldap-auth-advanced": true, "statsd-advanced": true, "route-by-header": true, "jwt-signer": true, "vault-auth": true, "collector": true, "request-validator": true, "mtls-auth": true, "graphql-proxy-cache-advanced": true, "graphql-rate-limiting-advanced": true, "degraphql": true, "route-transformer-advanced": true, "kafka-log": true, "udp-log": true, "file-log": true, "http-log": true, "key-auth": true, "hmac-auth": true, "basic-auth": true }, "disabled_on_server": {}, "enabled_in_cluster": [ "zipkin", "acl", "basic-auth", "hasher", "request-transformer", "rate-limiting", "rate-limiting-advanced", "prometheus", "ip-restriction" ] }, "configuration": { "cassandra_repl_strategy": "SimpleStrategy", "cassandra_repl_factor": 1, "cassandra_data_centers": [ "dc1:2", "dc2:3" ], "cassandra_schema_consensus_timeout": 10000, "ssl_protocols": "TLSv1.1 TLSv1.2 TLSv1.3", "nginx_http_ssl_protocols": "TLSv1.2 TLSv1.3", "nginx_stream_ssl_protocols": "TLSv1.2 TLSv1.3", "ssl_prefer_server_ciphers": "on", "nginx_http_ssl_prefer_server_ciphers": "off", "nginx_stream_ssl_prefer_server_ciphers": "off", "ssl_dhparam": "ffdhe2048", "nginx_http_ssl_dhparam": "ffdhe2048", "nginx_stream_ssl_dhparam": "ffdhe2048", "ssl_session_tickets": "on", "nginx_http_ssl_session_tickets": "on", "nginx_stream_ssl_session_tickets": "on", "ssl_session_timeout": "1d", "nginx_http_ssl_session_timeout": "1d", "nginx_stream_ssl_session_timeout": "1d", "proxy_access_log": "/dev/stdout access_fmt", "proxy_error_log": "/dev/stderr", "proxy_stream_access_log": "logs/access.log basic", "proxy_stream_error_log": "logs/error.log", "admin_access_log": "/dev/stdout", "admin_error_log": "/dev/stderr", "vitals_delete_interval_pg": 30, "status_error_log": "/dev/stderr", "vitals_ttl_seconds": 3600, "vitals_ttl_minutes": 1500, "vaults": [ "off" ], "vitals_ttl_days": 0, "lua_ssl_trusted_certificate": {}, "lua_ssl_verify_depth": 1, "lua_ssl_protocols": "TLSv1.1 TLSv1.2 TLSv1.3", "proxy_ssl_enabled": true, "nginx_stream_lua_ssl_protocols": "TLSv1.1 TLSv1.2 TLSv1.3", "lua_socket_pool_size": 30, "cluster_control_plane": "127.0.0.1:8005", "cluster_mtls": "shared", "nginx_proxy_real_ip_recursive": "off", "client_max_body_size": "0", "loaded_vaults": {}, "plugins": [ "bundled", "hasher" ], "nginx_http_client_body_buffer_size": "10m", "pg_port": 5432, "status_listeners": [ { "port": 8100, "ssl": 
false, "ip": "0.0.0.0", "listener": "0.0.0.0:8100" } ], "pg_ssl": false, "pg_ssl_required": false, "pg_ssl_verify": false, "pg_ssl_version": "tlsv1", "cluster_listeners": [ { "backlog=%d+": false, "ssl": false, "ip": "0.0.0.0", "bind": false, "port": 8005, "listener": "0.0.0.0:8005", "http2": false, "proxy_protocol": false, "deferred": false, "reuseport": false } ], "pg_max_concurrent_queries": 0, "pg_semaphore_timeout": 60000, "pg_keepalive_timeout": 60000, "vitals_statsd_udp_packet_size": 1024, "pg_ro_ssl": false, "pg_ro_ssl_required": false, "rbac": "off", "enforce_rbac": "off", "event_hooks_enabled": true, "portal": false, "anonymous_reports": true, "cassandra_contact_points": [ "127.0.0.1" ], "cassandra_port": 9042, "cassandra_ssl": false, "cassandra_ssl_verify": false, "cassandra_write_consistency": "ONE", "route_validation_strategy": "smart", "cluster_telemetry_listeners": [ { "backlog=%d+": false, "ssl": false, "ip": "0.0.0.0", "bind": false, "port": 8006, "listener": "0.0.0.0:8006", "http2": false, "proxy_protocol": false, "deferred": false, "reuseport": false } ], "cassandra_read_consistency": "ONE", "cassandra_lb_policy": "RequestRoundRobin", "dns_hostsfile": "/etc/hosts", "cassandra_refresh_frequency": 60, "dns_error_ttl": 1, "dns_not_found_ttl": 30, "portal_api_listen": [ "0.0.0.0:8004", "0.0.0.0:8447 ssl" ], "cassandra_timeout": 5000, "pg_timeout": 5000, "dns_cache_size": 10000, "worker_state_update_frequency": 5, "admin_gui_listen": [ "0.0.0.0:8002", "0.0.0.0:8445 ssl" ], "admin_gui_ssl_enabled": true, "dns_no_sync": false, "smtp_admin_emails": {}, "ssl_cipher_suite": "intermediate", "ssl_ciphers": "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384", "data_plane_config_cache_mode": "unencrypted", "cluster_telemetry_endpoint": "127.0.0.1:8006", "admin_approved_email": "true", "lua_package_cpath": "", "portal_email_verification": false, "portal_api_ssl_cert_key": {}, "portal_api_ssl_cert": {}, "portal_api_error_log": "logs/portal_api_error.log", "portal_api_access_log": "logs/portal_api_access.log", "portal_gui_ssl_cert_key": {}, "portal_gui_ssl_cert": {}, "admin_gui_ssl_cert_key": [ "/usr/local/kong/ssl/admin-gui-kong-default.key", "/usr/local/kong/ssl/admin-gui-kong-default-ecdsa.key" ], "admin_gui_ssl_cert": [ "/usr/local/kong/ssl/admin-gui-kong-default.crt", "/usr/local/kong/ssl/admin-gui-kong-default-ecdsa.crt" ], "admin_gui_auth_login_attempts": 0, "admin_gui_access_log": "logs/admin_gui_access.log", "admin_gui_error_log": "logs/admin_gui_error.log", "audit_log_payload_exclude": [ "token", "secret", "password" ], "audit_log_record_ttl": 2592000, "audit_log_ignore_tables": {}, "audit_log_ignore_paths": {}, "audit_log_ignore_methods": {}, "vitals_statsd_prefix": "kong", "portal_api_ssl_cert_key_default_ecdsa": "/usr/local/kong/ssl/portal-api-kong-default-ecdsa.key", "portal_api_ssl_cert_default_ecdsa": "/usr/local/kong/ssl/portal-api-kong-default-ecdsa.crt", "portal_gui_ssl_cert_key_default": "/usr/local/kong/ssl/portal-gui-kong-default.key", "go_plugins_dir": "off", "portal_gui_ssl_cert_default": "/usr/local/kong/ssl/portal-gui-kong-default.crt", "portal_gui_ssl_cert_key_default_ecdsa": "/usr/local/kong/ssl/portal-gui-kong-default-ecdsa.key", "portal_gui_ssl_cert_default_ecdsa": "/usr/local/kong/ssl/portal-gui-kong-default-ecdsa.crt", "go_pluginserver_exe": "/usr/local/bin/go-pluginserver", 
"vitals": false, "portal_api_ssl_cert_default": "/usr/local/kong/ssl/portal-api-kong-default.crt", "admin_gui_ssl_cert_key_default_ecdsa": "/usr/local/kong/ssl/admin-gui-kong-default-ecdsa.key", "tracing_types": { "1": "all", "all": true }, "admin_gui_ssl_cert_default_ecdsa": "/usr/local/kong/ssl/admin-gui-kong-default-ecdsa.crt", "admin_gui_ssl_cert_key_default": "/usr/local/kong/ssl/admin-gui-kong-default.key", "generate_trace_details": false, "admin_gui_ssl_cert_default": "/usr/local/kong/ssl/admin-gui-kong-default.crt", "nginx_portal_gui_err_logs": "/usr/local/kong/logs/portal_gui_error.log", "nginx_portal_gui_acc_logs": "/usr/local/kong/logs/portal_gui_access.log", "nginx_portal_api_err_logs": "/usr/local/kong/logs/portal_api_error.log", "nginx_portal_api_acc_logs": "/usr/local/kong/logs/portal_api_access.log", "nginx_main_worker_rlimit_nofile": "auto", "nginx_events_worker_connections": "auto", "nginx_events_multi_accept": "on", "db_update_frequency": 5, "db_update_propagation": 0, "nginx_admin_client_max_body_size": "10m", "db_cache_ttl": 0, "nginx_admin_client_body_buffer_size": "10m", "nginx_http_lua_regex_match_limit": "100000", "portal_auto_approve": false, "portal_reset_email": true, "portal_reset_success_email": true, "database": "off", "pg_user": "kong", "nginx_http_log_format": "access_fmt '$remote_addr - $remote_user [$time_iso8601] $host \"$request\" $status $body_bytes_sent $request_time \"$http_referer\" \"$http_user_agent\"'", "portal_approved_email": true, "nginx_http_proxy_max_temp_file_size": "4096m", "portal_access_request_email": true, "portal_cors_origins": {}, "portal_invite_email": true, "portal_gui_access_log": "logs/portal_gui_access.log", "portal_gui_error_log": "logs/portal_gui_error.log", "portal_is_legacy": false, "admin_ssl_enabled": true, "status_ssl_enabled": false, "cassandra_keyspace": "kong", "cassandra_username": "kong", "db_cache_warmup_entities": [ "services" ], "nginx_main_directives": [ { "name": "daemon", "value": "off" }, { "name": "user", "value": "kong kong" }, { "name": "worker_processes", "value": "2" }, { "name": "worker_rlimit_nofile", "value": "auto" } ], "nginx_events_directives": [ { "name": "multi_accept", "value": "on" }, { "name": "worker_connections", "value": "auto" } ], "smtp_timeout_read": 60000, "untrusted_lua": "sandbox", "worker_consistency": "strict", "smtp_timeout_connect": 60000, "nginx_upstream_directives": {}, "portal_gui_protocol": "http", "nginx_proxy_directives": [ { "name": "real_ip_header", "value": "X-Real-IP" }, { "name": "real_ip_recursive", "value": "off" } ], "nginx_status_directives": {}, "smtp_ssl": false, "nginx_admin_directives": [ { "name": "client_body_buffer_size", "value": "10m" }, { "name": "client_max_body_size", "value": "10m" } ], "smtp_starttls": false, "nginx_stream_directives": [ { "name": "lua_shared_dict", "value": "stream_prometheus_metrics 5m" }, { "name": "lua_ssl_protocols", "value": "TLSv1.1 TLSv1.2 TLSv1.3" }, { "name": "ssl_dhparam", "value": "/usr/local/kong/ssl/ffdhe2048.pem" }, { "name": "ssl_prefer_server_ciphers", "value": "off" }, { "name": "ssl_protocols", "value": "TLSv1.2 TLSv1.3" }, { "name": "ssl_session_tickets", "value": "on" }, { "name": "ssl_session_timeout", "value": "1d" } ], "smtp_host": "localhost", "nginx_supstream_directives": {}, "nginx_sproxy_directives": {}, "nginx_http_upstream_directives": {}, "nginx_http_status_directives": {}, "nginx_pid": "/usr/local/kong/pids/nginx.pid", "lua_package_path": "/opt/?.lua;;", "nginx_err_logs": "/usr/local/kong/logs/error.log", 
"nginx_acc_logs": "/usr/local/kong/logs/access.log", "admin_acc_logs": "/usr/local/kong/logs/admin_access.log", "nginx_conf": "/usr/local/kong/nginx.conf", "nginx_kong_conf": "/usr/local/kong/nginx-kong.conf", "nginx_kong_stream_conf": "/usr/local/kong/nginx-kong-stream.conf", "kong_env": "/usr/local/kong/.kong_env", "pluginserver_names": {}, "ssl_cert_default": "/usr/local/kong/ssl/kong-default.crt", "portal_auth_login_attempts": 0, "ssl_cert_key_default": "/usr/local/kong/ssl/kong-default.key", "admin_gui_listeners": [ { "backlog=%d+": false, "ssl": false, "ip": "0.0.0.0", "bind": false, "port": 8002, "listener": "0.0.0.0:8002", "http2": false, "proxy_protocol": false, "deferred": false, "reuseport": false }, { "backlog=%d+": false, "ssl": true, "ip": "0.0.0.0", "bind": false, "port": 8445, "listener": "0.0.0.0:8445 ssl", "http2": false, "proxy_protocol": false, "deferred": false, "reuseport": false } ], "ssl_cert_key_default_ecdsa": "/usr/local/kong/ssl/kong-default-ecdsa.key", "client_ssl_cert_default": "/usr/local/kong/ssl/kong-default.crt", "client_ssl_cert_key_default": "/usr/local/kong/ssl/kong-default.key", "admin_ssl_cert_default": "/usr/local/kong/ssl/admin-kong-default.crt", "admin_ssl_cert_key_default": "/usr/local/kong/ssl/admin-kong-default.key", "admin_ssl_cert_default_ecdsa": "/usr/local/kong/ssl/admin-kong-default-ecdsa.crt", "admin_ssl_cert_key_default_ecdsa": "/usr/local/kong/ssl/admin-kong-default-ecdsa.key", "status_ssl_cert_default": "/usr/local/kong/ssl/status-kong-default.crt", "status_ssl_cert_key_default": "/usr/local/kong/ssl/status-kong-default.key", "status_ssl_cert_default_ecdsa": "/usr/local/kong/ssl/status-kong-default-ecdsa.crt", "status_ssl_cert_key_default_ecdsa": "/usr/local/kong/ssl/status-kong-default-ecdsa.key", "admin_gui_flags": "{}", "admin_gui_auth_header": "******", "rbac_auth_header": "Kong-Admin-Token", "prefix": "/usr/local/kong", "port_maps": [ "80:8000", "443:8443" ], "admin_listen": [ "127.0.0.1:8444 ssl" ], "status_listen": [ "0.0.0.0:8100" ], "stream_listen": [ "off" ], "cluster_listen": [ "0.0.0.0:8005" ], "admin_ssl_cert": [ "/usr/local/kong/ssl/admin-kong-default.crt", "/usr/local/kong/ssl/admin-kong-default-ecdsa.crt" ], "admin_ssl_cert_key": [ "/usr/local/kong/ssl/admin-kong-default.key", "/usr/local/kong/ssl/admin-kong-default-ecdsa.key" ], "nginx_http_lua_shared_dict": "prometheus_metrics 20m", "status_ssl_cert_key": {}, "db_resurrect_ttl": 30, "nginx_user": "kong kong", "nginx_main_user": "kong kong", "nginx_daemon": "off", "nginx_main_daemon": "off", "nginx_worker_processes": "2", "nginx_main_worker_processes": "2", "trusted_ips": {}, "real_ip_header": "X-Real-IP", "nginx_proxy_real_ip_header": "X-Real-IP", "real_ip_recursive": "off", "upstream_keepalive_idle_timeout": 60, "upstream_keepalive_max_requests": 100, "keyring_strategy": "cluster", "client_ssl": false, "vitals_flush_interval": 10, "status_access_log": "off", "nginx_optimizations": true, "kic": true, "cluster_ocsp": "off", "admin_listeners": [ { "backlog=%d+": false, "ssl": true, "ip": "127.0.0.1", "bind": false, "port": 8444, "listener": "127.0.0.1:8444 ssl", "http2": false, "proxy_protocol": false, "deferred": false, "reuseport": false } ], "cluster_max_payload": 4194304, "proxy_listeners": [ { "backlog=%d+": false, "ssl": false, "ip": "0.0.0.0", "bind": false, "port": 8000, "listener": "0.0.0.0:8000", "http2": false, "proxy_protocol": false, "deferred": false, "reuseport": false }, { "backlog=%d+": false, "ssl": true, "ip": "0.0.0.0", "bind": false, "port": 8443, 
"listener": "0.0.0.0:8443 ssl http2", "http2": true, "proxy_protocol": false, "deferred": false, "reuseport": false } ], "client_body_buffer_size": "8k", "stream_listeners": {}, "portal_token_exp": 21600, "nginx_http_directives": [ { "name": "client_body_buffer_size", "value": "10m" }, { "name": "client_max_body_size", "value": "0" }, { "name": "log_format", "value": "access_fmt '$remote_addr - $remote_user [$time_iso8601] $host \"$request\" $status $body_bytes_sent $request_time \"$http_referer\" \"$http_user_agent\"'" }, { "name": "lua_regex_match_limit", "value": "100000" }, { "name": "lua_shared_dict", "value": "prometheus_metrics 20m" }, { "name": "lua_ssl_protocols", "value": "TLSv1.1 TLSv1.2 TLSv1.3" }, { "name": "proxy_max_temp_file_size", "value": "4096m" }, { "name": "ssl_dhparam", "value": "/usr/local/kong/ssl/ffdhe2048.pem" }, { "name": "ssl_prefer_server_ciphers", "value": "off" }, { "name": "ssl_protocols", "value": "TLSv1.2 TLSv1.3" }, { "name": "ssl_session_tickets", "value": "on" }, { "name": "ssl_session_timeout", "value": "1d" } ], "dns_resolver": {}, "admin_emails_from": "\"\"", "dns_stale_ttl": 4, "portal_gui_listen": [ "0.0.0.0:8003", "0.0.0.0:8446 ssl" ], "dns_order": [ "LAST", "SRV", "A", "CNAME" ], "upstream_keepalive_pool_size": 60, "stream_proxy_ssl_enabled": false, "pg_ro_ssl_verify": false, "cluster_allowed_common_names": {}, "nginx_http_client_max_body_size": "0", "untrusted_lua_sandbox_environment": {}, "smtp_port": 25, "untrusted_lua_sandbox_requires": [ "resty.sha256", "resty.string" ], "portal_gui_host": "127.0.0.1:8003", "audit_log": false, "loaded_plugins": { "ip-restriction": true, "request-transformer": true, "response-transformer": true, "request-size-limiting": true, "rate-limiting": true, "response-ratelimiting": true, "syslog": true, "loggly": true, "datadog": true, "ldap-auth": true, "statsd": true, "bot-detection": true, "aws-lambda": true, "request-termination": true, "prometheus": true, "proxy-cache": true, "session": true, "acme": true, "grpc-gateway": true, "grpc-web": true, "pre-function": true, "post-function": true, "azure-functions": true, "zipkin": true, "hasher": true, "kafka-upstream": true, "exit-transformer": true, "key-auth-enc": true, "upstream-timeout": true, "mocking": true, "opa": true, "jq": true, "jwt": true, "acl": true, "correlation-id": true, "cors": true, "oauth2": true, "tcp-log": true, "application-registration": true, "oauth2-introspection": true, "proxy-cache-advanced": true, "openid-connect": true, "forward-proxy": true, "canary": true, "request-transformer-advanced": true, "response-transformer-advanced": true, "rate-limiting-advanced": true, "ldap-auth-advanced": true, "statsd-advanced": true, "route-by-header": true, "jwt-signer": true, "vault-auth": true, "collector": true, "request-validator": true, "mtls-auth": true, "graphql-proxy-cache-advanced": true, "graphql-rate-limiting-advanced": true, "degraphql": true, "route-transformer-advanced": true, "kafka-log": true, "udp-log": true, "file-log": true, "http-log": true, "key-auth": true, "hmac-auth": true, "basic-auth": true }, "portal_app_auth": "kong-oauth2", "proxy_listen": [ "0.0.0.0:8000", "0.0.0.0:8443 ssl http2" ], "error_default_type": "text/plain", "cluster_data_plane_purge_delay": 1209600, "portal_gui_use_subdomains": false, "host_ports": { "8000": 80, "8443": 443 }, "enabled_headers": { "X-Kong-Response-Latency": true, "Via": true, "Server": true, "X-Kong-Upstream-Latency": true, "X-Kong-Upstream-Status": false, "server_tokens": true, "latency_tokens": 
true, "X-Kong-Admin-Latency": true, "X-Kong-Proxy-Latency": true }, "smtp_mock": true, "pg_database": "kong", "pg_host": "127.0.0.1", "log_level": "notice", "smtp_domain": "localhost.localdomain", "headers": [ "server_tokens", "latency_tokens" ], "mem_cache_size": "128m", "tracing": false, "ssl_cert_csr_default": "/usr/local/kong/ssl/kong-default.csr", "status_ssl_cert": {}, "admin_invitation_expiry": 259200, "ssl_cert": [ "/usr/local/kong/ssl/kong-default.crt", "/usr/local/kong/ssl/kong-default-ecdsa.crt" ], "portal_api_ssl_cert_key_default": "/usr/local/kong/ssl/portal-api-kong-default.key", "keyring_enabled": false, "ssl_cert_key": [ "/usr/local/kong/ssl/kong-default.key", "/usr/local/kong/ssl/kong-default-ecdsa.key" ], "role": "traditional", "smtp_timeout_send": 60000, "cluster_telemetry_listen": [ "0.0.0.0:8006" ], "tracing_write_strategy": "file", "vitals_strategy": "database", "ssl_cert_default_ecdsa": "/usr/local/kong/ssl/kong-default-ecdsa.crt", "vitals_prometheus_scrape_interval": 5, "nginx_http_lua_ssl_protocols": "TLSv1.1 TLSv1.2 TLSv1.3", "tracing_time_threshold": 0 }, "hostname": "ingress-kong-c749b599b-cv59f", "node_id": "cb063959-e610-43ac-90e3-ed8dfa06ccaa", "license": { "support_plan": "Platinum", "admin_seats": "Unlimited", "customer": "Coinbase Inc.", "dataplanes": "0", "license_creation_date": "2022-01-31", "license_expiration_date": "2023-01-30", "product_subscription": "Kong Enterprise Edition" }, "timers": { "pending": 11, "running": 1 }, "version": "2.8.1.0-enterprise-edition", "pids": { "workers": [ 2070, 2071 ], "master": 1 } } ```
fffonion commented 2 years ago

We can add some yields to the scrape function. If you have lots of entities collected, the worker process won't be able to handle other requests until the full collect is done. How many services, routes, and consumers do you have in Kong?
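
By "add some yields" I mean something along these lines; this is only a rough sketch of the general technique (yield back to the event loop every so many keys), not a concrete patch:

```lua
-- Rough sketch: yield periodically while walking a large key set, so other
-- requests handled by the same worker can make progress between batches.
-- The batch size of 1000 is arbitrary.
local dict = ngx.shared.prometheus_metrics

local function collect_with_yields()
  local keys = dict:get_keys(0)  -- still pays the full get_keys cost once
  for i, key in ipairs(keys) do
    local value = dict:get(key)
    if value then
      ngx.print(key, " ", tostring(value), "\n")
    end
    if i % 1000 == 0 then
      ngx.sleep(0)  -- yield to the nginx event loop
    end
  end
end
```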

seh commented 2 years ago

In one Kubernetes cluster suffering this problem, we have 428 services and 1,633 routes.

I understand that yielding occasionally during the scrape might help, but the real problem here is the periodic call to sync in lua-resty-counter. For each NGINX worker, once per second it acquires and releases the mutex on the shared memory dictionary many times—once per entry in the increments table. That causes high contention for this mutex and blocks each NGINX worker during the long call to get_keys.

Better would be for each NGINX worker to only keep track of its metric values locally, and have them each respond to a "scrape" request—be it via IPC (per an example from five years ago) or an HTTP request—from the worker that handles the collect call upon the Prometheus scrape request. That way, the individual workers would not contend regularly to update their metric values, and we'd only aggregate them when necessary upon each Prometheus scrape request.
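
To make that concrete, the worker-local half might look something like the sketch below; the cross-worker gathering (via IPC or per-worker endpoints) is the hard part and is only hinted at in a comment. All names here are hypothetical.

```lua
-- Hypothetical worker-local metric store: requests only touch this worker's
-- own Lua table, so there is no shared-memory lock on the hot path.
local local_counters = {}

local function incr(key, delta)
  local_counters[key] = (local_counters[key] or 0) + (delta or 1)
end

-- Snapshot this worker's counters. A scrape would have to gather one
-- snapshot per worker (via IPC or a per-worker endpoint) and sum them.
local function snapshot()
  local copy = {}
  for key, value in pairs(local_counters) do
    copy[key] = value
  end
  return copy
end
```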

seh commented 2 years ago

I have been experimenting with a modified copy of the Prometheus plugin, testing my hypotheses about the cause of this problem. It turns out that I was wrong on several counts or, more forgivingly, I was at least wrong about the significance of several design choices.

I disabled the following operations both in isolation and in various combinations, repeatedly issuing requests to see whether the approximately 1.5 second delay remained:

To my surprise, none of these have any observable effect on the maximum request serving duration. The only operation that seems to have any effect is the fetching of the set of keys from the shared dictionary via the get_keys function.

At present, in one of the Kubernetes clusters in which I've been testing, our shared dictionary has 3,284 keys. I suspect that the key count is higher—on the order of twice that—in some of our other clusters.

I studied both the documentation and the implementation of the ngx.shared.DICT portion of OpenResty's lua-nginx-module. I don't see any way to request only a small subset of this key set in a way that would allow the caller to page through the full set. The caller can limit how many keys to receive, but there's no way to skip a prefix of the linearized set based on what the caller has already seen.
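
Concretely, the only knob the function exposes is a maximum count; there is no offset or cursor with which to ask for the next batch:

```lua
local dict = ngx.shared.prometheus_metrics

local all_keys   = dict:get_keys(0)     -- 0 means "no limit": every key
local first_1024 = dict:get_keys(1024)  -- at most 1024 keys, but always the
                                        -- same leading ones; no way to ask
                                        -- for the keys that come after them
```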

I think the next step is to experiment with an alternative design for this bookkeeping. Again, the ngx_lua_ipc library looks like a promising way to eliminate our heavy use of this shared dictionary, but getting a Kong container image with OpenResty built with that module included is quite difficult.

seh commented 2 years ago

One incremental approach I neglected to mention is to shard or partition the single shared memory zone used for the Prometheus metrics dictionary. Choose a fixed split factor such as 256, create that many shared memory zones, write a surjective projection function that maps from a short metric name to an integer in [0, 255], and access the dictionary in the appropriate shared memory zone for each metric name.

That's more bookkeeping overhead, but if we assume that the projection function won't yield "hot" partitions, we'd divide the size of the key set retrieved from each shared dictionary by the chosen factor. The overhead of making, say, 256 get_keys calls—each returning 1/256 of the keys—may turn out to delay concurrent requests even longer; alternatively, it may provide enough opportunity to interleave processing to reduce the apparent delay.
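
A rough sketch of what I have in mind, assuming shared memory zones named prometheus_metrics_0 through prometheus_metrics_15 had been declared via lua_shared_dict (the zone names, the shard count of 16, and the hash choice are all illustrative):

```lua
-- Illustrative sharding of metric keys across several smaller shared dicts.
local SHARDS = 16

local zones = {}
for i = 0, SHARDS - 1 do
  -- Assumes "lua_shared_dict prometheus_metrics_<i> ..." exists for each i.
  zones[i] = ngx.shared["prometheus_metrics_" .. i]
end

-- Projection function: map a metric key to one of the shards.
local function zone_for(key)
  return zones[ngx.crc32_short(key) % SHARDS]
end

local function incr(key, delta)
  zone_for(key):incr(key, delta, 0)
end

-- A scrape walks the shards one at a time, so each get_keys call locks a
-- much smaller dictionary, and the loop can yield between shards.
local function collect()
  for i = 0, SHARDS - 1 do
    local keys = zones[i]:get_keys(0)
    for _, key in ipairs(keys) do
      local value = zones[i]:get(key)
      if value then
        ngx.print(key, " ", tostring(value), "\n")
      end
    end
    ngx.sleep(0)  -- yield between shards
  end
end
```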

kikito commented 2 years ago

Hello @seh, I was looking at this issue (for which you have done a very good job of analyzing the problem) and realized that you are using Kong Enterprise.

This is the Kong open-source repo. Enterprise issues have a separate tracking (and priority) system, the first step being contacting your sales rep or customer success engineer at Kong. If you have not done so already, I recommend you initiate that as soon as possible.

On your issue: I don't have an answer or a quick fix for it. I can confirm that reading from shared dictionaries, especially via get_keys, needs to get through all the locks, so if the dictionary is constantly being modified by other workers, that can take some time (we had to rethink how we implemented events in 3.0 precisely because of this same problem).

seh commented 2 years ago

Thank you, Enrique. We have been working with Kong account representatives and engineers since I filed this issue, so we are in good hands there.

I deliberately filed it publicly because the Prometheus plugin's source is open, and it behaves no differently in the Enterprise Gateway version. I wanted more eyes on the problem—assuming that there may be other users out there who have not yet realized the cause of this serving delay. After all, it took us about two and a half years to notice it.

We've had to stop not only using the Prometheus plugin (which hooks the "log" serving phase), but also allowing Prometheus to scrape our proxy's Admin API. Removing the plugin was not enough: merely serving HTTP requests against the /metrics path still caused intolerable delays in the proxy's serving of concurrent requests.

ZVilusinsky commented 2 years ago

This was an interesting read, and it might explain some spikes I was experiencing. On behalf of the open-source users, thank you.

oowl commented 1 year ago

Hi all, regarding the Prometheus problem: we have published a new blog article that explains how to use Prometheus with Kong Gateway. https://konghq.com/blog/how-to-use-prometheus-to-monitor-kong-gateway

locao commented 1 year ago

I'm closing this issue as it was fixed by #9028.

seh commented 1 year ago

Partially fixed, that is.

locao commented 1 year ago

Partially fixed, that is.

Yes, you are right. To fix it completely, a considerably deeper change is required. With the original issue unfixed, using Kong in a high-traffic environment with a large number of entities and the Prometheus plugin enabled at the same time was virtually impossible.