Kong / kong-plugin-prometheus

Prometheus plugin for Kong. This plugin has been moved into https://github.com/Kong/kong; please open issues and PRs in that repo.
Apache License 2.0

Prometheus plugin uses too much CPU #43

Closed: zeeshen closed this issue 4 years ago

zeeshen commented 5 years ago

[screenshot] The Prometheus plugin (globally enabled) uses too much CPU (4 cores, 8 workers).

With perf, we can see that ngx_http_lua_ffi_shdict_incr (10.9%) uses more CPU than nginx main (7.8%). [perf screenshot] Same server, same traffic, with the Prometheus plugin disabled, nginx master is at 30%: [perf screenshot]

The heavy CPU load might be related to the heavy use of shared.dict.incr: there are at least 4 calls per request (http_status, latency_sum, latency_count, latency_bucket). Meanwhile, shared.dict.incr (ngx_http_lua_ffi_shdict_incr) is implemented with ngx_shmtx_lock, which is backed by a spinlock.
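A minimal sketch of what that per-request hot path looks like (not the plugin's actual code; the dict and key names are assumed for illustration). Every incr on a lua_shared_dict acquires the shared-memory mutex, so each request takes the same lock at least four times:

```lua
-- Illustrative only: per-request increments against a shared dict.
-- Each incr goes through ngx_shmtx_lock, so the lock is contended by
-- every worker on every request.
local shdict = ngx.shared.prometheus_metrics  -- assumed dict name

local function observe(service, status, latency_ms)
  shdict:incr("http_status:" .. service .. ":" .. status, 1, 0)
  shdict:incr("latency_sum:" .. service, latency_ms, 0)
  shdict:incr("latency_count:" .. service, 1, 0)
  shdict:incr("latency_bucket:" .. service .. ":le_00100", 1, 0)
end
```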

Maybe writing to the shared dict on every request is not a good choice, given that monitoring ends up using more CPU than the main request processing. Would it be OK if every worker kept metrics in its own memory and set a timer to flush them into the shared dict periodically? The loss of "real-time" accuracy should be acceptable, since the Prometheus scrape interval is often set in seconds.
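A rough sketch of that idea, with made-up dict and key names (not the plugin's actual code): counters accumulate in a plain per-worker Lua table, and a recurring timer merges them into the shared dict, so the lock is taken once per key per flush interval instead of several times per request:

```lua
local shdict = ngx.shared.prometheus_metrics  -- assumed dict name
local local_counters = {}                     -- per-worker, no lock needed

-- called on every request instead of shdict:incr()
local function observe(key, delta)
  local_counters[key] = (local_counters[key] or 0) + delta
end

-- merge the worker-local counters into the shared dict
local function flush(premature)
  if premature then
    return
  end
  for key, delta in pairs(local_counters) do
    local_counters[key] = nil
    shdict:incr(key, delta, 0)  -- one locked write per key per interval
  end
end

-- from init_worker_by_lua*: flush once per second in each worker;
-- Prometheus scrape intervals are usually several seconds, so the
-- added staleness should be acceptable
local ok, err = ngx.timer.every(1, flush)
if not ok then
  ngx.log(ngx.ERR, "failed to create metrics flush timer: ", err)
end
```

This trades a bounded amount of staleness (up to one flush interval) for far less lock contention on the request path.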

p0pr0ck5 commented 5 years ago

Just a very brief note, using 8 workers on 4 cores isn't helping anything, as that will just increase lock contention in the critical path :)

That said, yes, we've recently discussed a few ways to optimize some parts of the design. A few brief questions: how many Services are defined in your Kong cluster, and what does your throughput look like?

zeeshen commented 5 years ago

using 8 workers on 4 cores isn't helping anything

Yes, we've updated this config.

how many Services are defined

283

what does your throughput look like

2k+ QPS for a single Kong node at peak times. Each Kong node runs on an AWS m4.xlarge instance (4 cores, 16 GB).

p0pr0ck5 commented 5 years ago

Thanks for the update!

dliberman commented 5 years ago

Would a simple workaround be to introduce an option in the Prometheus plugin config to select which metrics to collect? Say I'm not interested in latency, just HTTP status and bandwidth consumption: that would cut in half the number of shared dict accesses used to store stats on every request.

hbagdi commented 5 years ago

Would a simple workaround be to introduce an option in the Prometheus plugin config to select which metrics to collect? Say I'm not interested in latency, just HTTP status and bandwidth consumption: that would cut in half the number of shared dict accesses used to store stats on every request.

Certainly. This requires some changes to the backing Prometheus library that we are using, but it is a feature that we would like to support.

Another solution that will greatly improve the plugin's performance is storing all the metrics at the worker level and syncing them periodically into the shared dict, which will greatly reduce the contention due to locks.
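As a rough illustration of the metric-selection idea, the plugin's log phase could skip whole metric families based on config. The config field names and dict/key names below are hypothetical and were not part of the plugin at the time of this thread:

```lua
-- Hypothetical per-family switches in the plugin config; skipping a
-- family removes its shared-dict writes from the request path entirely.
local shdict = ngx.shared.prometheus_metrics  -- assumed dict name

local function log(conf)
  local service = (kong.router.get_service() or {}).name or "unknown"

  if conf.status_code_metrics then
    local status = kong.response.get_status()
    shdict:incr("http_status:" .. service .. ":" .. status, 1, 0)
  end

  if conf.latency_metrics then
    local latency_ms = (tonumber(ngx.var.request_time) or 0) * 1000
    shdict:incr("latency_sum:" .. service, latency_ms, 0)
    shdict:incr("latency_count:" .. service, 1, 0)
  end

  if conf.bandwidth_metrics then
    shdict:incr("bandwidth:" .. service, tonumber(ngx.var.bytes_sent) or 0, 0)
  end
end
```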

chensunny commented 5 years ago

https://github.com/Kong/kong-plugin-prometheus/issues/43