Kong / kong

🦍 The Cloud-Native API Gateway and AI Gateway.
https://konghq.com/install/#kong-community
Apache License 2.0

Plugin and timer performance on Kong 3.x #9964

Closed: dcarley closed this issue 1 year ago

dcarley commented 1 year ago

Is there an existing issue for this?

Kong version ($ kong version)

3.0.1 (also tested against 3.1.0)

Current Behavior

(sorry, this is long)

We have a set of integration tests that run against our installation of Kong. While testing an upgrade from Kong 2.8.3 to Kong 3.x (initially 3.0.0, but we've more recently tried 3.0.1 and 3.1.0), our rate limiting tests have been failing intermittently, with more requests allowed than there should be. When this happens it appears to take a while to recover, and sometimes it doesn't recover at all. That isn't a terribly useful report without a complete reproduction of how we deploy and test Kong, though, so I've attempted to narrow it down.

One way that I've been able to reproduce the same symptoms is by configuring the Zipkin plugin (which we normally use) with an endpoint that times out (e.g. an unreachable IP address). When generating requests against 3.x this immediately causes more requests to be allowed than there should be. When increasing the request rate it eventually causes other plugins and operations that depend on timers to also fail and not recover:

api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:18 [error] 1109#0: *17773 [kong] handler.lua:106 zipkin reporter flush failed to request: timeout, context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:19 [error] 1109#0: *16756 [kong] handler.lua:106 zipkin reporter flush failed to request: timeout, context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:19 [error] 1109#0: *18694 [kong] handler.lua:106 zipkin reporter flush failed to request: timeout, context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:19 [error] 1109#0: *18691 [kong] handler.lua:106 zipkin reporter flush failed to request: timeout, context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:19 [warn] 1110#0: *151 [lua] super.lua:120: scaling_execute(): [timer-ng] overload: load_avg: 10.239453125, runable_jobs_avg: 2621.3, alive_threads_avg: 256, context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:31 [warn] 1110#0: *151 [lua] super.lua:120: scaling_execute(): [timer-ng] overload: load_avg: 10.26875, runable_jobs_avg: 2628.8, alive_threads_avg: 256, context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:44 [warn] 1110#0: *151 [lua] super.lua:120: scaling_execute(): [timer-ng] overload: load_avg: 10.294921875, runable_jobs_avg: 2635.5, alive_threads_avg: 256, context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:57 [warn] 1110#0: *151 [lua] super.lua:120: scaling_execute(): [timer-ng] overload: load_avg: 10.324609375, runable_jobs_avg: 2643.1, alive_threads_avg: 256, context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:57 [error] 1110#0: *16493 [kong] handler.lua:74 failed to create Statsd logger: failed to connect to metrics.kube-system.svc.cluster.local.:8125: [cosocket] DNS resolution failed: dns lookup pool exceeded retries (1): timeout. Tried: ["(short)metrics.kube-system.svc.cluster.local.:(na) - cache-miss","metrics.kube-system.svc.cluster.local.:33 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","metrics.kube-system.svc.cluster.local.:1 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","metrics.kube-system.svc.cluster.local.:5 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout"], context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:57 [error] 1110#0: *16494 [kong] handler.lua:74 failed to create Statsd logger: failed to connect to metrics.kube-system.svc.cluster.local.:8125: [cosocket] DNS resolution failed: dns lookup pool exceeded retries (1): timeout. Tried: ["(short)metrics.kube-system.svc.cluster.local.:(na) - cache-miss","metrics.kube-system.svc.cluster.local.:33 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","metrics.kube-system.svc.cluster.local.:1 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","metrics.kube-system.svc.cluster.local.:5 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout"], context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:57 [error] 1110#0: *16495 [kong] handler.lua:74 failed to create Statsd logger: failed to connect to metrics.kube-system.svc.cluster.local.:8125: [cosocket] DNS resolution failed: dns lookup pool exceeded retries (1): timeout. Tried: ["(short)metrics.kube-system.svc.cluster.local.:(na) - cache-miss","metrics.kube-system.svc.cluster.local.:33 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","metrics.kube-system.svc.cluster.local.:1 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","metrics.kube-system.svc.cluster.local.:5 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout"], context: ngx.timer
…
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:57 [error] 1110#0: *16501 [kong] init.lua:75 failed to connect to Redis: [cosocket] DNS resolution failed: dns lookup pool exceeded retries (1): timeout. Tried: ["(short)redis.kong.svc.cluster.local.:(na) - cache-miss","redis.kong.svc.cluster.local.:33 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","redis.kong.svc.cluster.local.:1 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","redis.kong.svc.cluster.local.:5 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout"], context: ngx.timer
…
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:57 [error] 1110#0: *16505 [kong] init.lua:75 failed to connect to Redis: [cosocket] DNS resolution failed: dns lookup pool exceeded retries (1): timeout. Tried: ["(short)redis.kong.svc.cluster.local.:(na) - cache-miss","redis.kong.svc.cluster.local.:33 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","redis.kong.svc.cluster.local.:1 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","redis.kong.svc.cluster.local.:5 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout"], context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:57 [error] 1110#0: *16506 [kong] init.lua:75 failed to connect to Redis: [cosocket] DNS resolution failed: dns lookup pool exceeded retries (1): timeout. Tried: ["(short)redis.kong.svc.cluster.local.:(na) - cache-miss","redis.kong.svc.cluster.local.:33 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","redis.kong.svc.cluster.local.:1 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","redis.kong.svc.cluster.local.:5 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout"], context: ngx.timer

The significant difference between 2.x and 3.x is that 3.x appears to fail at lower request rates, earlier on, and doesn't recover. I think rate limiting is acting as an "early warning" of this because delays in running timers mean that usage counters aren't incremented quickly enough within each "second" bucket.
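
One quick way to see the fail-open behaviour from the outside (a rough sketch, assuming the proxy listens on 127.0.0.1:8000 and the rate-limiting plugin's default response headers aren't hidden) is to watch the remaining-requests headers while the Zipkin endpoint is timing out:

# Fire a burst well above the configured per-second limit and print the status
# line plus the remaining-requests counter; if the counter decrements more
# slowly than requests arrive, more requests than the limit get through.
for i in $(seq 1 100); do
  curl -s -o /dev/null -D - http://127.0.0.1:8000/status/200 | grep -iE '^HTTP/|RateLimit-Remaining'
done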

Expected Behavior

The performance of one plugin and its network dependencies shouldn't adversely affect the performance of other plugins and operations.

I expect that changes like https://github.com/Kong/kong/pull/9538 (rate-limiting) and https://github.com/Kong/kong/pull/9521 (Datadog) will alleviate some of the symptoms of this. I wasn't able to reproduce the same problem with the OpenTelemetry plugin, which already batches submissions and flushes them on an interval. I suspect that applying the same changes to the Zipkin plugin would help.

However, I think there's also an underlying problem: timer performance appears to be worse under certain circumstances in 3.x, and these kinds of issues are likely to recur without some safety guards. That is what I'm most interested in focusing on.
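
For reference, this is roughly how we spot the timer backlog while a load test is running (a sketch for a Kubernetes deployment like ours; the label selector is illustrative and will differ per environment):

# Follow the proxy container logs and surface timer-ng overload warnings and
# other errors raised from timer context as they appear.
kubectl logs -f -l app.kubernetes.io/name=kong -c proxy | grep -E 'timer-ng|ngx\.timer'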

Steps To Reproduce

This is the simplest way that I could reproduce the symptom. I wasn't able to push it hard enough to cause timer errors in Gojira, like we can in Minikube, because Docker appeared to fall over first.

Install:

Create egg.yaml:

version: '3.5'
services:
  kong:
    ports:
      - 8000:8000
      - 8001:8001
  httpbin:
    image: "kong/httpbin:latest"
    labels:
      com.konghq.gojira: "True"
    networks:
    - gojira

Create kong.yaml:

_format_version: "2.1"
_transform: true

plugins:
  - name: zipkin
    config:
      http_endpoint: http://172.18.0.254/will/timeout
      sample_ratio: 1
  - name: rate-limiting
    config:
      second: 50
      limit_by: ip
      policy: redis
      redis_host: redis

services:
- name: httpbin
  url: http://httpbin:80
  routes:
  - name: httpbin
    paths:
    - /
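
The 172.18.0.254 address is just an example of an IP that is unreachable from the gojira network. Before running the test it's worth confirming that connections to it actually hang rather than being refused, because a refused connection returns immediately and won't tie up the Zipkin reporter for its full timeout. A quick check (assuming curl is available on the host):

# This should hang until --max-time expires rather than failing immediately
# with "connection refused".
curl -v --max-time 5 http://172.18.0.254/will/timeout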

Start the container:

gojira up --tag 3.0.1
gojira run --tag 3.0.1 make dev
gojira snapshot --tag 3.0.1
gojira up --tag 3.0.1 --egg egg.yaml --yml kong.yaml
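
Before running the load test, confirm that Kong came up with both plugins configured (assuming the Admin API is exposed on 127.0.0.1:8001 as in egg.yaml and jq is installed):

# Expect "zipkin" and "rate-limiting" in the output.
curl -s http://127.0.0.1:8001/plugins | jq -r '.data[].name'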

Run a load test:

echo "GET http://127.0.0.1:8000/status/200" | vegeta attack -rate 200 -duration 2m | tee vegeta.out | vegeta encode | ~/go/bin/jaggr '@count=rps' 'hist[100,200,300,400,500]:code' | tee jaggr.out

Check the results: ideally, with a 50 request/second limit and a 200 request/second attack rate, only around a quarter of the requests should return HTTP 200 and the rest should be rejected with HTTP 429.
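
The status code distribution can also be summarised after the fact from the vegeta output (assuming jq is installed; jaggr already shows a rolling histogram while the attack runs):

# Overall latency and success summary.
vegeta report < vegeta.out

# Count of responses per HTTP status code (200 vs 429).
vegeta encode < vegeta.out | jq -r '.code' | sort | uniq -c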

Anything else?

I've put the results of my tests in a Gist so as not to clutter the issue: https://gist.github.com/dcarley/cc3c8959fd8f6a811d0b3c0ddf458a5c

flrgh commented 1 year ago

@ADD-SP what do ya think?

ADD-SP commented 1 year ago

@flrgh We've run into this type of issue, which is essentially timer abuse, and I think we should fix this issue like https://github.com/Kong/kong/pull/9521 did.

kikito commented 1 year ago

While we will investigate this issue and put some limits on the amount of memory and timers that zipkin consumes, I must point out that this setting is not realistic for production environments:

sample_ratio: 1

The intended usage of the Zipkin plugin is that tracing is done on a sample of all requests; that is why the default value of sample_ratio is 0.001, meaning that 1 out of every 1000 requests produces a trace. Setting it to 1 is akin to setting the log level to debug: something you would only do on a production system temporarily, in an emergency, in order to debug a critical failure.

I will make sure that we point this out in our docs.

dcarley commented 1 year ago

We use a sampling rate of 100% because we do dynamic sampling in Refinery. We've been using this setup with 2.x in production, mostly without problems, since we added timeouts in #8735. We are planning to migrate to the OpenTelemetry plugin, with the same sample rate, once we've validated that the upgrade path to 3.x is safe. Which is why I'm most interested in:

However, I think there's also an underlying problem: timer performance appears to be worse under certain circumstances in 3.x, and these kinds of issues are likely to recur without some safety guards. That is what I'm most interested in focusing on.

I'm not able to view the issues FTI-4367 and FTI-4269 that were referenced in #9521. Are you able to provide some more information about what originally prompted that change?

Is there anything else that I can do to help?

ADD-SP commented 1 year ago

@dcarley I'm curious about the cases in which timer performance becomes bad. In 3.x we introduced lua-resty-timer-ng to reduce timer overhead, so it shouldn't get any worse.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

dcarley commented 1 year ago

@dcarley I'm curious about the cases in which timer performance becomes bad. In 3.x we introduced lua-resty-timer-ng to reduce timer overhead, so it shouldn't get any worse.

We've observed that 3.x performance is worse under load when a plugin makes calls to an external network dependency and those calls take longer than normal.
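
One way to watch timer pressure from the outside while reproducing this (a sketch; it assumes the /timers Admin API endpoint exposed by recent 3.x builds, with the Admin API listening on 127.0.0.1:8001):

# Poll the timer statistics once a second during the load test. The response
# shape may vary between versions, so just pretty-print whatever the endpoint
# returns.
watch -n 1 'curl -s http://127.0.0.1:8001/timers | jq .'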

Are you able to provide any information from FTI-4367, FTI-4269, and FT-3464? Were those issues prompted by timer performance?

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

surenraju-careem commented 1 year ago

Another issue related to timer abuse: https://github.com/Kong/kong/issues/9959

xiaotuzididi commented 1 year ago

Good, thank you.

ADD-SP commented 1 year ago

Hi @dcarley, could you please try 3.2.2 to check whether this issue still exists?

dcarley commented 1 year ago

Hi @dcarley, could you please try 3.2.2 to check whether this issue still exists?

3.2.2 appears to be worse than 3.0.1 for the reproduction case where rate-limiting "fails open" when the zipkin endpoint is timing out. I've updated the results in the gist here: https://gist.github.com/dcarley/cc3c8959fd8f6a811d0b3c0ddf458a5c#gojira-322

StarlightIbuki commented 1 year ago

Dear contributor,

We're closing this issue as there hasn't been any update to it for a long time. If the issue is still relevant in the latest version, please feel free to reopen it; we're more than happy to revisit it. Your contribution is greatly appreciated! Please have a look at our pledge to the community for more information.

Sincerely,
Kong Gateway Team