Kong / kong

🦍 The Cloud-Native API Gateway and AI Gateway.
https://konghq.com/install/#kong-community
Apache License 2.0

Noticing high kong latency after upgrading to 3.1.1 #10141

Closed shashanksapre closed 1 year ago

shashanksapre commented 1 year ago

Is there an existing issue for this?

Kong version ($ kong version)

3.1.1

Current Behavior

After upgrading Kong from version 2.8.3 to 3.1.1, we are noticing a significant increase in Kong latency.

Expected Behavior

Kong latency should be close to that of previous versions, if not lower.

Steps To Reproduce

Created a simple setup to route requests to httpbin.org, deployed using the following Helm release:

apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: kong
  namespace: default
  annotations:
    fluxcd.io/ignore: "false"
    fluxcd.io/automated: "false"
spec:
  releaseName: kong
  chart:
    repository: https://charts.konghq.com
    name: kong
    version: 2.8.2
  values:
    env:
      headers: "off"
      trusted_ips: "0.0.0.0/0,::/0"
      log_level: "debug"
    image:
      repository: kong
      tag: "2.8.3"
      #tag: "3.0.2"
      #tag: "3.1.1"
    admin:
      enabled: false
    proxy:
      enabled: true
      type: ClusterIP
      http:
        enabled: true
      tls:
        enabled: false
      ingress:
        enabled: true
        tls: api.domain
        hostname: api.domain
        annotations:
          kubernetes.io/ingress.class: nginx
          nginx.ingress.kubernetes.io/service-upstream: "true"
          cert-manager.io/cluster-issuer: letsencrypt-prod
        path: /
    dblessConfig:
      config: |
        _format_version: "2.1"
        _transform: true
        services:
        - name: httpbin
          url: https://httpbin.org
          routes:
          - name: httpbin-v1
            paths:
            - /httpbin/v1
        plugins:
        - name: rate-limiting
          config:
            second: 25
            policy: local
        - name: http-log
          config: 
            http_endpoint: http://fluentd.default.svc.cluster.local:9880/kong.log
            method: POST
    ingressController:
      enabled: false
    resources:
      limits: {}
      requests:
        cpu: 50m
        memory: 256Mi
    podAnnotations:
      sidecar.istio.io/inject: "true"
      traffic.sidecar.istio.io/includeInboundPorts: "*"
    replicaCount: 1

Anything else?

The following observations were made using the http-log plugin.

3.1.1


@timestamp | latencies.kong | latencies.proxy | latencies.request
-- | -- | -- | --
Jan 19, 2023 @ 15:33:23.175 | 1 | 274 | 275
Jan 19, 2023 @ 15:33:21.816 | 2 | 278 | 280
Jan 19, 2023 @ 15:33:20.340 | 216 | 284 | 500
Jan 19, 2023 @ 15:31:46.369 | 143 | 1467 | 1610
Jan 19, 2023 @ 15:30:57.921 | 408 | 284 | 692
Jan 19, 2023 @ 15:30:21.981 | 433 | 270 | 703

2.8.3


@timestamp | latencies.kong | latencies.proxy | latencies.request
-- | -- | -- | --
Jan 19, 2023 @ 15:28:00.952 | 2 | 522 | 524
Jan 19, 2023 @ 15:28:00.048 | 2 | 69 | 71
Jan 19, 2023 @ 15:27:58.894 | 2 | 276 | 278
Jan 19, 2023 @ 15:27:40.609 | 3 | 702 | 705
Jan 19, 2023 @ 15:27:13.932 | 12 | 279 | 291
Jan 19, 2023 @ 15:26:22.329 | 39 | 281 | 320

hanshuebner commented 1 year ago

Hello @shashanksapre,

thank you for reaching out. From looking at the numbers that you've posted, I am not able to conclude that Kong 2.8.3 would be generally faster than 3.1.1. The variances in your measurement results are too large to warrant such a conclusion. In order to make comparisons, you would first need to be able to create results that don't vary by an order of magnitude. Also, you are reaching out to an external service (httpbin) in your test. While there is buffering between the proxy path and the sending of data to the http-log upstream, using an external service when assessing Kong's performance is bound to create results that are difficult to compare and analyze.

We're continuously monitoring the performance of our releases, and we have not noticed any dramatic differences between the 2.8 and the 3.x lines. If anything, 3.x has become faster, but it really depends on the plugins that you use whether you can observe such speed improvements.

I would recommend that you perform your testing in an isolated environment on dedicated hardware and with no external service dependencies. Once you're able to generate stable performance numbers for Kong without plugins, try measuring with one or the other plugin added.
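For illustration, a minimal baseline kong.yml for such a test might look like the sketch below (a sketch only; the upstream address is a placeholder for whatever local echo service you run, chosen so that no DNS lookups or external hops are involved):

_format_version: "2.1"
services:
- name: baseline
  # placeholder: a local upstream on the same host, no DNS resolution involved
  url: http://127.0.0.1:8080
  routes:
  - name: baseline-v1
    paths:
    - /baseline/v1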

I am not trying to say that it is impossible that Kong or one of its plugins creates a performance issue in your environment, but to investigate that any further, we'll need to see better measurements.

Kind regards, Hans

hanshuebner commented 1 year ago

#9964 may be related.

shashanksapre commented 1 year ago

Hello @hanshuebner, thank you for your response.

We tried running Kong in an isolated environment.

We created two services that route requests to Kong's own Admin API: one using localhost, and another using the Kubernetes service name.

- name: admin-v1
  url: http://kong-kong-admin.default.svc.cluster.local:8001/
  routes:
  - name: admin-v1
    paths:
    - /admin/v1

- name: admin-v2
  url: http://localhost:8001/
  routes:
  - name: admin-v2
    paths:
    - /admin/v2

Comparing just the Kong part of the latency between the two versions for both services, we come to the same result: when using the Kubernetes service name, the Kong latency in 3.x is 10 times that of 2.8.x, while it is almost the same when using localhost. The same configuration was used across both versions.

We also have production data from the last 4 months, where we upgraded Kong Gateway from 2.8.x to 3.x and noticed the issue first-hand.

[screenshot: production latency graphs]

hanshuebner commented 1 year ago

Hi @shashanksapre,

thank you for providing more details. With the graphs, the problem seems clearer. Did you attempt to isolate the problem to either of the two plugins (rate-limiting and http-log) that you use?

Thank you, Hans

shashanksapre commented 1 year ago

From our observations, it has something to do with DNS resolution, since putting localhost or an IP in 3.x improves the results.

locao commented 1 year ago

Hi @shashanksapre,

Thanks for all these details. Did you compare the impact of DNS resolution between 2.8.x and 3.x? Does using IP addresses make 3.x performance similar to 2.8.x?

shashanksapre commented 1 year ago

> Hi @shashanksapre,
>
> Thanks for all these details. Did you compare the impact of DNS resolution between 2.8.x and 3.x? Does using IP addresses make 3.x performance similar to 2.8.x?

Yes, that's correct. When we use hostnames, regardless of whether it's an internal or external service, the Kong latency is higher in 3.x. When it's an IP, performance is similar in both 2.8.x and 3.x.

shashanksapre commented 1 year ago

Hello, can we please get an update?

hanshuebner commented 1 year ago

@shashanksapre We have noticed your issue report and are working on it. We cannot give you a firm date when we will have a solution. If you require well-defined response times for your issues, please check out the enterprise version of Kong Gateway https://konghq.com/products/api-gateway-platform

hanshuebner commented 1 year ago

@bungle dug out this commit https://github.com/Kong/kong/commit/3b721ac034378614f65ec2106211e6459c148896, which changed the caching defaults of Kong's DNS client. This first became part of the 3.0 release, so it might be related to the issue that you're seeing. @shashanksapre Would you be able to change the default as seen in that commit from true to false and measure whether that solves the issue for you?

Thanks, Hans

shashanksapre commented 1 year ago

@hanshuebner I am unable to see any way to set this using the Kong charts. Can you please tell us where we can set this?

hanshuebner commented 1 year ago

@shashanksapre It would require a source-level modification to Kong. If you're strictly running off release images, that won't be an option. We're still investigating and may eventually be able to reproduce the issue ourselves, though.

arniesaha commented 1 year ago

Facing a similar issue with 3.1 after upgrading from 2.8.3. Ran a couple of benchmarks to compare numbers across both:

1/ with latest Kong 3.1 - existing config - no restart - 4.39 error % - 45.6 rps - memory utilisation 1.6 GB+
2/ with latest Kong 3.1 - existing config - with restart - 8.69 error % - 51.4 rps - memory utilisation 1.1 GB
3/ with Kong 2.8.3 - 1G mem limit / CPU HPA - 2.38 error % - 46.2 rps - memory utilisation ~1 GB
4/ with Kong 2.8.3 - 2G mem limit / CPU & Mem HPA - 2.38 error % - 46.7 rps - memory utilisation ~1 GB
5/ with Kong 2.8.3 - restart with 4th case config - 1.79 error % - 45.1 rps - memory utilisation ~1 GB

All runs: 1k VUs for 180 seconds.

General stability of 2.8.3 is better than 3.1.

Kong 3.1 continues to grow in its memory utilisation and latency over a period of time, and requires a restart for stable behaviour, unlike 2.8.x.

[screenshots: memory utilisation and latency over time]

seh commented 1 year ago

Are you using any plugins that rely on the batchqueue library? If so, see #10103 and the PRs related to it.

arniesaha commented 1 year ago

> Are you using any plugins that rely on the batchqueue library? If so, see #10103 and the PRs related to it.

Our custom plugins don't. But we use the following community plugins:

https://docs.konghq.com/hub/kong-inc/ip-restriction/
https://docs.konghq.com/hub/kong-inc/response-transformer/
https://docs.konghq.com/hub/kong-inc/request-termination/
https://docs.konghq.com/hub/kong-inc/rate-limiting/
https://docs.konghq.com/hub/kong-inc/prometheus/
https://docs.konghq.com/hub/kong-inc/http-log/

^ Are you aware whether any of these might be using this library?

seh commented 1 year ago

http-log uses it.

arniesaha commented 1 year ago

> http-log uses it.

Thanks! Will profile without this plugin to confirm!

arniesaha commented 1 year ago

> http-log uses it.

Was able to reproduce, and saw better results without the http-log plugin on 3.1.

And overall fixed with 3.1.1!

[screenshot: benchmark results after the fix]

Thanks!

seh commented 1 year ago

> And overall fixed with 3.1.1!

Did you build version 3.1.1 yourself? I don't see a published release with that version number.

arniesaha commented 1 year ago

> And overall fixed with 3.1.1!
>
> Did you build version 3.1.1 yourself? I don't see a published release with that version number.

I built my own with this published version from 7 days back

ADD-SP commented 1 year ago

@shashanksapre I ran the benchmark with the same config. First I used the IP as the upstream, then I used the hostname as the upstream, but I don't see significant changes in RPS or latency.

So I recommend making sure the DNS server is stable, and using a tool like wrk to run the benchmark and get the results.
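For example, a hypothetical wrk invocation against the httpbin route from the config above (the proxy address is a placeholder for your actual proxy endpoint):

wrk -t4 -c100 -d60s --latency http://<kong-proxy-address>/httpbin/v1/get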

shashanksapre commented 1 year ago

> @shashanksapre I ran the benchmark with the same config. First I used the IP as the upstream, then I used the hostname as the upstream, but I don't see significant changes in RPS or latency.
>
> So I recommend making sure the DNS server is stable, and using a tool like wrk to run the benchmark and get the results.

Hi, the DNS server has been working fine. The whole setup is within a Kubernetes (Amazon EKS) cluster which has the latest patches. The only thing we changed is the Kong image (upgraded from 2.8.x to 3.x).

ghost commented 1 year ago

> Hi, the DNS server has been working fine. The whole setup is within a Kubernetes (Amazon EKS) cluster which has the latest patches. The only thing we changed is the Kong image (upgraded from 2.8.x to 3.x).

A quick way to verify this is to replace the FQDN with an IP and check whether you still experience high latency.

This will help us narrow down the scope of the problem.
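For example, a hypothetical variant of the admin service definition posted earlier, using the service's ClusterIP instead of its FQDN (the IP shown is made up; look up the real one with kubectl get svc):

- name: admin-v1
  # hypothetical ClusterIP of kong-kong-admin, replacing the FQDN
  url: http://10.100.0.10:8001/
  routes:
  - name: admin-v1
    paths:
    - /admin/v1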

shashanksapre commented 1 year ago

> Hi, the DNS server has been working fine. The whole setup is within a Kubernetes (Amazon EKS) cluster which has the latest patches. The only thing we changed is the Kong image (upgraded from 2.8.x to 3.x).
>
> A quick way to verify this is to replace the FQDN with an IP and check whether you still experience high latency.
>
> This will help us narrow down the scope of the problem.

I'm not sure if that's possible in an EKS cluster.

ghost commented 1 year ago

> I'm not sure if that's possible in an EKS cluster.

just try.

shashanksapre commented 1 year ago

> I'm not sure if that's possible in an EKS cluster.
>
> just try.

The reason we can't try it is that it's our production system.

ghost commented 1 year ago

I will try my best to reproduce your scenario and validate it again.

ghost commented 1 year ago

I deployed Kong using https://github.com/wjziv/kong-k8s-example/tree/main/basic-implementation, removed the rate-limiting plugin, and added the http-log plugin with the following configuration:

apiVersion: configuration.konghq.com/v1
kind: KongClusterPlugin
metadata:
  name: global-http-log
  annotations:
    kubernetes.io/ingress.class: kong
  labels:
    global: "true"
config: 
  http_endpoint: http://echo.default.svc.cluster.local:80
  method: POST
  timeout: 1000
  keepalive: 1000
  flush_timeout: 2
  retry_count: 15
plugin: http-log

This allows http-log to access the echo service through FQDN.

I tested both Kong 2.8 and Kong 3.1, and found that their performance is similar. In terms of latency analysis, Kong 3.1 has slightly better latency than Kong 2.8.

It still hasn't been reproduced on my side.

shashanksapre commented 1 year ago

Hello. We upgraded to version 3.2.x and that has fixed our problem.