Kong / kong

🦍 The Cloud-Native API Gateway and AI Gateway.
https://konghq.com/install/#kong-community
Apache License 2.0

Upstream response time mismatch. #13302

Open ksgnextuple opened 5 days ago

ksgnextuple commented 5 days ago

Discussed in https://github.com/Kong/kong/discussions/13300

Originally posted by **ksgnextuple** June 25, 2024

Hi Team,

We are running a few performance tests on one of our APIs, and we see that the upstream response time reported by the Kong Prometheus plugin shows a large difference compared to the actual upstream latency.

Kong: 3.6, KIC: 3.1

The same test was run using Kong 2.8 and KIC 2.5, and there the difference was minimal, which is the expected behavior. Has anyone faced such an issue?
chobits commented 5 days ago

> the upstream response time reported by the Kong Prometheus plugin shows a large difference compared to the actual upstream latency

Could you give us some examples?

For the actual upstream latency, could you show us how you monitored this actual latency value?

ksgnextuple commented 5 days ago

We have instrumented the application with the New Relic Java agent and are also monitoring it using Prometheus Micrometer metrics. In both places the latencies matched and were much lower than the upstream latency reported by Kong's Prometheus plugin. This was not the case with Kong 2.8, where the upstream latency reported by the Kong Prometheus plugin was very close to the latency observed in those places.

The difference we see with Kong 3.6 is around 100ms (sometimes more), whereas it was no more than 10ms with Kong 2.8.

hanshuebner commented 4 days ago

Can you provide a way to reproduce this issue?

ksgnextuple commented 4 days ago

We use a custom plugin (for auth purposes), and when we removed it we saw improvements: the Kong-reported upstream latencies became comparable to the actual ones. But this plugin was present in the previous versions of Kong too. Would a custom plugin affect the reporting of the upstream latency, though? I expected removing it to bring down the Kong latency, which it did, but I was not expecting the Kong-reported upstream latency to come down as well.

hanshuebner commented 4 days ago

It is not impossible that a bug in how the Prometheus plugin reports upstream latency was introduced between 2.8 and 3.6. To dig deeper, we'd need a way to reproduce. Can you share details on what your custom plugin does, i.e. what functions it implements and what is done in them?

ksgnextuple commented 4 days ago

The plugin intercepts every request made to Kong and then makes an HTTP POST request to a custom auth service that contains the authorization logic. The plugin script itself is based on this custom plugin -> https://github.com/pantsel/kong-middleman-plugin/tree/master

We have modified the above a bit and can share those scripts if required.

hanshuebner commented 4 days ago

It would be useful to have a way to reproduce the problem, so if you can supply your plugin as well as a minimal configuration for reproduction, it'd help.

ksgnextuple commented 4 days ago

middleman-plugin-cm.zip

Attached the ConfigMap as a zip.

Below is the sample plugin YAML:

apiVersion: configuration.konghq.com/v1
config:
  response: table
  timeout: 2
  url: http://<endpoint>/<some-api>
kind: KongPlugin
metadata:
  annotations:
    kubernetes.io/ingress.class: <className>
  name: middleman-plugin
plugin: middleman

hanshuebner commented 3 days ago

@ksgnextuple Please provide the plugin in a ZIP containing the individual files, not as a YAML file. Also, provide instructions to reproduce the problem.

hanshuebner commented 3 days ago

@ksgnextuple Don't bother with the plugin files. I was able to extract them from the YAML file and to set up a test environment. I could not reproduce the issue, however. With the kong:latest image, your plugin and the Prometheus plugin installed and the middleman http server responding after a 200ms delay, I see these metrics (elided):

# HELP kong_request_latency_ms Total latency incurred during requests for each service/route in Kong
# TYPE kong_request_latency_ms histogram
kong_request_latency_ms_bucket{service="example-service",route="example-route",workspace="default",le="250"} 24
# HELP kong_upstream_latency_ms Latency added by upstream response for each service/route in Kong
# TYPE kong_upstream_latency_ms histogram
kong_upstream_latency_ms_bucket{service="example-service",route="example-route",workspace="default",le="25"} 24

As you can see, the request latency for all 24 requests is below 250ms, and the upstream latency is below 25ms. Hence, the upstream response time metrics don't include the time that the middleman server takes to respond.

From our perspective, it does not look like there is a problem in Kong Gateway here. If you think otherwise, please provide a self-contained way to reproduce the problem.

hanshuebner commented 3 days ago

Here is my test environment in case you want to experiment with it yourself: middleman.zip

ksgnextuple commented 3 days ago

@hanshuebner Thanks for the inputs, will try this setup on our end and get back.

ksgnextuple commented 2 days ago

@hanshuebner I set up the test environment, but on a k8s cluster, as that's where we run our tests. First I ran a 15-minute test and used the Prometheus queries below to record the response times (p95):

histogram_quantile(0.95, sum by(le) (rate(kong_kong_latency_ms_bucket{kubernetes_namespace="kong", service="default.echo-service.80"}[$__interval])))

histogram_quantile(0.95, sum by(le) (rate(kong_request_latency_ms_bucket{kubernetes_namespace="kong", service="default.echo-service.80"}[$__interval])))

histogram_quantile(0.95, sum by(le) (rate(kong_upstream_latency_ms_bucket{kubernetes_namespace="kong", service="default.echo-service.80"}[$__interval])))

These were the results:

Without Middleman:

P95 Kong - 4.29ms, P95 Request - 17.9ms, P95 Upstream - 17.8ms

Then I ran another test with Middleman enabled; below are the results:

P95 Kong - 364ms, P95 Request - 485ms, P95 Upstream - 325ms

Pasting the k8s YAMLs for reproduction, if required:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echo-server
  template:
    metadata:
      labels:
        app: echo-server
    spec:
      containers:
        - name: echo-server
          image: test-middleman
          # command: [ "/bin/bash", "-c", "--" ]
          # args: [ "while true; do sleep 30; done;" ]
          resources:
            requests:
              cpu: 500m
              memory: 100Mi
            limits:
              cpu: 500m
              memory: 100Mi
          ports:
            - name: http-port
              containerPort: 80

---
apiVersion: v1
kind: Service
metadata:
  name: echo-service
spec:
  ports:
    - name: http-port
      port: 80
      targetPort: http-port
      protocol: TCP
  selector:
    app: echo-server
---

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    konghq.com/plugins: middleman-plugin
  name: echo-service
spec:
  rules:
    - host: "poc.example.com"
      http:
        paths:
          - path: /
            pathType: ImplementationSpecific
            backend:
              service:
                name: echo-service
                port:
                  number: 80

Do let me know if the plugin YAML is required for the setup on the k8s cluster.

hanshuebner commented 2 days ago

The latency numbers that you report don't tell me that the Prometheus plugin includes the upstream time, but rather that the plugin has a drastic negative effect on Kong Gateway's overall performance. The most likely reason is that it includes an HTTP client that does not interact well with OpenResty, causing proxy performance to be massively affected. I would recommend rewriting the plugin to use lua-resty-http. This will greatly simplify the code and likely solve your performance issues.
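
For illustration only, an access-phase handler based on lua-resty-http could look roughly like the sketch below. It reuses the url and timeout fields from the KongPlugin config you posted earlier and pools connections to the auth service; it is not your actual plugin, and the payload sent to the auth service is just a placeholder.

-- handler.lua: minimal sketch of a middleman-style auth call via lua-resty-http
local http  = require "resty.http"
local cjson = require "cjson.safe"

local MiddlemanHandler = {
  PRIORITY = 1000,  -- run in the access phase before the request is proxied
  VERSION  = "0.1.0",
}

function MiddlemanHandler:access(conf)
  local httpc = http.new()
  -- assuming conf.timeout is configured in seconds, as in the KongPlugin above
  httpc:set_timeout((conf.timeout or 2) * 1000)

  local res, err = httpc:request_uri(conf.url, {
    method  = "POST",
    headers = { ["Content-Type"] = "application/json" },
    -- placeholder payload; send whatever your auth service actually expects
    body    = cjson.encode({ headers = kong.request.get_headers() }),
    -- reuse connections to the auth service instead of opening a new one per request
    keepalive_timeout = 60000,  -- ms
    keepalive_pool    = 32,
  })

  if not res then
    kong.log.err("auth request failed: ", err)
    return kong.response.exit(500, { message = "auth service unreachable" })
  end

  if res.status ~= 200 then
    return kong.response.exit(401, { message = "unauthorized" })
  end
end

return MiddlemanHandler

Because lua-resty-http uses the cosocket API, the worker is not blocked while it waits for the auth service, which keeps the proxy path responsive under load.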

ksgnextuple commented 1 day ago

I was having a look at the response headers, which give the X-Kong-Upstream-Latency and X-Kong-Proxy-Latency. The X-Kong-Proxy-Latency is actually very consistent, staying below roughly 220ms, which is expected, but the upstream latency is > 500ms. This is with the middleman plugin. I also updated httpclient.lua to make use of lua-resty-http and still see similar results. I am modifying the plugin a bit by adding a keepalive pool and will check if that helps.

Will run a few tests with two separate instances, one serving the hello-world endpoint and one serving the middleman, to make sure I get cleaner results.

Update: Neither of the above changes helped.

hanshuebner commented 1 day ago

Please provide us with a complete, self-contained way to reproduce. We're not normally using Kubernetes for our Gateway testing, so we either need a reproduction setup that includes instructions for setting up K8s locally, or one that uses docker compose.

ksgnextuple commented 1 day ago

@hanshuebner Update from my end: it looks like it is a plugin issue. I updated the script to use cjson instead of the JSON script I had shared, and made use of the neturl library instead of the https://github.com/lunarmodules/luasocket/blob/master/src/url.lua module, and things seem to be good now.
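
For context, the gist of the change is roughly the sketch below; the hostnames and payload values are made up for illustration, and the actual updated code is in the plugin scripts I can share.

local cjson = require "cjson.safe"  -- JSON codec bundled with OpenResty/Kong
local url   = require "net.url"     -- the neturl library (module name "net.url")

-- encode/decode with cjson; the "safe" variant returns nil plus an error instead of raising
local payload      = cjson.encode({ user = "alice", scopes = { "read", "write" } })
local decoded, err = cjson.decode(payload)

-- parse the auth service URL with neturl instead of luasocket's url.lua
local u = url.parse("http://auth.internal:8080/check?strict=true")
-- u.scheme, u.host, u.port, u.path and u.query are now available as parsed fields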

But one small thing I observed is that the P95 of the Kong proxy latency shows up as higher than the P95 request latency. I can share the updated plugin script here.

We can make use of docker compose itself, but I am running a load test on the API - 10 TPS currently.

hanshuebner commented 1 day ago

If there is anything you want us to look at, please share reproduction steps and relevant scripts / configuration files. Thank you!

ksgnextuple commented 23 hours ago

Attaching the updated plugin scripts here. You can continue to use the same test env that was previously set up. All I want to check is whether the metrics reported by the Prometheus plugin are valid. I ran a 10 TPS test with the test environment that was shared above. To me it seems like the Kong proxy latency shown by the plugin is not very accurate.

plugin.zip

hanshuebner commented 8 hours ago

I ran your updated plugin to see whether I can reproduce the results. I ran wrk to generate some load and then queried the Prometheus metrics endpoint to determine the latency information:

cadet 1190_% docker compose exec -u 0 -it client bash
root@4186f90bf3fe:/# wrk -c64 -t64 http://kong:8000/
Running 10s test @ http://kong:8000/
  64 threads and 64 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   220.86ms   23.93ms 325.92ms   86.42%
    Req/Sec     4.02      0.66     5.00     59.86%
  2880 requests in 10.10s, 0.88MB read
Requests/sec:    285.17
Transfer/sec:     89.50KB
root@4186f90bf3fe:/# http kong:8100/metrics
HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Connection: keep-alive
Content-Type: text/plain; charset=UTF-8
Date: Mon, 01 Jul 2024 07:51:46 GMT
Server: kong/3.7.1
Transfer-Encoding: chunked
X-Kong-Admin-Latency: 6

# HELP kong_kong_latency_ms Latency added by Kong and enabled plugins for each service/route in Kong
# TYPE kong_kong_latency_ms histogram
kong_kong_latency_ms_bucket{service="example-service",route="example-route",workspace="default",le="200"} 2
kong_kong_latency_ms_bucket{service="example-service",route="example-route",workspace="default",le="500"} 2943
[...]
# HELP kong_request_latency_ms Total latency incurred during requests for each service/route in Kong
# TYPE kong_request_latency_ms histogram
kong_request_latency_ms_bucket{service="example-service",route="example-route",workspace="default",le="250"} 2627
kong_request_latency_ms_bucket{service="example-service",route="example-route",workspace="default",le="400"} 2943
[...]
# HELP kong_upstream_latency_ms Latency added by upstream response for each service/route in Kong
# TYPE kong_upstream_latency_ms histogram
kong_upstream_latency_ms_bucket{service="example-service",route="example-route",workspace="default",le="25"} 2643
kong_upstream_latency_ms_bucket{service="example-service",route="example-route",workspace="default",le="50"} 2936
kong_upstream_latency_ms_bucket{service="example-service",route="example-route",workspace="default",le="80"} 2943

My takeaways from this are:

- The kong_latency metric reports the time from when the connection was opened until after the load balancer has been invoked to determine the upstream to send the request to. It is unclear to me how useful this metric is in practice, as connections may be reused.

- The request_latency is taken from nginx's $request_time variable, which is defined as follows:

  > request processing time in seconds with a milliseconds resolution; time elapsed between the first bytes were read from the client and the log write after the last bytes were sent to the client

- The upstream_latency metric reports the time from when the upstream was selected and forwarding began to when Kong Gateway received the first bytes of the response.

- The accuracy of the latency measurements can be influenced by the load that is put onto Kong Gateway in some situations. In particular, the reported upstream response time includes the time that the underlying software (i.e. nginx) needs to react to data received on the socket and to schedule the request handling process.

Let us know if any questions remain, otherwise please close the issue.