ksgnextuple opened 5 days ago
Prometheus is reporting a large difference compared to the actual upstream latency.

Could you give us some examples? For the actual upstream latency, could you show us how you monitored this actual latency value?
We have instrumented the application with the New Relic Java agent and are also monitoring it using Prometheus Micrometer metrics. In both places the latencies matched and were much lower than the upstream latency reported by Kong's Prometheus plugin. This was not the case with Kong 2.8, where the upstream latency reported by the Kong Prometheus plugin was pretty close to the latency observed in the places above.
The difference we see with Kong 3.6 is around 100 ms (sometimes more), whereas it was no more than 10 ms with Kong 2.8.
Can you provide a way to reproduce this issue?
So we use a custom plugin (for auth purposes), and when we removed it we saw improvements: the upstream latencies reported by Kong were comparable to the actual ones. But this plugin was present in previous versions of Kong too. Do custom plugins affect the reporting of the upstream latency, though? I expected removing it to bring down the Kong latency, which it did, but I did not expect the Kong-reported upstream latency to come down as well.
It is not impossible that a bug in how the Prometheus plugin reports upstream latency was introduced between 2.8 and 3.6. To dig deeper, we'd need a way to reproduce. Can you share details on what your custom plugin does, i.e. which functions it implements and what is done in them?
So the plugin intercepts every request made to Kong and then makes an HTTP POST request to a custom auth service which contains the authorization logic. The plugin script itself is based on this custom plugin: https://github.com/pantsel/kong-middleman-plugin/tree/master
We have modified the above a bit and can share those scripts if required.
It would be useful to have a way to reproduce the problem, so if you can supply your plugin as well as a minimal configuration for reproduction, it'd help.
Attached the configMap as a zip.

Below is the sample plugin YAML:

```yaml
apiVersion: configuration.konghq.com/v1
config:
  response: table
  timeout: 2
  url: http://<endpoint>/<some-api>
kind: KongPlugin
metadata:
  annotations:
    kubernetes.io/ingress.class: <className>
  name: middleman-plugin
plugin: middleman
```
@ksgnextuple Please provide the plugin in a ZIP containing the individual files, not as a YAML file. Also, provide instructions to reproduce the problem.
@ksgnextuple Don't bother with the plugin files. I was able to extract them from the YAML file and to set up a test environment. I could not reproduce the issue, however. With the `kong:latest` image, your plugin and the Prometheus plugin installed, and the middleman HTTP server responding after a 200ms delay, I see these metrics (elided):
```
# HELP kong_request_latency_ms Total latency incurred during requests for each service/route in Kong
# TYPE kong_request_latency_ms histogram
kong_request_latency_ms_bucket{service="example-service",route="example-route",workspace="default",le="250"} 24
# HELP kong_upstream_latency_ms Latency added by upstream response for each service/route in Kong
# TYPE kong_upstream_latency_ms histogram
kong_upstream_latency_ms_bucket{service="example-service",route="example-route",workspace="default",le="25"} 24
```
As you can see, the request latency for all 24 requests is below 250ms, and the upstream latency is below 25ms. Hence, the upstream response time metrics don't include the time that the middleman server takes to respond.
From our perspective, it does not look like there is a problem in Kong Gateway here. If you think differently, please provide a self-contained way to reproduce the problem.
Here is my test environment in case you want to experiment with it yourself: middleman.zip
@hanshuebner Thanks for the inputs, will try this setup on our end and get back.
@hanshuebner I set up the test environment, but on a k8s cluster, as that's where we run our tests. First I ran a 15-minute test and used the Prometheus queries below to record the response times (p95):

```
histogram_quantile(0.95, sum by(le) (rate(kong_kong_latency_ms_bucket{kubernetes_namespace="kong", service="default.echo-service.80"}[$__interval])))
histogram_quantile(0.95, sum by(le) (rate(kong_request_latency_ms_bucket{kubernetes_namespace="kong", service="default.echo-service.80"}[$__interval])))
histogram_quantile(0.95, sum by(le) (rate(kong_upstream_latency_ms_bucket{kubernetes_namespace="kong", service="default.echo-service.80"}[$__interval])))
```
These were the results:
Without Middleman:

P95 Kong - 4.29ms, P95 Request - 17.9ms, P95 Upstream - 17.8ms

Then I ran another test with Middleman enabled; below are the results:

P95 Kong - 364ms, P95 Request - 485ms, P95 Upstream - 325ms
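As a side note, `histogram_quantile` estimates the p95 by linear interpolation inside the bucket that contains the target rank, so figures like those above are bucket-resolution estimates, not exact percentiles. A minimal sketch of that calculation (the bucket bounds and counts here are hypothetical, not from the test):

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative Prometheus-style buckets.

    buckets: sorted list of (upper_bound_ms, cumulative_count) pairs,
    mimicking the _bucket series with their le labels.
    """
    total = buckets[-1][1]
    rank = q * total  # target cumulative count
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:
                return bound
            # Linear interpolation within this bucket, as Prometheus does.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Hypothetical upstream-latency buckets: most of the mass sits below 50 ms.
buckets = [(25, 2643), (50, 2936), (80, 2943)]
print(round(histogram_quantile(0.95, buckets), 1))  # prints 38.0
```

The coarser the bucket layout, the more the reported p95 can drift from the true percentile, which is worth keeping in mind when comparing against header-based measurements.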
Pasting the k8s YAMLs for reproduction if required:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echo-server
  template:
    metadata:
      labels:
        app: echo-server
    spec:
      containers:
        - name: echo-server
          image: test-middleman
          # command: [ "/bin/bash", "-c", "--" ]
          # args: [ "while true; do sleep 30; done;" ]
          resources:
            requests:
              cpu: 500m
              memory: 100Mi
            limits:
              cpu: 500m
              memory: 100Mi
          ports:
            - name: http-port
              containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: echo-service
spec:
  ports:
    - name: http-port
      port: 80
      targetPort: http-port
      protocol: TCP
  selector:
    app: echo-server
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    konghq.com/plugins: middleman-plugin
  name: echo-service
spec:
  rules:
    - host: "poc.example.com"
      http:
        paths:
          - path: /
            pathType: ImplementationSpecific
            backend:
              service:
                name: echo-service
                port:
                  number: 80
```
Do let me know if the plugin yaml is required for the setup on k8s cluster.
The latency numbers that you report don't tell me that the Prometheus plugin includes the upstream time, but rather that the plugin has drastic negative effects on Kong Gateway's overall performance. The most likely reason is that it includes an HTTP client that does not interact well with OpenResty, massively affecting proxy performance. I would recommend rewriting the plugin to use lua-resty-http. This will greatly simplify the code and likely solve your performance issues.
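The effect described here can be illustrated outside of OpenResty. The sketch below uses Python's asyncio as a stand-in for Kong's cooperative event loop (an analogy, not Kong code): a blocking auth call stalls every in-flight request, while a non-blocking one lets them overlap, which is exactly how a blocking HTTP client inflates measured latencies under load.

```python
import asyncio
import time

def blocking_auth_call():
    # Stands in for a blocking HTTP client (e.g. luasocket): the whole
    # event loop stalls while this sleeps, so concurrent requests queue up.
    time.sleep(0.1)

async def nonblocking_auth_call():
    # Stands in for a cosocket-based client (e.g. lua-resty-http): the
    # loop keeps serving other requests while this one waits.
    await asyncio.sleep(0.1)

async def serve_requests(auth_call, n=5):
    """Run n concurrent requests; return total wall-clock time."""
    async def handle():
        result = auth_call()
        if asyncio.iscoroutine(result):
            await result
    start = time.monotonic()
    await asyncio.gather(*(handle() for _ in range(n)))
    return time.monotonic() - start

blocking = asyncio.run(serve_requests(blocking_auth_call))
cosocket = asyncio.run(serve_requests(nonblocking_auth_call))
print(f"blocking client:     {blocking:.2f}s for 5 concurrent requests")
print(f"non-blocking client: {cosocket:.2f}s for 5 concurrent requests")
```

With the blocking client the five requests serialize (roughly 0.5s total); with the non-blocking one they overlap (roughly 0.1s). The same serialization inside an nginx worker shows up as extra latency attributed to every stage of the request, including the upstream measurement.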
I was having a look at the response headers, which give X-Kong-Upstream-Latency and X-Kong-Proxy-Latency. The X-Kong-Proxy-Latency is actually very consistent, staying below around 220ms, which is expected, but the upstream latency is > 500ms. This is with the middleman plugin. I also updated httpclient.lua to make use of lua-resty-http, with similar results. I am modifying the plugin a bit by adding a keepalive pool and will check whether that helps.
Will run a few tests with 2 different instances serving the hello world endpoint and the middleman, to make sure I get cleaner results.
Update: Both the above changes didn't help.
Please provide us with a complete, self-contained way to reproduce. We're not normally using Kubernetes for our Gateway testing, so we either need a reproduction setup that includes instructions on how to set up K8s locally, or one that uses docker compose.
@hanshuebner Update from my end: it looks like it is a plugin issue. I updated the script to use cjson instead of the JSON script I had shared, and made use of the neturl library instead of the https://github.com/lunarmodules/luasocket/blob/master/src/url.lua module, and things seem to be good now.

But one small thing I observed is that the P95 of the Kong proxy latency shows up as more than the P95 request latency. I can share the updated plugin script here.
We can make use of docker compose itself, but I am currently running a load test on the API at 10 TPS.
If there is anything you want us to look at, please share reproduction steps and relevant scripts / configuration files. Thank you!
Attaching the updated plugin scripts here. You could continue to use the same test environment that was previously set up. All I want to check is whether the metrics reported by the Prometheus plugin are valid. I ran a 10 TPS test with the test environment that was shared above. To me it seems like the Kong proxy latency shown by the plugin is not too accurate.
I ran your updated plugin to see whether I can reproduce the results. I ran `wrk` to generate some load and then queried the Prometheus metrics endpoint to determine the latency information:
```
cadet 1190_% docker compose exec -u 0 -it client bash
root@4186f90bf3fe:/# wrk -c64 -t64 http://kong:8000/
Running 10s test @ http://kong:8000/
  64 threads and 64 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   220.86ms   23.93ms 325.92ms   86.42%
    Req/Sec     4.02      0.66     5.00     59.86%
  2880 requests in 10.10s, 0.88MB read
Requests/sec:    285.17
Transfer/sec:     89.50KB
root@4186f90bf3fe:/# http kong:8100/metrics
HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Connection: keep-alive
Content-Type: text/plain; charset=UTF-8
Date: Mon, 01 Jul 2024 07:51:46 GMT
Server: kong/3.7.1
Transfer-Encoding: chunked
X-Kong-Admin-Latency: 6

# HELP kong_kong_latency_ms Latency added by Kong and enabled plugins for each service/route in Kong
# TYPE kong_kong_latency_ms histogram
kong_kong_latency_ms_bucket{service="example-service",route="example-route",workspace="default",le="200"} 2
kong_kong_latency_ms_bucket{service="example-service",route="example-route",workspace="default",le="500"} 2943
[...]
# HELP kong_request_latency_ms Total latency incurred during requests for each service/route in Kong
# TYPE kong_request_latency_ms histogram
kong_request_latency_ms_bucket{service="example-service",route="example-route",workspace="default",le="250"} 2627
kong_request_latency_ms_bucket{service="example-service",route="example-route",workspace="default",le="400"} 2943
[...]
# HELP kong_upstream_latency_ms Latency added by upstream response for each service/route in Kong
# TYPE kong_upstream_latency_ms histogram
kong_upstream_latency_ms_bucket{service="example-service",route="example-route",workspace="default",le="25"} 2643
kong_upstream_latency_ms_bucket{service="example-service",route="example-route",workspace="default",le="50"} 2936
kong_upstream_latency_ms_bucket{service="example-service",route="example-route",workspace="default",le="80"} 2943
```
My takeaways from this are:

- The `kong_latency` metric reports the time from when the connection was opened until after the load balancer has been invoked to determine the upstream to send the request to. It is uncertain to me how useful this metric is in practice, as connections may be reused.
- The `request_latency` is taken from nginx's `$request_time` variable, which is defined as follows: "request processing time in seconds with a milliseconds resolution; time elapsed between the first bytes were read from the client and the log write after the last bytes were sent to the client"
- The `upstream_latency` metric reports the time from when the upstream was selected and forwarding began to when Kong Gateway received the first bytes of the response.
The accuracy of the latency measurements can be influenced by the load that is put onto Kong Gateway in some situations. In particular, the reported upstream response time includes the time that the underlying software (i.e. nginx) needs to react to data received on the socket and to schedule the request handling process.
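When sanity-checking histograms like the ones above, it can help to pull the cumulative bucket counts straight out of the /metrics text. This sketch (using the sample upstream-latency lines from the output above) computes what fraction of requests fell below a given bound:

```python
import re

# Sample lines from the /metrics output shown above.
METRICS = '''\
kong_upstream_latency_ms_bucket{service="example-service",route="example-route",workspace="default",le="25"} 2643
kong_upstream_latency_ms_bucket{service="example-service",route="example-route",workspace="default",le="50"} 2936
kong_upstream_latency_ms_bucket{service="example-service",route="example-route",workspace="default",le="80"} 2943
'''

def bucket_counts(text, metric):
    """Return {upper_bound_ms: cumulative_count} for one histogram metric."""
    pattern = re.compile(re.escape(metric) + r'_bucket\{[^}]*le="([^"]+)"\} (\d+)')
    return {float(le): int(count) for le, count in pattern.findall(text)}

buckets = bucket_counts(METRICS, "kong_upstream_latency_ms")
total = max(buckets.values())
print(f"{buckets[50.0] / total:.1%} of requests completed below 50 ms")  # 99.8%
```

Comparing such fractions against the X-Kong-Upstream-Latency response headers is a quick way to see whether the Prometheus histogram and the per-request headers tell the same story.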
Let us know if any questions remain, otherwise please close the issue.
Discussed in https://github.com/Kong/kong/discussions/13300