Kong / kong

🦍 The Cloud-Native API Gateway and AI Gateway.
https://konghq.com/install/
Apache License 2.0

Logs full of timer and DNS client errors #11104

Closed: gorban closed this issue 1 year ago

gorban commented 1 year ago

Is there an existing issue for this?

Kong version ($ kong version)

3.3.0

Current Behavior

I am running Kong in Kubernetes via Helm chart 2.23.0.

values.yaml overrides from the defaults:

image:
  repository: kong
  tag: "3.3.0-alpine"

The logs are full of errors like:

2023/06/21 19:29:50 [error] 1255#0: *38475 [lua] job.lua:284: execute(): [timer-ng] failed to run timer unix_timestamp=1687375790622.000000;counter=14187:meta=debug off: /usr/local/share/lua/5.1/kong/resty/dns/client.lua:775: attempt to call method 'destroy' (a nil value), context: ngx.timer
2023/06/22 13:42:42 [error] 1255#0: *251 [lua] handler.lua:87: handler(): [otel] failed to send request: [cosocket] DNS resolution failed: dns server error: 3 name error. Tried: ["(short)collection-sumologic-otelcol.sumologic.svc.cluster.local:(na) - cache-miss","collection-sumologic-otelcol.sumologic.svc.cluster.local.kong.svc.cluster.local:33 - cache-hit/dns server error: 3 name error","collection-sumologic-otelcol.sumologic.svc.cluster.local.svc.cluster.local:33 - cache-hit/dns server error: 3 name error","collection-sumologic-otelcol.sumologic.svc.cluster.local.cluster.local:33 - cache-hit/dns server error: 3 name error","collection-sumologic-otelcol.sumologic.svc.cluster.local:33 - cache-hit/dns server error: 3 name error","collection-sumologic-otelcol.sumologic.svc.cluster.local.kong.svc.cluster.local:1 - cache-hit/dns server error: 3 name error","collection-sumologic-otelcol.sumologic.svc.cluster.local.svc.cluster.local:1 - cache-hit/dns server error: 3 name error","collection-sumologic-otelcol.sumologic.svc.cluster.local.cluster.local:1 - cache-hit/dns server error: 3 name error","collection-sumologic-otelcol.sumologic.svc.cluster.local:1 - cache-hit/dns server error: 3 name error","collection-sumologic-otelcol.sumologic.svc.cluster.local.kong.svc.cluster.local:5 - cache-hit/dns server error: 3 name error","collection-sumologic-otelcol.sumologic.svc.cluster.local.svc.cluster.local:5 - cache-hit/dns server error: 3 name error","collection-sumologic-otelcol.sumologic.svc.cluster.local.cluster.local:5 - cache-hit/dns server error: 3 name error","collection-sumologic-otelcol.sumologic.svc.cluster.local:5 - cache-hit/dns server error: 3 name error"], context: ngx.timer
2023/06/22 13:42:42 [warn] 1255#0: *251 [kong] queue.lua:188 [opentelemetry plugin c3a1003d-17bc-5f0c-a428-70c51c053022] queue c3a1003d-17bc-5f0c-a428-70c51c053022: handler could not process entries: failed to send request: [cosocket] DNS resolution failed: dns server error: 3 name error. Tried: ["(short)collection-sumologic-otelcol.sumologic.svc.cluster.local:(na) - cache-miss","collection-sumologic-otelcol.sumologic.svc.cluster.local.kong.svc.cluster.local:33 - cache-hit/dns server error: 3 name error","collection-sumologic-otelcol.sumologic.svc.cluster.local.svc.cluster.local:33 - cache-hit/dns server error: 3 name error","collection-sumologic-otelcol.sumologic.svc.cluster.local.cluster.local:33 - cache-hit/dns server error: 3 name error","collection-sumologic-otelcol.sumologic.svc.cluster.local:33 - cache-hit/dns server error: 3 name error","collection-sumologic-otelcol.sumologic.svc.cluster.local.kong.svc.cluster.local:1 - cache-hit/dns server error: 3 name error","collection-sumologic-otelcol.sumologic.svc.cluster.local.svc.cluster.local:1 - cache-hit/dns server error: 3 name error","collection-sumologic-otelcol.sumologic.svc.cluster.local.cluster.local:1 - cache-hit/dns server error: 3 name error","collection-sumologic-otelcol.sumologic.svc.cluster.local:1 - cache-hit/dns server error: 3 name error","collection-sumologic-otelcol.sumologic.svc.cluster.local.kong.svc.cluster.local:5 - cache-hit/dns server error: 3 name error","collection-sumologic-otelcol.sumologic.svc.cluster.local.svc.cluster.local:5 - cache-hit/dns server error: 3 name error","collection-sumologic-otelcol.sumologic.svc.cluster.local.cluster.local:5 - cache-hit/dns server error: 3 name error","collection-sumologic-otelcol.sumologic.svc.cluster.local:5 - cache-hit/dns server error: 3 name error"], context: ngx.timer
127.0.0.1 - - [21/Jun/2023:19:29:51 +0000] "GET /status HTTP/2.0" 200 1236 "-" "Go-http-client/2.0"
2023/06/21 19:29:52 [error] 1254#0: *427 [lua] handler.lua:87: handler(): [otel] response error: 404, body: 404 page not found, context: ngx.timer
2023/06/21 19:29:52 [warn] 1254#0: *427 [kong] queue.lua:188 [opentelemetry plugin c3a1003d-17bc-5f0c-a428-70c51c053022] queue c3a1003d-17bc-5f0c-a428-70c51c053022: handler could not process entries: response error: 404, body: 404 page not found, context: ngx.timer

I don't know if it's relevant, but my configuration includes this Kubernetes custom resource:

apiVersion: configuration.konghq.com/v1
kind: KongClusterPlugin
metadata:
  name: opentelemetry
  annotations:
    kubernetes.io/ingress.class: kong
  labels:
    global: "true"
config:
  endpoint: http://collection-sumologic-otelcol.sumologic:4318/v1/traces
plugin: opentelemetry

I also have a kong.conf with values:

tracing_instrumentations = all
tracing_sampling_rate = 0.25
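The same two settings can also be injected as environment variables on the proxy container, since Kong maps KONG_-prefixed variables onto kong.conf keys; the equivalent of the lines above would be:

KONG_TRACING_INSTRUMENTATIONS=all
KONG_TRACING_SAMPLING_RATE=0.25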

If I kubectl exec into an sh shell in the Kong "proxy" container of a pod, I can use wget to POST an OpenTelemetry trace JSON request to the same address as my opentelemetry config endpoint, and I can confirm the OTEL collector processed it successfully. Yet Kong still gets a 404 at that same address. The OTEL collector is a cluster-local Kubernetes service named collection-sumologic-otelcol in another namespace, sumologic. I have also tried the full DNS name collection-sumologic-otelcol.sumologic.svc.cluster.local, only to see the same behavior.
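For reference, that manual test looked roughly like the following (trace.json is a hypothetical OTLP/JSON payload, and the exact wget flags depend on the wget variant shipped in the image):

# inside the proxy container, e.g. via: kubectl -n kong exec -it <kong-pod> -c proxy -- sh
wget -qO- --header='Content-Type: application/json' \
  --post-file=trace.json \
  http://collection-sumologic-otelcol.sumologic:4318/v1/traces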

Expected Behavior

The logs should not be filled with timer / DNS client errors.

I suspect three possible issues.

First, I suspect that the Kong code itself should not call r:destroy() at the end of this code block. It explicitly handles not r (including r being nil), but then appears to use the same r it just determined was bad:

https://github.com/Kong/kong/blob/master/kong/resty/dns/client.lua#LL743C18-L743C38

local r, err = resolver:new(config)
if not r then
  ...
else
  ...
end

...
r:destroy()

Second, resolver:new(config) should not be returning a nil result.

Third, if I can manually send a trace to an OpenTelemetry collector at an identical endpoint via curl from the Kong pod's proxy container itself, then Kong should be able to send to that endpoint successfully too. In case the problem is that my test sends JSON to the OTEL HTTP endpoint while Kong sends Protobuf, I also tested a .NET 6 application in the same namespace as Kong, instrumented with OpenTelemetry; it is able to send HTTP/protobuf to the same domain, the same port 4318, and the same path.

Steps To Reproduce

  1. Make a copy of the latest Helm chart values.yaml and modify it to use the Kong image 3.3.0-alpine
  2. Use kubectl to create a new kong namespace
  3. Helm install Kong with the latest Helm chart into the kong namespace, using the copied and modified values.yaml
  4. Also Helm install an OpenTelemetry collector into another namespace, e.g. Jaeger all-in-one as suggested by Kong's OpenTelemetry plugin setup instructions
  5. kubectl create/apply a custom resource to enable the opentelemetry plugin globally (i.e. the KongClusterPlugin following Kong's OpenTelemetry plugin setup instructions)
  6. kubectl patch the Kong deployment so the proxy container enables OpenTelemetry tracing, for example via an environment variable ({"name": "KONG_TRACING_INSTRUMENTATIONS", "value": "all" }) or by volume-mounting a kong.conf file at /etc/kong/kong.conf (with tracing_instrumentations = all); see the sketch after this list
  7. Observe the logs and observe the OpenTelemetry traces (which aren't working)
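A rough sketch of step 6 as a one-off patch (deployment and container names come from my install and may differ; the JSON patch assumes the proxy container is the first in the list and already has an env array):

kubectl -n kong patch deployment kong-kong --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/env/-", "value": {"name": "KONG_TRACING_INSTRUMENTATIONS", "value": "all"}}]'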

Anything else?

No response

gorban commented 1 year ago

I looked into rebuilding Kong from source to remove the errors about destroying a nil DNS resolver. I added a check before trying to destroy it, but it still gives the same error message on the new destroy line. Code:

if r then
  r:destroy()
end

Error (notice that the destroy call moved down one line, from 775 to 776, due to my code change):

2023/06/23 03:05:28 [error] 1257#0: *28 [lua] job.lua:284: execute(): [timer-ng] failed to run timer unix_timestamp=1687489528335.000000;counter=1066:meta=debug off: /usr/local/share/lua/5.1/kong/resty/dns/client.lua:776: attempt to call method 'destroy' (a nil value), context: ngx.timer

I see this code comment that could be related to this behavior; I will start looking there:

--resolver:destroy is patched in build phase, more information can be found in
-- build/openresty/patches/lua-resty-dns-0.22_01-destory_resolver.patch
gorban commented 1 year ago

That patch adds a function _M:destroy().

It looks like the patches aren't being applied when I build Kong myself, so the destroy method simply doesn't exist on my build, which would explain why my guard didn't help. I'm still getting the 404 on the exact same OpenTelemetry endpoint I confirmed separately, though:

2023/06/23 13:01:44 [error] 1254#0: *537 [lua] handler.lua:87: handler(): [otel] response error: 404, body: 404 page not found , context: ngx.timer

The following command finds the effect of the patch on kong:3.3.0-alpine, but not on my local build. I need to figure out how to debug the patching during the build process... We need ARM support, so we have to build Kong ourselves since 3.1.1 was the last version with an ARM .apk, not to mention that we build it ourselves to ensure no extra developer tooling is installed on the images.

grep -e 'destroy' /usr/local/openresty/lualib/resty/dns/resolver.lua
gorban commented 1 year ago

The official Kong example gives port/path: :4318/v1/traces

I added tracing of the OpenTelemetry plugin's calls to the configured collector.

I don't see any traffic to port 4318, but instead to port 14268. That port is for Thrift over HTTP, which expects the path /api/traces, so that explains why Kong is getting a 404 Not Found. If I change my config to :4318/api/traces, I get a 400 Bad Request with the response body:

Unsupported content type: application/x-protobuf

I would still like to know what part of the kong-build-tools is responsible for applying patches like lua-resty-dns-0.22_01-destroy_resolver.patch so local builds don't have those DNS errors. FYI, I can reproduce the incorrect OpenTelemetry port on the official Kong image too.
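For anyone checking the same thing, one quick way to list which ports the collector Service actually exposes (service and namespace names here match my setup; adjust as needed):

kubectl -n sumologic get svc collection-sumologic-otelcol \
  -o jsonpath='{range .spec.ports[*]}{.name}={.port}{"\n"}{end}'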

ChayanBansal commented 1 year ago

@gorban I don't know if you were able to resolve this issue or not, but I had been facing the same issue with Bitnami's Kong, which I was using as the base image. I tried a bunch of things, only to realize later that the latest patches are missing!

A not-so-ideal solution that worked for me is downloading the latest resolver.lua file from the lua-resty-dns repo and replacing the existing one with it:

RUN curl -L -o /opt/bitnami/kong/openresty/lualib/resty/dns/resolver.lua https://raw.githubusercontent.com/openresty/lua-resty-dns/master/lib/resty/dns/resolver.lua

For you the resolver.lua file path may be different - you can find it using find / -name resolver.lua
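A quick sanity check that the replacement took effect, mirroring the grep from earlier in the thread (the path comes from the RUN line above; adjust it for your image):

grep -n destroy /opt/bitnami/kong/openresty/lualib/resty/dns/resolver.lua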

Just an FYI for other people who might come to this thread: this destroy method is extremely important even though your requests might be going through successfully, because it releases the resources associated with the DNS resolver. Not having this function will lead to memory leaks!

gorban commented 1 year ago

@ChayanBansal, no. We pivoted: instead of trying to get Kong 3.3.0-alpine working, we found alternatives to OpenTelemetry. It wasn't working on the 3.1.1-alpine image we used to use either, possibly because of this issue from the 3.3.0.0 changelog:

Fixed an issue where reconfiguring the plugin didn’t take effect.

Kong 3.1.1-alpine has ARM + AMD64 support, so it doesn't require a custom build.

StarlightIbuki commented 1 year ago

@gorban @ChayanBansal We have stopped packaging Alpine images as of 3.4, so it's recommended to migrate to the other variants. There was a packaging issue with 3.0; updating to a newer version should fix this.

github-actions[bot] commented 1 year ago

This issue is marked as stale because it has been open for 14 days with no activity.

github-actions[bot] commented 1 year ago

Dear contributor,

We are automatically closing this issue because it has not seen any activity for three weeks. We're sorry that your issue could not be resolved. If any new information comes up that could help resolve it, please feel free to reopen it.

Your contribution is greatly appreciated!

Please have a look at our pledge to the community for more information.

Sincerely, Your Kong Gateway team