Kong / kong

🦍 The Cloud-Native API Gateway and AI Gateway.
https://konghq.com/install/#kong-community
Apache License 2.0

Sporadic failure of DNS resolution - kong 0.11.1 in OpenShift #3072

Closed abustya closed 6 years ago

abustya commented 6 years ago

Summary

Opening a new issue after this one has been closed: https://github.com/Kong/kong/issues/2524

I am running kong in an OpenShift cluster, and I am still encountering random DNS resolution errors with version 0.11.1.

Steps To Reproduce

The error occurs when calling either the Admin API (e.g. '/apis') or one of the proxied APIs.

Additional Details & Logs

The error occurs for about 2% of calls. When I run many calls in quick succession, the errors mostly occur in batches: for about 0.5-1 second all the calls fail, and then everything works again.

subnetmarco commented 6 years ago

Pinging @Tieske

Tieske commented 6 years ago

@abustya "dns server error: 3 name error" means the server did send an answer, but the answer was either empty or didn't contain the requested name. "dns server error: 2 server failure" indicates that your DNS server ran into an error.
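For reference, the numbers in those messages are the standard DNS response codes (RCODE) from RFC 1035. The following is a minimal sketch that maps them to a short description; the `explain` helper and its regular expression are hypothetical and only assume the "dns server error: <code> ..." wording quoted in this thread.

```python
import re

# Response codes (RCODE) as defined in RFC 1035, section 4.1.1.
RCODES = {
    0: "NOERROR: the query succeeded",
    1: "FORMERR: the server could not interpret the query",
    2: "SERVFAIL: the server ran into a problem while processing the query",
    3: "NXDOMAIN: the queried name does not exist on this server",
    4: "NOTIMP: the server does not support this kind of query",
    5: "REFUSED: the server refused the query for policy reasons",
}

# Hypothetical pattern: assumes the "dns server error: <code> ..." wording
# that appears in the log lines quoted in this thread.
_PATTERN = re.compile(r"dns server error: (\d+)")


def explain(log_line):
    """Return a description of the RCODE in a log line, or None if absent."""
    match = _PATTERN.search(log_line)
    if match is None:
        return None
    code = int(match.group(1))
    return RCODES.get(code, "unknown RCODE %d" % code)


print(explain("name resolution failed for 'exchange-api': dns server error: 3 name error"))
print(explain("dns server error: 2 server failure"))
```

An NXDOMAIN (3) answer points at a server that was reachable but does not know the name, while SERVFAIL (2) points at a failure inside the server itself.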

From the logs it appears as if this only happens when looking up the postgres database.

  1. Do you see other occurrences in the logs?
  2. What is the DNS part of your Kong config file?
  3. Which environment variables are you using on top of that, if any?
  4. Can you share the resolv.conf file you have on your system?
  5. You have 2 nameservers (10.1.0.5 and 168.63.129.16). Does the error persist if you disable either one of those?

In all honesty, this looks like a problem with your nameservers.

abustya commented 6 years ago

  1. Yes, the last line in the log: name resolution failed for 'exchange-api': dns server error: 3 name error. This is one of the backend APIs configured to be proxied.

  2. None of the dns_* properties are customized. (Actually, I don't even have a kong.conf file, only the kong.conf.default, untouched.)

  3. Env vars:

    KONG_DATABASE=postgres
    KONG_LUA_SSL_TRUSTED_CERTIFICATE=/etc/pki/tls/certs/ca-bundle.crt
    KONG_LUA_SSL_VERIFY_DEPTH=3
    KONG_PG_HOST=kong-database
    KONG_PG_PASSWORD=kong
    KONG_PG_USER=kong

  4. resolv.conf contents:

    search ci.svc.cluster.local svc.cluster.local cluster.local ua5hp3m0b0butcqmu5iwql5ykd.ax.internal.cloudapp.net
    nameserver 10.1.0.5
    nameserver 168.63.129.16
    options ndots:5

  5. The first nameserver is responsible for resolving domains inside the cluster, the second for domains outside it. If I only leave the one for inside resolution, the error no longer occurs. If I only leave the one for outside, the error occurs constantly.

Leaving only the first nameserver actually seems like a viable workaround at the moment, though I think I will later also need to add APIs from outside the cluster.
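To illustrate what the search list and "options ndots:5" in that resolv.conf do to a short name such as kong-database, here is a minimal sketch of the conventional stub-resolver expansion described in resolv.conf(5). It is an assumption that Kong's own DNS client applies the same ordering; `candidate_names` is a hypothetical helper, not part of Kong.

```python
# Search list copied from the resolv.conf contents above.
SEARCH = [
    "ci.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "ua5hp3m0b0butcqmu5iwql5ykd.ax.internal.cloudapp.net",
]


def candidate_names(name, search, ndots):
    """Return the names a conventional stub resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: no search list is applied.
        return [name]
    expanded = [name + "." + domain for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [name] + expanded
    # Fewer dots than ndots: try the search list first, the bare name last.
    return expanded + [name]


for candidate in candidate_names("kong-database", SEARCH, ndots=5):
    print(candidate)
```

With ndots:5 the first candidate is kong-database.ci.svc.cluster.local and the bare name comes last, so a single lookup can turn into up to five queries, each of which only the in-cluster nameserver can answer for cluster names.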

Tieske commented 6 years ago

You should reconfigure your DNS; this setup will never work.

Actually, the error appears because of a bug in the DNS resolver. Without that bug it might have worked, but only because of the retries being done. That would merely have masked the bad configuration, and you'd have had very high DNS resolution latency.

The DNS client randomly picks a DNS server to resolve names (to spread the load), so whenever it picks your "outside" server, it will obviously fail to resolve internal names, because that server does not know them.
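The effect of that random selection can be sketched with a toy model. This is not Kong's resolver code: `lookup` and `failure_rate` are hypothetical, the set of known names is taken from this thread, and the model has no caching or retries, so its rate is not meant to match the roughly 2% reported above.

```python
import random

# Names from this thread that only the in-cluster nameserver knows.
INTERNAL_NAMES = {"kong-database", "exchange-api"}

INSIDE = "10.1.0.5"        # in-cluster nameserver from resolv.conf
OUTSIDE = "168.63.129.16"  # external nameserver from resolv.conf


def lookup(name, server):
    """Toy model: only the in-cluster server can resolve cluster names."""
    return server == INSIDE and name in INTERNAL_NAMES


def failure_rate(name, servers, attempts, seed=0):
    """Pick a server at random per lookup and return the fraction that fail."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    failures = sum(not lookup(name, rng.choice(servers)) for _ in range(attempts))
    return failures / attempts


print(failure_rate("kong-database", [INSIDE, OUTSIDE], 10000))
print(failure_rate("kong-database", [INSIDE], 10000))
print(failure_rate("kong-database", [OUTSIDE], 10000))
```

With both servers configured, about half of the uncached lookups in this model fail. With only the inside server none fail, and with only the outside server all of them fail, which matches what was observed when either nameserver was removed.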

You should always use the internal server, and configure that server to forward lookups to your external server (in a chained fashion).
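The chained setup can be pictured as follows. This is only a conceptual sketch: the `Resolver` class is hypothetical, the addresses are made up for illustration, and a real deployment would configure forwarding in the DNS server itself instead of in application code.

```python
class Resolver:
    """Toy nameserver: answers from its own records, else asks its upstream."""

    def __init__(self, records, upstream=None):
        self.records = records
        self.upstream = upstream

    def resolve(self, name):
        if name in self.records:
            return self.records[name]
        if self.upstream is not None:
            # Chained lookup: forward anything unknown to the next server.
            return self.upstream.resolve(name)
        raise LookupError("name error: " + name)


# Made-up records for illustration only.
external = Resolver({"konghq.com": "203.0.113.10"})
internal = Resolver({"kong-database": "10.1.0.20"}, upstream=external)

# Clients talk to the internal server only; it covers both kinds of names.
print(internal.resolve("kong-database"))
print(internal.resolve("konghq.com"))
```

Because every client query goes to the internal server, it no longer matters that the external server has never heard of cluster names.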

Closing this now. If you think this is not resolved, then please feel free to reopen.