Kong / kong

🦍 The Cloud-Native API Gateway and AI Gateway.
https://konghq.com/install/#kong-community
Apache License 2.0
39.01k stars 4.79k forks source link

KIC fails to start. All pods down: nginx [emerg] 1#0: bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)" #13730

Open joran-fonjallaz opened 1 day ago

joran-fonjallaz commented 1 day ago

Is there an existing issue for this?

Kong version ($ kong version)

3.8.0

Current Behavior

hello, We run kong KIC on GKE clusters: every night the preemptible nodes are reclaimed in our staging envs. And most of the time, it takes down all kong gateway pods (2 replicas) for hours.

versions

Additional info

It seems that the liveness probe is responding ok, while the readiness probe remains unhealthy, leading to the gateway pods to just remain around, not able to process traffic.

Error logs

ERROR 2024/10/05 00:06:53 [emerg] 1#0: bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
ERROR nginx: [emerg] bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
[repeats over and over, yet the pod is not killed]

The controller fails to talk to the gateways with

ERROR 2024-10-07T03:37:26.870415241Z [resource.labels.containerName: ingress-controller] Error: could not retrieve Kong admin root(s): making HTTP request: Get "https://10.163.37.7:8444/": dial tcp 10.163.37.7:8444: connect: connection refused

Kong finds itself in some sort of "deadlock" until the pods are deleted manually. Any insights ?

Below is the values.yaml file configuring kong

ingress:
  deployment:
    test:
      enabled: false

  controller:
    enabled: true

    proxy:
      nameOverride: "{{ .Release.Name }}-gateway-proxy"

    postgresql:
      enabled: false

    env:
      database: "off"

    deployment:
      kong:
        enabled: false

    ingressController:
      enabled: true

      image:
        repository: kong/kubernetes-ingress-controller
        tag: "3.3.1"
        pullPolicy: IfNotPresent

      resources:
        requests:
          cpu: 100m
          memory: 256Mi
        limits:
          memory: 1G

      ingressClass: kong-green

      env:
        log_format: json
        log_level: error
        ingress_class: kong-green
        gateway_api_controller_name: konghq.com/kong-green

      gatewayDiscovery:
        enabled: true
        generateAdminApiService: true

    podAnnotations:
      sidecar.istio.io/inject: "false"

  gateway:
    enabled: true

    deployment:
      kong:
        enabled: true

    image:
      repository: kong
      tag: "3.8.0"
      pullPolicy: IfNotPresent

    resources:
      requests:
        cpu: 250m
        memory: 500Mi
      limits:
        memory: 2G

    replicaCount: 6

    podAnnotations:
      sidecar.istio.io/inject: "false"

    proxy:
      enabled: true
      type: ClusterIP
      annotations:
        konghq.com/protocol: "https"
        cloud.google.com/neg: '{"exposed_ports": {"80":{"name": "neg-kong-green"}}}'
      http:
        enabled: true
        servicePort: 80
        containerPort: 8000
        parameters: []
      tls:
        enabled: true
        servicePort: 443
        containerPort: 8443
        parameters:
          - http2
        appProtocol: ""

    ingressController:
      enabled: false

    postgresql:
      enabled: false

    env:
      role: traditional
      database: "off"
      proxy_access_log: "off"
#      proxy_error_log: "off"
      proxy_stream_access_log: "off"
#      proxy_stream_error_log: "off"
      admin_access_log: "off"
#      admin_error_log: "off"
      status_access_log: "off"
#      status_error_log: "off"
      log_level: warn
      headers: "off"
      request_debug: "off"

Expected Behavior

kong gateway pods, either

Steps To Reproduce

I could reproduce the error by killing the nodes (kubectl delete nodes) on which the kong pods were running. After killing the nodes, KIC fails to restart as it enters the deadlock situation described above. See screenshot:

image

Anything else?

dump of a failing gateway pod: kubectl describe:

k -n kong-dbless describe po kong-green-gateway-68f467ff98-qztm5
Name:             kong-green-gateway-68f467ff98-qztm5
Namespace:        kong-dbless
Priority:         0
Service Account:  kong-green-gateway
Node:             ---
Start Time:       Mon, 07 Oct 2024 13:49:02 +0200
Labels:           app=kong-green-gateway
                  app.kubernetes.io/component=app
                  app.kubernetes.io/instance=kong-green
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=gateway
                  app.kubernetes.io/version=3.6
                  helm.sh/chart=gateway-2.41.1
                  pod-template-hash=68f467ff98
                  version=3.6
Annotations:      cni.projectcalico.org/containerID: 13864002653403e75b1ddb3ef661b5665f69e3b97c266b5833042f8dc4a4f39b
                  cni.projectcalico.org/podIP: 10.163.33.135/32
                  cni.projectcalico.org/podIPs: 10.163.33.135/32
                  kuma.io/gateway: enabled
                  kuma.io/service-account-token-volume: kong-green-gateway-token
                  sidecar.istio.io/inject: false
                  traffic.sidecar.istio.io/includeInboundPorts:
Status:           Running
IP:               10.163.33.135
IPs:
  IP:           10.163.33.135
Controlled By:  ReplicaSet/kong-green-gateway-68f467ff98
Init Containers:
  clear-stale-pid:
    Container ID:    containerd://ed0b35719cd87e11e849b42f20f1f328b1e2d63612d004b313ba981eda0bd790
    Image:           kong:3.8.0
    Image ID:        docker.io/library/kong@sha256:616b2ab5a4c7b6c14022e8a1495ff34930ced76f25f3d418e76758717fec335f
    Port:            <none>
    Host Port:       <none>
    SeccompProfile:  RuntimeDefault
    Command:
      rm
      -vrf
      $KONG_PREFIX/pids
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 07 Oct 2024 13:49:20 +0200
      Finished:     Mon, 07 Oct 2024 13:49:21 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      memory:  2G
    Requests:
      cpu:     250m
      memory:  500Mi
    Environment:
      KONG_ADMIN_ACCESS_LOG:         /dev/stdout
      KONG_ADMIN_ERROR_LOG:          /dev/stderr
      KONG_ADMIN_GUI_ACCESS_LOG:     /dev/stdout
      KONG_ADMIN_GUI_ERROR_LOG:      /dev/stderr
      KONG_ADMIN_LISTEN:             0.0.0.0:8444 http2 ssl, [::]:8444 http2 ssl
      KONG_CLUSTER_LISTEN:           off
      KONG_DATABASE:                 off
      KONG_LUA_PACKAGE_PATH:         /opt/?.lua;/opt/?/init.lua;;
      KONG_NGINX_WORKER_PROCESSES:   2
      KONG_PLUGINS:                  ---
      KONG_PORTAL_API_ACCESS_LOG:    /dev/stdout
      KONG_PORTAL_API_ERROR_LOG:     /dev/stderr
      KONG_PORT_MAPS:                80:8000, 443:8443
      KONG_PREFIX:                   /kong_prefix/
      KONG_PROXY_ACCESS_LOG:         /dev/stdout
      KONG_PROXY_ERROR_LOG:          /dev/stderr
      KONG_PROXY_LISTEN:             0.0.0.0:8000, [::]:8000, 0.0.0.0:8443 http2 ssl, [::]:8443 http2 ssl
      KONG_PROXY_STREAM_ACCESS_LOG:  /dev/stdout basic
      KONG_PROXY_STREAM_ERROR_LOG:   /dev/stderr
      KONG_ROLE:                     traditional
      KONG_ROUTER_FLAVOR:            traditional
      KONG_STATUS_ACCESS_LOG:        off
      KONG_STATUS_ERROR_LOG:         /dev/stderr
      KONG_STATUS_LISTEN:            0.0.0.0:8100, [::]:8100
      KONG_STREAM_LISTEN:            off
    Mounts:
      /kong_prefix/ from kong-green-gateway-prefix-dir (rw)
      /opt/kong/plugins/---
      /opt/kong/plugins/---
      /opt/kong/plugins/---
      /opt/kong/plugins/---
      /opt/kong/plugins/---
      /tmp from kong-green-gateway-tmp (rw)
Containers:
  proxy:
    Container ID:    containerd://0ed944478d25423c08c85146ed1528ae668d128f13bddaf6402990701e2ea3a1
    Image:           kong:3.8.0
    Image ID:        docker.io/library/kong@sha256:616b2ab5a4c7b6c14022e8a1495ff34930ced76f25f3d418e76758717fec335f
    Ports:           8444/TCP, 8000/TCP, 8443/TCP, 8100/TCP
    Host Ports:      0/TCP, 0/TCP, 0/TCP, 0/TCP
    SeccompProfile:  RuntimeDefault
    State:           Waiting
      Reason:        CrashLoopBackOff
    Last State:      Terminated
      Reason:        Error
      Exit Code:     1
      Started:       Mon, 07 Oct 2024 13:59:39 +0200
      Finished:      Mon, 07 Oct 2024 13:59:49 +0200
    Ready:           False
    Restart Count:   7
    Limits:
      memory:  2G
    Requests:
      cpu:      250m
      memory:   500Mi
    Liveness:   http-get http://:status/status delay=5s timeout=5s period=10s #success=1 #failure=3
    Readiness:  http-get http://:status/status/ready delay=5s timeout=5s period=10s #success=1 #failure=3
    Environment:
      KONG_ADMIN_ACCESS_LOG:         /dev/stdout
      KONG_ADMIN_ERROR_LOG:          /dev/stderr
      KONG_ADMIN_GUI_ACCESS_LOG:     /dev/stdout
      KONG_ADMIN_GUI_ERROR_LOG:      /dev/stderr
      KONG_ADMIN_LISTEN:             0.0.0.0:8444 http2 ssl, [::]:8444 http2 ssl
      KONG_CLUSTER_LISTEN:           off
      KONG_DATABASE:                 off
      KONG_LUA_PACKAGE_PATH:         /opt/?.lua;/opt/?/init.lua;;
      KONG_NGINX_WORKER_PROCESSES:   2
      KONG_PLUGINS:                 ---
      KONG_PORTAL_API_ACCESS_LOG:    /dev/stdout
      KONG_PORTAL_API_ERROR_LOG:     /dev/stderr
      KONG_PORT_MAPS:                80:8000, 443:8443
      KONG_PREFIX:                   /kong_prefix/
      KONG_PROXY_ACCESS_LOG:         /dev/stdout
      KONG_PROXY_ERROR_LOG:          /dev/stderr
      KONG_PROXY_LISTEN:             0.0.0.0:8000, [::]:8000, 0.0.0.0:8443 http2 ssl, [::]:8443 http2 ssl
      KONG_PROXY_STREAM_ACCESS_LOG:  /dev/stdout basic
      KONG_PROXY_STREAM_ERROR_LOG:   /dev/stderr
      KONG_ROLE:                     traditional
      KONG_ROUTER_FLAVOR:            traditional
      KONG_STATUS_ACCESS_LOG:        off
      KONG_STATUS_ERROR_LOG:         /dev/stderr
      KONG_STATUS_LISTEN:            0.0.0.0:8100, [::]:8100
      KONG_STREAM_LISTEN:            off
      KONG_NGINX_DAEMON:             off
    Mounts:
      /kong_prefix/ from kong-green-gateway-prefix-dir (rw)
      /opt/kong/plugins/---
      /opt/kong/plugins/---
      /opt/kong/plugins/---
      /opt/kong/plugins/---
      /opt/kong/plugins/---
      /tmp from kong-green-gateway-tmp (rw)
Readiness Gates:
  Type                                       Status
  cloud.google.com/load-balancer-neg-ready   True
Conditions:
  Type                                       Status
  cloud.google.com/load-balancer-neg-ready   True
  PodReadyToStartContainers                  True
  Initialized                                True
  Ready                                      False
  ContainersReady                            False
  PodScheduled                               True
Volumes:
  kong-green-gateway-prefix-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  256Mi
  kong-green-gateway-tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  1Gi
  kong-green-gateway-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                             Age                    From                                   Message
  ----     ------                             ----                   ----                                   -------
  Normal   Killing                            11m                    kubelet                                Container proxy failed liveness probe, will be restarted
  Warning  FailedPreStopHook                  10m                    kubelet                                PreStopHook failed
  Normal   Pulled                             10m (x2 over 11m)      kubelet                                Container image "kong:3.8.0" already present on machine
  Normal   Created                            10m (x2 over 11m)      kubelet                                Created container proxy
  Normal   Started                            10m (x2 over 11m)      kubelet                                Started container proxy
  Warning  Unhealthy                          10m (x4 over 11m)      kubelet                                Liveness probe failed: Get "http://10.163.33.135:8100/status": dial tcp 10.163.33.135:8100: connect: connection refused
  Warning  Unhealthy                          10m (x9 over 11m)      kubelet                                Readiness probe failed: Get "http://10.163.33.135:8100/status/ready": dial tcp 10.163.33.135:8100: connect: connection refused
  Warning  BackOff                            114s (x26 over 7m19s)  kubelet                                Back-off restarting failed container proxy in pod kong-green-gateway-68f467ff98-qztm5_kong-dbless(ab152a94-7ef0-4de0-b84c-1eb419327b88)

and logs

2024/10/07 12:05:10 [warn] 1#0: the "user" directive makes sense only if the master process runs with super-user privileges, ignored in /kong_prefix/nginx.conf:7
nginx: [warn] the "user" directive makes sense only if the master process runs with super-user privileges, ignored in /kong_prefix/nginx.conf:7
2024/10/07 12:05:14 [notice] 1#0: [lua] init.lua:791: init(): [request-debug] token for request debugging: ccbb05a0-6e76-4cb7-9e5d-346690a3c69f
2024/10/07 12:05:14 [emerg] 1#0: bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
nginx: [emerg] bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
2024/10/07 12:05:14 [notice] 1#0: try again to bind() after 500ms
2024/10/07 12:05:14 [emerg] 1#0: bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
nginx: [emerg] bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
2024/10/07 12:05:14 [notice] 1#0: try again to bind() after 500ms
2024/10/07 12:05:14 [emerg] 1#0: bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
nginx: [emerg] bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
2024/10/07 12:05:14 [notice] 1#0: try again to bind() after 500ms
2024/10/07 12:05:14 [emerg] 1#0: bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
nginx: [emerg] bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
2024/10/07 12:05:14 [notice] 1#0: try again to bind() after 500ms
2024/10/07 12:05:14 [emerg] 1#0: bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
nginx: [emerg] bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
2024/10/07 12:05:14 [notice] 1#0: try again to bind() after 500ms
2024/10/07 12:05:14 [emerg] 1#0: still could not bind()
nginx: [emerg] still could not bind()
joran-fonjallaz commented 1 day ago

FYI: the "deadlock" is removed by restarting the pods manually

kubectl -n kong-dbless delete pods --selector=app.kubernetes.io/instance=kong-green

then kong KIC pods (controller and gateways) restart normally.

StarlightIbuki commented 16 hours ago

It seems your issue has been resolved. Feel free to reopen if you have any further concern.

joran-fonjallaz commented 16 hours ago

thanks @StarlightIbuki for taking this issue. However, your answer doesn't help much. Could you please point me to the resolution ? How has this issue been solved ? And what's the fix ?

Thank you in advance

StarlightIbuki commented 14 hours ago

thanks @StarlightIbuki for taking this issue. However, your answer doesn't help much. Could you please point me to the resolution ? How has this issue been solved ? And what's the fix ?

Thank you in advance

Sorry I thought you had found the solution. @randmonkey Could you also take a look into this?