
Running kong ingress controller in kubernetes with high availability - wrong node goes down, KONG goes DOWN #822

Closed: rb-leadr closed this issue 1 year ago

rb-leadr commented 1 year ago

Hello - We're using kong with annotated kubernetes ingress and service resources driving the config.

The problem we're facing is that when we do a node roll, Kong goes down while the PostgreSQL pod is unavailable, taking our APIs down.

Having 3 kong pods for HA and still having a single point of failure seems like an anti-pattern.

I looked into DB-less, but it looks like that requires a kong.yml config, which can't be managed in independent namespaces the way Ingress and Service resources can.

Is there a way to either use ingress/service resources without postgres, or to make the postgres truly HA?

rainest commented 1 year ago

You can use controller-managed configuration (configuration built from Ingress resources) with DB-less; it's even the default. Specifying a kong.yml directly is an alternative.
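
For reference, a minimal values sketch for DB-less mode with controller-managed configuration could look something like this (an illustration based on the chart's documented options, not a drop-in for any particular environment):

# Values sketch: DB-less proxy fed by the ingress controller.
# The controller watches Ingress/Service/KongPlugin resources in any namespace
# and pushes the resulting declarative config to the proxy, so no kong.yml file
# and no Postgres are needed.
ingressController:
  enabled: true
env:
  database: "off"     # run the proxy without a database
postgresql:
  enabled: false      # drop the bundled PostgreSQL sub-chart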

That said, Kong should not go down when Postgres is unavailable. You cannot start new Kong instances while Postgres is offline, but existing instances should continue serving traffic on a best-effort basis out of their configuration cache and reconnect to Postgres when it returns.

You can specify a read-only backup Postgres instance via the various pg_ro_ settings. Setting up a read replica should give you some more flexibility when rolling out an updated Postgres deployment.
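
The pg_ro_* options are ordinary kong.conf settings, so in this chart they go under env. A rough sketch, with placeholder hostnames rather than values from any real deployment:

# Values sketch: send Kong's read queries to a replica via the pg_ro_* settings.
# Writes (admin API changes, migrations) still go to pg_host; reads use the
# read-only connection once it is configured.
env:
  pg_host: kong-postgresql-primary.kong.svc.cluster.local   # placeholder primary
  pg_ro_host: kong-postgresql-read.kong.svc.cluster.local   # placeholder read replica
  pg_ro_user: kong
  pg_ro_database: kong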

If you're seeing that existing instances terminate when Postgres is unavailable, that's abnormal, and you should file an issue with the gateway team with replication steps and logs.

I'll go ahead and close this since I think all of the OP's questions are answered, but if you have additional follow-up questions, reply and we can reopen it.

rb-leadr commented 1 year ago

Hi @rainest - thanks for jumping in to help!

I've been off working on some other more pressing issues, and am now returning to this.

One factor that might play in: We use FluxCD, which automatically updates our k8s ingress resources as they change in our main branch in source control. I wonder if this is why the Kong pods in Test B start crashlooping.

I did a bit of testing to get more data - here's the output of that:

Test A (control - delete a non-pg node)

Step 1: Determine which node kong postgres and other pods are running on, and terminate a node other than postgres from the AWS console (not kubectl)
Step 2: Run kubectl get pods -n kong -w
Step 3: Run kubectl get nodes -w
Step 4: Periodically run a curl command against a healthcheck endpoint behind Kong

Test A Results

Sometimes slightly slow responses, but generally works fine.
Overall test result: PASS

Test B (terminate the postgres node from AWS)

Step 1: Determine which node the kong postgres pod is running on, and terminate that node from the AWS console (not kubectl)
Step 2: Run kubectl get pods -n kong -w
Step 3: Run kubectl get nodes -w
Step 4: Periodically run a curl command against a healthcheck endpoint behind Kong

Test B Results

  1. Postgres goes down
  2. one by one the kong pods begin crashlooping
  3. Once all are crashlooping, curl command begins failing
  4. Eventually, new node comes online
  5. Postgres gets assigned to that node, and starts up
  6. Pods stop crashlooping

But between results 3 and 5, our APIs are completely inaccessible.
Overall test result: FAIL

Test C (terminate node from kubectl)

Step 1: Determine which node kong postgres is running on, and run kubectl delete node on it
Step 2: Run kubectl get pods -n kong -w
Step 3: Run kubectl get nodes -w
Step 4: Periodically run a curl command against a healthcheck endpoint behind Kong

Test C Results

Overall test result: PASS

Test D (terminate pg pod)

Step 1: Run kubectl delete pod kong-postgresql-0
Step 2: Run kubectl get pods -n kong -w
Step 3: Run kubectl get nodes -w
Step 4: Periodically run a curl command against a healthcheck endpoint behind Kong

Test D Results

Postgres shut down gracefully and restarted fairly quickly, resulting in no disruption.
Overall test result: PASS




Overall, the only concerning failure scenario is when one of our EKS nodes fails in EC2 and the control plane is unaware that it happened, so it keeps trying to reconnect before finally giving up - it's during this limbo period that the pods crashloop.

This would seem to be an exceedingly rare case, but something I am slightly concerned about due to the severity of the outcome when it does happen.

I think I do need to make a couple of adjustments to our config just as a better practice:

  1. Consider using only 2 AZs instead of 3 to decrease the likelihood of not being able to spin up postgres on a node in the same AZ as the PersistentVolume
  2. Consider increasing capacity slightly by adding a node to the zone where the PV is located
  3. Split up the proxy and ingress controller containers into separate pods. (I'm not sure this would really help the specific issue at hand, but having them bundled as currently configured is a k8s antipattern.)
  4. Consider moving to DB-less configuration
  5. Consider switching to a different Postgres DB cluster controlled outside of Kubernetes, since we already have one in RDS that is used by our app (see the sketch below)
  6. Consider adding a second read replica to Postgres that would reside on a different node in the same AZ - not sure how to configure that

If you have any other suggestions given the symptoms and the below config, please let me know! I'd love to hear them. Thanks in advance!
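
For item 5, roughly what I have in mind is something like the following (the RDS endpoint and secret name are placeholders, and I haven't validated this):

# Values sketch for item 5: disable the bundled PostgreSQL sub-chart and point
# Kong at an externally managed database (e.g. RDS) instead.
postgresql:
  enabled: false
env:
  pg_host: kong-db.abc123.us-west-2.rds.amazonaws.com   # placeholder RDS endpoint
  pg_port: 5432
  pg_database: kong
  pg_user: kong
  pg_password:
    valueFrom:
      secretKeyRef:
        name: kong-rds-credentials                      # placeholder secret
        key: password
  pg_ssl: "on"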

Ryan

rb-leadr commented 1 year ago

@rainest Here's our helmrelease definition:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: kong
  namespace: kong
spec:
  chart:
    spec:
      version: "2.15.3"
      chart: kong
      sourceRef:
        kind: HelmRepository
        name: kong
        namespace: kong
  interval: 1h0m0s
  install:
    remediation:
      retries: 3
  upgrade:
    remediation:
      retries: 3
      strategy: "rollback"

  values:
    # Kong for Kubernetes with Kong Enterprise with Enterprise features enabled and
    # exposed via TLS-enabled Ingresses. Before installing:
    # * Several settings (search for the string "CHANGEME") require user-provided
    #   Secrets. These Secrets must be created before installation.
    # * Ingresses reference example "<service>.kong.CHANGEME.example" hostnames. These must
    #   be changed to an actual hostname that resolve to your proxy.
    # * Ensure that your session configurations create cookies that are usable
    #   across your services. The admin session configuration must create cookies
    #   that are sent to both the admin API and Kong Manager, and any Dev Portal
    #   instances with authentication must create cookies that are sent to both
    #   the Portal and Portal API.
    fullnameOverride: kong
    admin:
      annotations:
        konghq.com/protocol: https
      enabled: true
      http:
        enabled: false
      ingress:
        annotations:
          konghq.com/https-redirect-status-code: "301"
          konghq.com/protocols: https
          konghq.com/strip-path: "true"
          kubernetes.io/ingress.class: kong
          nginx.ingress.kubernetes.io/app-root: /
          nginx.ingress.kubernetes.io/backend-protocol: HTTPS
          nginx.ingress.kubernetes.io/permanent-redirect-code: "301"
        enabled: true
        hostname: kong.gateway.env.company.service
        path: /api
        tls: kong-admin-cert
      tls:
        containerPort: 8444
        enabled: true
        parameters:
        - http2
        servicePort: 8444
      type: ClusterIP
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/instance
                operator: In
                values:
                - dataplane
            topologyKey: kubernetes.io/hostname
          weight: 100
    certificates:
      enabled: true
      issuer: kong-selfsigned-issuer
      cluster:
        enabled: true
      admin:
        enabled: true
        commonName: kong.gateway.env.company.service
      portal:
        enabled: true
        commonName: developer.gateway.env.company.service
      proxy:
        enabled: true
        commonName: gateway.env.company.dev
        dnsNames:
        - '*.gateway.env.company.dev'
        - 'env.company.dev'
        - '*.env.company.dev'
    cluster:
      enabled: true
      labels:
        konghq.com/service: cluster
      tls:
        containerPort: 8005
        enabled: true
        servicePort: 8005
      type: ClusterIP
    clustertelemetry:
      enabled: true
      tls:
        containerPort: 8006
        enabled: true
        servicePort: 8006
        type: ClusterIP
    deployment:
      kong:
        daemonset: false
        enabled: true
    enterprise:
      enabled: true
      license_secret: kong-enterprise-license
      portal:
        enabled: true
      rbac:
        admin_api_auth: basic-auth
        admin_gui_auth_conf_secret: kong-config-secret
        enabled: true
        session_conf_secret: kong-config-secret
      smtp:
        enabled: false
      vitals:
        enabled: true
    env:
      admin_access_log: /dev/stdout
      admin_api_uri: https://kong.gateway.env.company.service/api
      admin_error_log: /dev/stdout
      admin_gui_access_log: /dev/stdout
      admin_gui_error_log: /dev/stdout
      admin_gui_host: kong.gateway.env.company.service
      admin_gui_protocol: https
      admin_gui_url: https://kong.gateway.env.company.service/
      cluster_data_plane_purge_delay: 60
      cluster_listen: 0.0.0.0:8005
      cluster_telemetry_listen: 0.0.0.0:8006
      database: postgres
      log_level: debug
      lua_package_path: /opt/?.lua;;
      nginx_worker_processes: "2"
      password:
        valueFrom:
          secretKeyRef:
            key: kong_admin_password
            name: kong-config-secret
      pg_database: kong
      pg_host:
        valueFrom:
          secretKeyRef:
            key: pg_host
            name: kong-config-secret
      pg_ssl: "off"
      pg_ssl_verify: "off"
      pg_user: kong
      plugins: bundled,openid-connect
      portal: true
      portal_api_access_log: /dev/stdout
      portal_api_error_log: /dev/stdout
      portal_api_url: https://developer.gateway.env.company.service/api
      portal_auth: basic-auth
      portal_cors_origins: '*'
      portal_gui_access_log: /dev/stdout
      portal_gui_error_log: /dev/stdout
      portal_gui_host: developer.gateway.env.company.service
      portal_gui_protocol: https
      portal_gui_url: https://developer.gateway.env.company.service/
      portal_session_conf:
        valueFrom:
          secretKeyRef:
            key: portal_session_conf
            name: kong-config-secret
      prefix: /kong_prefix/
      proxy_access_log: /dev/stdout
      proxy_error_log: /dev/stdout
      proxy_stream_access_log: /dev/stdout
      proxy_stream_error_log: /dev/stdout
      smtp_mock: "on"
      status_listen: 0.0.0.0:8100
      trusted_ips: 0.0.0.0/0,::/0
      vitals: true
    extraLabels:
      konghq.com/component: kong
    image:
      repository: kong/kong-gateway
      tag: "3.0"
    ingressController:
      enabled: true
      env:
        kong_admin_filter_tag: ingress_controller_kong
        kong_admin_tls_skip_verify: true
        kong_admin_token:
          valueFrom:
            secretKeyRef:
              key: password
              name: kong-config-secret
        kong_admin_url: https://localhost:8444
        kong_workspace: default
        publish_service: kong/kong-proxy
      image:
        repository: docker.io/kong/kubernetes-ingress-controller
        tag: "2.7"
      ingressClass: kong
      installCRDs: false
    manager:
      annotations:
        konghq.com/protocol: https
      enabled: true
      http:
        containerPort: 8002
        enabled: false
        servicePort: 8002
      ingress:
        annotations:
          konghq.com/https-redirect-status-code: "301"
          kubernetes.io/ingress.class: kong
          nginx.ingress.kubernetes.io/backend-protocol: HTTPS
        enabled: true
        hostname: kong.gateway.env.company.service
        path: /
        tls: kong-admin-cert
      tls:
        containerPort: 8445
        enabled: true
        parameters:
        - http2
        servicePort: 8445
      type: ClusterIP
    migrations:
      enabled: true
      postUpgrade: true
      preUpgrade: true
    namespace: kong
    podAnnotations:
      kuma.io/gateway: enabled
      prometheus.io/port: "8100"
      prometheus.io/scrape: "true"
    portal:
      annotations:
        konghq.com/protocol: https
      enabled: true
      http:
        containerPort: 8003
        enabled: false
        servicePort: 8003
      ingress:
        annotations:
          konghq.com/https-redirect-status-code: "301"
          konghq.com/protocols: https
          konghq.com/strip-path: "false"
          kubernetes.io/ingress.class: kong
        enabled: true
        hostname: developer.gateway.env.company.service
        path: /
        tls: kong-portal-cert
      tls:
        containerPort: 8446
        enabled: true
        parameters:
        - http2
        servicePort: 8446
      type: ClusterIP
    portalapi:
      annotations:
        konghq.com/protocol: https
      enabled: true
      http:
        enabled: false
      ingress:
        annotations:
          konghq.com/https-redirect-status-code: "301"
          konghq.com/protocols: https
          konghq.com/strip-path: "true"
          kubernetes.io/ingress.class: kong
          nginx.ingress.kubernetes.io/app-root: /
        enabled: true
        hostname: developer.gateway.env.company.service
        path: /api
        tls: kong-portal-cert
      tls:
        containerPort: 8447
        enabled: true
        parameters:
        - http2
        servicePort: 8447
      type: ClusterIP
    postgresql:
      enabled: true
      auth:
        database: kong
        username: kong
    proxy:
      annotations:
        prometheus.io/port: "8100"
        prometheus.io/scrape: "true"
        service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "https"
        service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "arn:aws:acm:us-west-2:<awsAccount>:certificate/<cert-guid>"
        service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443"
        service.beta.kubernetes.io/aws-load-balancer-type: "alb"
      enabled: true
      http:
        containerPort: 8080
        enabled: false
      ingress:
        enabled: false
      labels:
        enable-metrics: true
      tls:
        containerPort: 8080
        enabled: true
      type: LoadBalancer
    serviceMonitor:
      enabled: true
      additionalLabels:
        app.kubernetes.io/part-of: kube-prometheus-stack
        interval: 10s
        namespace: monitoring
    replicaCount: 3
    secretVolumes: []
    status:
      enabled: true
      http:
        containerPort: 8100
        enabled: true
      tls:
        containerPort: 8543
        enabled: false
    updateStrategy:
      rollingUpdate:
        maxSurge: 100%
        maxUnavailable: 100%
      type: RollingUpdate