Kong / kong

🦍 The Cloud-Native API Gateway and AI Gateway.
https://konghq.com/install/#kong-community
Apache License 2.0

Kong updating service downstream API does not fully update across all worker processes #7025

Closed · jeremyjpj0916 closed this issue 1 year ago

jeremyjpj0916 commented 3 years ago

Summary

Customer updated their service resource from:

https://pure-prod-pure-origin-dc1-core-prod.origin-dc1-core.company.com/claims-enrollments/v2

to

https://pure-prod-pure-origin-dc2-core-prod.origin-dc2-core.company.com/claims-enrollments/v2

Yet traffic through the Kong proxy would still, at a low rate, route to the IP of https://pure-prod-pure-origin-dc1-core-prod.origin-dc1-core.company.com/claims-enrollments/v2. It seems 1 of the 4 pods still did the incorrect routing, and within that bad pod only 2 of the 6 worker processes exhibited the behavior of not picking up the update, even an extended amount of time later (minutes to hours).

Admin API calls seemed to return the correct backend URL on every call as well. So is it something to do with the cache, or with the C* cluster_events table that helps distribute updates? Something to do with the mechanism Kong uses to keep worker processes in sync? Seen this behavior a few times, FWIW.
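The symptom described above (Admin API correct, but a subset of workers routing stale) is consistent with each worker holding its own copy of the entity and independently applying invalidation events. This is only a toy model of that pattern, not Kong's actual implementation; the class and URLs are made up for illustration:

```python
class Worker:
    """Toy model: one nginx worker holding its own cached copy of a service URL."""

    def __init__(self, name, url):
        self.name = name
        self.cached_url = url

    def poll(self, events, last_seen):
        # Each worker independently polls the shared event log and applies
        # any invalidations it has not yet seen.
        for i, (_, new_url) in enumerate(events):
            if i >= last_seen[self.name]:
                self.cached_url = new_url
                last_seen[self.name] = i + 1


OLD = "https://dc1.example.com/claims-enrollments/v2"
NEW = "https://dc2.example.com/claims-enrollments/v2"

workers = [Worker(f"w{i}", OLD) for i in range(6)]
last_seen = {w.name: 0 for w in workers}
events = [("service-update", NEW)]

# Workers 0-3 process the event; workers 4 and 5 never run their poll cycle,
# mirroring "2 of the 6 worker processes" serving the old datacenter.
for w in workers[:4]:
    w.poll(events, last_seen)

stale = [w.name for w in workers if w.cached_url == OLD]
print(stale)
```

If any worker misses (or never receives) its invalidation, the shared source of truth stays correct while that one worker keeps proxying to the old target indefinitely, which matches the behavior reported.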

Steps To Reproduce

Unsure for now; our environments are just Kong nodes handling a lot of traffic as well as a lot of OAuth2 token traffic (one speculation I had is that this clogs up the cluster_events table). I think a simple sandbox environment would likely never reproduce it; it requires gateways with active churn in the number of resources and heavy utilization. Hoping a move to DB-less may fix this issue long term too.

Additional Details & Logs

ENV Variables:

  - env:
        - name: KONG_UPSTREAM_KEEPALIVE_IDLE_TIMEOUT
          value: '30'
        - name: KONG_UPSTREAM_KEEPALIVE_MAX_REQUESTS
          value: '50000'
        - name: KONG_UPSTREAM_KEEPALIVE_POOL_SIZE
          value: '400'
        - name: KONG_WORKER_CONSISTENCY
          value: eventual
        - name: KONG_WORKER_STATE_UPDATE_FREQUENCY
          value: '5'
        - name: KONG_CASSANDRA_DATA_CENTERS
          value: 'DC1:3,DC2:3'
        - name: KONG_CASSANDRA_REPL_STRATEGY
          value: NetworkTopologyStrategy
        - name: KONG_CASSANDRA_SCHEMA_CONSENSUS_TIMEOUT
          value: '30000'
        - name: KONG_HEADERS
          value: latency_tokens
        - name: KONG_DNS_ORDER
          value: 'LAST,SRV,A,CNAME'
        - name: KONG_CASSANDRA_CONTACT_POINTS
          value: 'x,x,x,x,x,x'
        - name: KONG_LOG_LEVEL
          value: notice
        - name: KONG_PROXY_ACCESS_LOG
          value: 'off'
        - name: KONG_ADMIN_ACCESS_LOG
          value: 'off'
        - name: KONG_PROXY_ERROR_LOG
          value: /dev/stderr
        - name: KONG_ADMIN_ERROR_LOG
          value: /dev/stderr
        - name: KONG_ANONYMOUS_REPORTS
          value: 'off'
        - name: KONG_ADMIN_LISTEN
          value: '0.0.0.0:8001 deferred reuseport'
        - name: KONG_NGINX_MAIN_WORKER_PROCESSES
          value: '6'
        - name: KONG_NGINX_PROXY_REAL_IP_HEADER
          value: X-Forwarded-For
        - name: KONG_NGINX_PROXY_REAL_IP_RECURSIVE
          value: 'on'
        - name: KONG_MEM_CACHE_SIZE
          value: 1024m
        - name: KONG_SSL_CERT
          value: /usr/local/kong/ssl/kongcert.crt
        - name: KONG_SSL_CERT_KEY
          value: /usr/local/kong/ssl/kongprivatekey.key
        - name: KONG_SSL_CERT_DER
          value: /usr/local/kong/ssl/kongcertder.der
        - name: KONG_CLIENT_SSL
          value: 'off'
        - name: KONG_CLIENT_MAX_BODY_SIZE
          value: 50m
        - name: KONG_PLUGINS
          value: >-
            response-transformer,kong-siteminder-auth,stargate-waf-error-log,kong-kafka-log,mtls,kong-tx-debugger,kong-plugin-oauth,zipkin,kong-error-log,kong-oidc-implicit-token,kong-response-size-limiting,request-transformer,kong-service-virtualization,kong-cluster-drain,kong-upstream-jwt,kong-splunk-log,kong-spec-expose,kong-path-based-routing,kong-oidc-multi-idp,correlation-id,oauth2,statsd,jwt,rate-limiting,acl,request-size-limiting,request-termination,cors
        - name: KONG_PROXY_LISTEN
          value: '0.0.0.0:8000, 0.0.0.0:8443 ssl http2 deferred reuseport'
        - name: KONG_SSL_CIPHER_SUITE
          value: intermediate
        - name: KONG_CLIENT_BODY_BUFFER_SIZE
          value: 50m
        - name: KONG_ERROR_DEFAULT_TYPE
          value: text/plain
        - name: KONG_DATABASE
          value: cassandra
        - name: KONG_PG_SSL
          value: 'off'
        - name: KONG_CASSANDRA_PORT
          value: '9042'
        - name: KONG_CASSANDRA_KEYSPACE
          value: kong_prod
        - name: KONG_CASSANDRA_TIMEOUT
          value: '8000'
        - name: KONG_CASSANDRA_SSL
          value: 'on'
        - name: KONG_CASSANDRA_SSL_VERIFY
          value: 'on'
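Worth noting from the env vars above: KONG_WORKER_CONSISTENCY is set to eventual with KONG_WORKER_STATE_UPDATE_FREQUENCY of 5, so a config write becomes visible to Admin API reads immediately but the proxy path only catches up on the next rebuild tick. A minimal sketch of that split (toy class and URLs, not Kong internals):

```python
import threading


class EventualRouter:
    """Toy model of worker_consistency = eventual: Admin API writes land in
    pending state immediately; the proxy path only sees them when a periodic
    rebuild tick (every worker_state_update_frequency seconds) applies them."""

    def __init__(self, url):
        self.current_url = url   # what the proxy path actually routes to
        self.pending_url = url   # latest value written via the Admin API
        self.lock = threading.Lock()

    def admin_update(self, url):
        with self.lock:
            self.pending_url = url  # Admin API reads reflect this right away

    def rebuild_tick(self):
        with self.lock:
            self.current_url = self.pending_url  # proxy path catches up


router = EventualRouter("https://dc1.example.com")
router.admin_update("https://dc2.example.com")
stale_view = router.current_url   # still the old URL: rebuild has not run yet
router.rebuild_tick()
fresh_view = router.current_url   # now the new URL
```

In the toy model a missed tick just delays convergence by one interval; the reported behavior would correspond to a worker whose rebuild effectively never fires for that entity.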
javierguerragiraldez commented 3 years ago

There was a bug in processing upstream events when the event queue grew above a certain size. It should be fixed in 2.4.1.

jeremyjpj0916 commented 3 years ago

Even if the service isn't using an upstream resource? No Kong as LB, just route -> service -> HTTP endpoint, where the endpoint configured in the service changes.
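The distinction being drawn here is that Kong only applies its own load-balancer state when the service's host names an upstream entity; otherwise the host is resolved directly via DNS. A toy sketch of that dispatch (names assumed, not Kong's real resolution code):

```python
# No upstream entities are defined, matching the setup in this report:
# the service host is a plain DNS name, so Kong's balancer is not involved.
upstreams = {}


def resolve(service_host):
    """Toy dispatch: an upstream entity takes priority; else plain DNS."""
    if service_host in upstreams:
        return ("kong-balancer", upstreams[service_host])
    return ("dns", service_host)


kind, target = resolve(
    "pure-prod-pure-origin-dc2-core-prod.origin-dc2-core.company.com"
)
print(kind)  # which resolution path the toy model chose
```

So in this report the stale routing cannot be explained by balancer/upstream event handling alone; the cached service entity (or its DNS result) in those two workers must itself have been stale.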

flrgh commented 1 year ago

:wave: hey @jeremyjpj0916, sorry we let this one go so long without a reply.

Since 2.1.4 there have been many bug fixes and stability improvements in the various mechanisms (DNS resolution, event propagation, load balancing, etc.) that could be involved here, so I wouldn't be surprised if A) this was indeed a bug and B) it has been remedied by now. Can you let us know if this is behavior you're still seeing in practice?

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.