Closed: jeremyjpj0916 closed this issue 1 year ago.
There was a bug with processing upstream events when the queue was above a certain size; it should be fixed in 2.4.1.
Even if the service isn't using an upstream resource? No Kong load balancing involved, just route -> service -> HTTP endpoint, where the endpoint configured on the service is what changes.
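For reference, a minimal sketch of the kind of configuration described above, driving the Admin API from Python; the admin address, names, and URLs are all hypothetical, and no upstream/target entities are created, so Kong does no load balancing of its own:

```python
# Minimal sketch, assuming a local Admin API; all names and URLs are hypothetical.
import requests

ADMIN_API = "http://localhost:8001"

# Service pointing straight at an HTTP endpoint (no upstream entity, so no
# Kong-side load balancing); this URL is the piece that gets updated later.
requests.put(
    f"{ADMIN_API}/services/claims-enrollments",
    json={"url": "https://origin-dc1.example.com/claims-enrollments/v2"},
).raise_for_status()

# Route attached to that service.
requests.put(
    f"{ADMIN_API}/services/claims-enrollments/routes/claims-enrollments-route",
    json={"paths": ["/claims-enrollments/v2"]},
).raise_for_status()
```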
:wave: hey @jeremyjpj0916, sorry we let this one go so long without a reply.
Since 2.1.4 there have been many bug fixes and stability improvements in the various mechanisms (DNS resolution, event propagation, load balancing, etc.) that could be involved with this, so I wouldn't be surprised if A) this was indeed a bug and B) it has been remedied by now. Can you let us know if this is behavior you're still seeing in practice?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Summary
Customer updated their service resource from:
https://pure-prod-pure-origin-dc1-core-prod.origin-dc1-core.company.com/claims-enrollments/v2
to
https://pure-prod-pure-origin-dc2-core-prod.origin-dc2-core.company.com/claims-enrollments/v2
Yet traffic through the Kong proxy would still route to the IP of https://pure-prod-pure-origin-dc1-core-prod.origin-dc1-core.company.com/claims-enrollments/v2 at a low rate over time. It seems 1 of the 4 pods still did the incorrect routing, and within that bad pod only 2 of the 6 worker processes exhibited the behavior of not picking up the update, even after an extended amount of time (minutes to hours later).
Admin API calls returned the correct backend URL on every call, though. So is it something to do with the cache or the C* (Cassandra) cluster_events table that helps distribute the change, or with the Kong machinery that keeps worker processes in sync? I've seen this behavior a few times, FWIW.
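One way to quantify the symptom from the client side is to hammer the route and tally which data center actually answered. A rough sketch, assuming the proxy URL shown and assuming each backend sets a distinguishing response header (X-Backend-DC is invented here, not part of the original report):

```python
# Rough diagnostic sketch; proxy URL, path, and the X-Backend-DC header are
# all assumptions.
import collections
import requests

PROXY_URL = "https://kong-proxy.example.com/claims-enrollments/v2/health"

counts = collections.Counter()
for _ in range(500):
    resp = requests.get(PROXY_URL, timeout=5)
    counts[resp.headers.get("X-Backend-DC", "unknown")] += 1

# With 4 pods x 6 workers and only 2 stale workers, the old data center would
# show up as a small but persistent fraction of the answers.
print(counts)
```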
Steps To Reproduce
Unsure for now; our environments are Kong nodes handling a lot of proxy traffic as well as a lot of OAuth2 token traffic (one speculation I had is that this clogs up the cluster_events table). I think a simple sandbox environment would likely never reproduce it; it probably requires gateways with active churn in the number of resources and heavy utilization. Hoping a move to DB-less may fix this issue long term too.
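To sanity-check the cluster_events speculation, one could look at how large that table actually gets under load. A minimal sketch against Cassandra, assuming the default `kong` keyspace and an example contact point (a full count over a very large table may time out and need a different approach):

```python
# Sketch only: the contact point and keyspace name are assumptions.
from cassandra.cluster import Cluster

cluster = Cluster(["cassandra-1.example.com"])
session = cluster.connect("kong")

# How many invalidation events are currently sitting in cluster_events.
row = session.execute("SELECT count(*) FROM cluster_events").one()
print(f"cluster_events rows: {row.count}")

cluster.shutdown()
```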
Additional Details & Logs
ENV Variables: