Kong / kong

🦍 The Cloud-Native API Gateway and AI Gateway.
https://konghq.com/install/#kong-community
Apache License 2.0

Kong throws 404 not found on route paths that exist at times. #4055

Closed jeremyjpj0916 closed 5 years ago

jeremyjpj0916 commented 5 years ago

Summary

We have noticed times on our Kong Gateway nodes when common endpoints the gateway exposes throw 404 "route not found" on a percentage of API calls. The specific proxy we focused on for this post ("/F5/status") does not route upstream, has no auth, and serves as an uptime ping endpoint that returns a static 200 success. We notice this on other endpoints that have auth and plugins as well, but our ping endpoint is consumed at a consistent rate and provides the best insight.

Steps To Reproduce

Reproducing this consistently seems impossible from our perspective at this time, but we will elaborate with as much detail and as many screenshots as we can.

  1. Create services and routes in a 1-to-1 combination. We have 130 service/route pairs and 300 plugins; some global, some applied directly to routes (acl/auth).

  2. Add additional service + route pairs with standard plugins over an extended period of time (a rough creation sketch follows this list).

  3. Our suspicion is that eventually the 404's will reveal themselves, and we have a high degree of confidence the problem does not happen globally across all worker processes.
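
For illustration, a rough sketch of bulk-creating service/route pairs through the Admin API (runnable with the resty CLI; it assumes the Admin API on localhost:8001 and lua-resty-http being available, and omits the acl/auth plugin setup for brevity):

-- create_pairs.lua -- run with: resty create_pairs.lua
local http  = require "resty.http"
local cjson = require "cjson.safe"

local ADMIN = "http://localhost:8001"  -- assumed Admin API address

local function post(path, body)
  local httpc = http.new()
  local res, err = httpc:request_uri(ADMIN .. path, {
    method  = "POST",
    body    = cjson.encode(body),
    headers = { ["Content-Type"] = "application/json" },
  })
  assert(res and res.status < 300, err or (res and res.body))
  return cjson.decode(res.body)
end

-- one service + one route per pair, mirroring the 1-to-1 layout described above
for i = 1, 130 do
  local svc = post("/services", { name = "svc-" .. i, url = "http://example.com" })
  post("/services/" .. svc.id .. "/routes", { paths = { "/svc-" .. i } })
end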

A new arbitrary Service+Route was created on the Gateway; what is important to note is the created timestamp (screenshot: servicecreatedattimestamp).

Converted to UTC (screenshot: timestampconverted).

The above time matches exactly the moment the existing "/F5/status" route began throwing 404 route not found errors (screenshot: f5statuserrorrateoccurs).

You can see a direct correlation between when that new service+route pair was created and when the gateway began to throw 404 not found errors on a route that exists and previously had no problems. Note the "/F5/status" endpoint takes consistent traffic at all times from health check monitors.

The interesting bit is the % of errors on this individual Kong node: we run 6 worker processes, and the error rate is almost exactly what you would expect if a single worker process were impacted (screenshot: f5status404rate).

To describe our architecture: we run Kong with Cassandra 3.x in 2 datacenters, 1 Kong node per datacenter, with a 6-node Cassandra cluster (3 C* nodes per datacenter). The errors only occurred in a single datacenter, on the Kong node in our examples above, but both datacenters share identical settings. We redeploy Kong on a weekly basis every Monday early AM, but the error presented above started on a Wednesday, so we can't correlate the problem to any sort of Kong startup issue.

To us the behavior points to cache rebuilding during new resource creation, based on what we can correlate. Sadly nothing in Kong's logging catches anything of interest when we notice the issue presenting itself. Also note it obviously does not happen every time, so it's a very hard issue to nail down.

We also notice the issue correcting itself at times. We have not traced the correction to anything specific yet, but I assume it is very likely when further services and routes are created after the errors are occurring, so that the seemingly problematic worker process has its router rebuilt again.

Other points I can make are that production has not seen this issue with identical Kong configurations and architecture, but production has fewer proxies and has new services or routes added at a much lower frequency (1-2 per week vs 20+ in non-prod).

I wonder at this time if it may also be safer for us to switch the cache TTL back from the 0 (infinite) value to some arbitrary number of hours to force-cycle the cached resources. I suppose if it is indeed the cache, as we suspect, that could actually make this issue present more frequently though.
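
(For reference, that change is just the db_cache_ttl setting; the value below is an arbitrary example, not a recommendation:)

db_cache_ttl = 3600          # kong.conf, in seconds
# or, as we set it via environment:
KONG_DB_CACHE_TTL=3600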

I may also write a Kong health script that grabs all routes on the gateway and calls each one by one to ensure they don't return a 404, as a daily sanity check for now. My biggest fear is that as production grows in size and/or sees a higher frequency of services/routes created daily, we may begin to see the issue present itself there as well, and that would cause dramatic impact to existing priority services if they start responding 404 because Kong does not recognize that the proxy route exists in the db and cache it appropriately.
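
A minimal sketch of what such a probe might look like, assuming the Admin API on localhost:8001, the proxy on localhost:8000, and lua-resty-http (runnable with the resty CLI; route paging and auth-protected routes are ignored for brevity):

-- probe_routes.lua -- run with: resty probe_routes.lua
local http  = require "resty.http"
local cjson = require "cjson.safe"

local ADMIN = "http://localhost:8001"  -- assumed Admin API address
local PROXY = "http://localhost:8000"  -- assumed proxy address

local httpc = http.new()

-- grab the first page of routes from the Admin API
local res, err = httpc:request_uri(ADMIN .. "/routes?size=1000")
assert(res, err)
local routes = cjson.decode(res.body).data or {}

-- hit every configured path on the proxy and flag router misses
for _, route in ipairs(routes) do
  for _, path in ipairs(route.paths or {}) do
    local r = httpc:request_uri(PROXY .. path)
    if r and r.status == 404 then
      print("router miss: " .. path .. " (route " .. route.id .. ")")
    end
  end
end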

Sorry I could not provide a 100% reproducible scenario for this situation; we can only go off the analytics we have. But if it helps uncover some underlying bug in how Kong currently manages services and routes, fixing that would bring huge stability to the core product.

Additional Details & Logs

{
  "plugins": {
    "enabled_in_cluster": [
      "correlation-id",
      "acl",
      "kong-oidc-multi-idp",
      "oauth2",
      "rate-limiting",
      "cors",
      "jwt",
      "kong-spec-expose",
      "request-size-limiting",
      "kong-response-size-limiting",
      "request-transformer",
      "request-termination",
      "kong-error-log",
      "kong-oidc-implicit-token",
      "kong-splunk-log",
      "kong-upstream-jwt",
      "kong-cluster-drain",
      "statsd"
    ],
    "available_on_server": {
      "kong-path-based-routing": true,
      "kong-spec-expose": true,
      "kong-cluster-drain": true,
      "correlation-id": true,
      "statsd": true,
      "jwt": true,
      "cors": true,
      "kong-oidc-multi-idp": true,
      "kong-response-size-limiting": true,
      "kong-oidc-auth": true,
      "kong-upstream-jwt": true,
      "kong-error-log": true,
      "request-termination": true,
      "request-size-limiting": true,
      "acl": true,
      "rate-limiting": true,
      "kong-service-virtualization": true,
      "request-transformer": true,
      "kong-oidc-implicit-token": true,
      "kong-splunk-log": true,
      "oauth2": true
    }
  },
  "tagline": "Welcome to kong",
  "configuration": {
    "plugins": [
      "kong-error-log",
      "kong-oidc-implicit-token",
      "kong-response-size-limiting",
      "cors",
      "request-transformer",
      "kong-service-virtualization",
      "kong-cluster-drain",
      "kong-upstream-jwt",
      "kong-splunk-log",
      "kong-spec-expose",
      "kong-oidc-auth",
      "kong-path-based-routing",
      "kong-oidc-multi-idp",
      "correlation-id",
      "oauth2",
      "statsd",
      "jwt",
      "rate-limiting",
      "acl",
      "request-size-limiting",
      "request-termination"
    ],
    "admin_ssl_enabled": false,
    "lua_ssl_verify_depth": 3,
    "trusted_ips": [
      "0.0.0.0/0",
      "::/0"
    ],
    "lua_ssl_trusted_certificate": "/usr/local/kong/ssl/kongcert.pem",
    "loaded_plugins": {
      "kong-path-based-routing": true,
      "kong-spec-expose": true,
      "kong-cluster-drain": true,
      "correlation-id": true,
      "statsd": true,
      "jwt": true,
      "cors": true,
      "rate-limiting": true,
      "kong-response-size-limiting": true,
      "kong-oidc-auth": true,
      "kong-upstream-jwt": true,
      "acl": true,
      "oauth2": true,
      "kong-splunk-log": true,
      "kong-oidc-implicit-token": true,
      "kong-error-log": true,
      "kong-service-virtualization": true,
      "request-transformer": true,
      "kong-oidc-multi-idp": true,
      "request-size-limiting": true,
      "request-termination": true
    },
    "cassandra_username": "****",
    "admin_ssl_cert_csr_default": "/usr/local/kong/ssl/admin-kong-default.csr",
    "ssl_cert_key": "/usr/local/kong/ssl/kongprivatekey.key",
    "dns_resolver": {},
    "pg_user": "kong",
    "mem_cache_size": "1024m",
    "cassandra_data_centers": [
      "dc1:2",
      "dc2:3"
    ],
    "nginx_admin_directives": {},
    "cassandra_password": "******",
    "custom_plugins": {},
    "pg_host": "127.0.0.1",
    "nginx_acc_logs": "/usr/local/kong/logs/access.log",
    "proxy_listen": [
      "0.0.0.0:8000",
      "0.0.0.0:8443 ssl http2"
    ],
    "client_ssl_cert_default": "/usr/local/kong/ssl/kong-default.crt",
    "ssl_cert_key_default": "/usr/local/kong/ssl/kong-default.key",
    "dns_no_sync": false,
    "db_update_propagation": 5,
    "nginx_err_logs": "/usr/local/kong/logs/error.log",
    "cassandra_port": 9042,
    "dns_order": [
      "LAST",
      "SRV",
      "A",
      "CNAME"
    ],
    "dns_error_ttl": 1,
    "headers": [
      "latency_tokens"
    ],
    "dns_stale_ttl": 4,
    "nginx_optimizations": true,
    "database": "cassandra",
    "pg_database": "kong",
    "nginx_worker_processes": "auto",
    "lua_package_cpath": "",
    "admin_acc_logs": "/usr/local/kong/logs/admin_access.log",
    "lua_package_path": "./?.lua;./?/init.lua;",
    "nginx_pid": "/usr/local/kong/pids/nginx.pid",
    "upstream_keepalive": 120,
    "cassandra_contact_points": [
      "server8429",
      "server8431",
      "server8432",
      "server8433",
      "server8445",
      "server8428"
    ],
    "admin_access_log": "off",
    "client_ssl_cert_csr_default": "/usr/local/kong/ssl/kong-default.csr",
    "proxy_listeners": [
      {
        "ssl": false,
        "ip": "0.0.0.0",
        "proxy_protocol": false,
        "port": 8000,
        "http2": false,
        "listener": "0.0.0.0:8000"
      },
      {
        "ssl": true,
        "ip": "0.0.0.0",
        "proxy_protocol": false,
        "port": 8443,
        "http2": true,
        "listener": "0.0.0.0:8443 ssl http2"
      }
    ],
    "proxy_ssl_enabled": true,
    "proxy_access_log": "off",
    "ssl_cert_csr_default": "/usr/local/kong/ssl/kong-default.csr",
    "enabled_headers": {
      "latency_tokens": true,
      "X-Upstream-Status": false,
      "X-Proxy-Latency": true,
      "server_tokens": false,
      "Server": false,
      "Via": false,
      "X-Upstream-Latency": true
    },
    "cassandra_ssl": true,
    "cassandra_local_datacenter": "DC2",
    "db_resurrect_ttl": 30,
    "db_update_frequency": 5,
    "cassandra_consistency": "LOCAL_QUORUM",
    "client_max_body_size": "100m",
    "admin_error_log": "/dev/stderr",
    "pg_ssl_verify": false,
    "dns_not_found_ttl": 30,
    "pg_ssl": false,
    "cassandra_repl_factor": 1,
    "cassandra_lb_policy": "RequestDCAwareRoundRobin",
    "cassandra_repl_strategy": "SimpleStrategy",
    "nginx_kong_conf": "/usr/local/kong/nginx-kong.conf",
    "error_default_type": "text/plain",
    "nginx_http_directives": {},
    "real_ip_header": "X-Forwarded-For",
    "kong_env": "/usr/local/kong/.kong_env",
    "cassandra_schema_consensus_timeout": 10000,
    "dns_hostsfile": "/etc/hosts",
    "admin_listeners": [
      {
        "ssl": false,
        "ip": "0.0.0.0",
        "proxy_protocol": false,
        "port": 8001,
        "http2": false,
        "listener": "0.0.0.0:8001"
      }
    ],
    "ssl_ciphers": "ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256",
    "ssl_cert": "/usr/local/kong/ssl/kongcert.crt",
    "prefix": "/usr/local/kong",
    "admin_ssl_cert_key_default": "/usr/local/kong/ssl/admin-kong-default.key",
    "cassandra_ssl_verify": true,
    "db_cache_ttl": 0,
    "ssl_cipher_suite": "modern",
    "real_ip_recursive": "on",
    "proxy_error_log": "/dev/stderr",
    "client_ssl_cert_key_default": "/usr/local/kong/ssl/kong-default.key",
    "nginx_daemon": "off",
    "anonymous_reports": false,
    "cassandra_timeout": 5000,
    "nginx_proxy_directives": {},
    "pg_port": 5432,
    "log_level": "debug",
    "client_body_buffer_size": "50m",
    "client_ssl": false,
    "lua_socket_pool_size": 30,
    "admin_ssl_cert_default": "/usr/local/kong/ssl/admin-kong-default.crt",
    "cassandra_keyspace": "kong_stage",
    "ssl_cert_default": "/usr/local/kong/ssl/kong-default.crt",
    "nginx_conf": "/usr/local/kong/nginx.conf",
    "admin_listen": [
      "0.0.0.0:8001"
    ]
  },
  "version": "0.14.1",
  "node_id": "cf0c92a9-724d-4972-baed-599785cff5ed",
  "lua_version": "LuaJIT 2.1.0-beta3",
  "prng_seeds": {
    "pid: 73": 183184219419,
    "pid: 71": 114224634222,
    "pid: 72": 213242339120,
    "pid: 70": 218221808514,
    "pid: 69": 233240145991,
    "pid: 68": 231238177547
  },
  "timers": {
    "pending": 4,
    "running": 0
  },
  "hostname": "kong-62-hxqnq"
}
jeremyjpj0916 commented 5 years ago

We were able to reproduce again with predictability.

Do note we run 6 worker processes.

  1. Start a heavy 100+ TPS load test against a given route path on your gateway and monitor the status codes coming off of it (use the request-termination plugin to make it return a 200 success or similar, so you get high throughput to the Kong node). Example commands follow this list.

  2. We already have a decent number of services and routes in place on our gateway, so understand we are creating a new service and route where 130+ already exist. Create a new service and route pair during that load test.

  3. Watch as the route taking the load at that instant throws 404's from a certain number of worker processes. Use the correlation-id plugin globally to tag each response with uuid#counter per worker, and you will generally see that some worker processes managed to rebuild the router just fine while under that load, while the other worker processes throw the Kong 404 "no route available" error on the route under load during that window.
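
For illustration, the kind of commands involved (the ports, names, and wrk as the load generator are assumptions; /F5/status stands in for whichever static 200 route you already have):

# drive roughly 100+ TPS at the static route for a minute
wrk -t2 -c20 -d60s http://localhost:8000/F5/status

# ...and while that runs, create an unrelated service + route pair every few seconds
curl -s -X POST http://localhost:8001/services -d name=svc-test-1 -d url=http://example.com
curl -s -X POST http://localhost:8001/services/svc-test-1/routes -d 'paths[]=/svc-test-1'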

What does this mean? I believe Kong needs a router rebuild mutex in place, so that while Kong works out changes to routes it still has a temp cache of all the "existing" routes to serve from during the rebuild window (I think @p0pr0ck5 implemented a mutex like this around Service resource create/update/delete for Kong 1.0).

Some interesting trailing thoughts on this as well. If it were just the instant of the rebuild that had the issue, why did we see existing routes like our /F5/status endpoint persist the 404 "no route found" error on certain worker processes? Maybe a timing thing, or maybe Kong invalidly caches that bad "miss" as truth and continues to serve that miss even though the route is present in the db? I am not sure, but I believe the mutex lock would probably stop that from happening anyways.

Let us know if we can provide you any further help but I think Kong will be able to reproduce at this point!

UPDATE: Was told in gitter chat that 1.0RC3 has the router mutex in place. Will see if that resolves the described issues here.

p0pr0ck5 commented 5 years ago

There's some interesting behavior in this that I don't know has been discussed before. The design goal of router rebuilds is to maintain a strongly consistent environment, to avoid worker processes sending different responses. Each worker maintains its own router, so rebuild events are broadcast through the worker_events mechanism:

https://github.com/Kong/kong/blob/next/kong/runloop/handler.lua#L418-L421

These events are handled based on a polling mechanism, since there's no direct inter-worker IPC mechanism available at this time (well, there is via a third-party module, but nothing that's provided as a first-class OpenResty primitive):

https://github.com/Kong/kong/blob/next/kong/init.lua#L432

At a given time, some worker processes may 'see' the event broadcast and act on it by forcing a router rebuild, while other worker processes will not yet have polled the shared zone. It's an incorrect assumption that all workers will handle an event at the exact same time. There will be some drift between workers handling events, so the behavior that you're seeing here is expected; it is the nature of relying on a polling mechanism inside a shim to broadcast events among all workers. It's been long known that the current solution is sub-optimal for cases like this, but again, a true IPC solution requires some major under-the-hood work.
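
To make the polling aspect concrete, here is a minimal standalone sketch of the lua-resty-worker-events pattern (the shm name and interval are placeholders; this is a simplification, not Kong's actual wiring):

-- nginx.conf needs: lua_shared_dict my_worker_events 1m;
local ev = require "resty.worker.events"

-- every worker configures the same shared dict and a poll interval
local ok, err = ev.configure {
  shm      = "my_worker_events",  -- placeholder shm name
  interval = 1,                   -- poll once per second (placeholder)
}
if not ok then
  ngx.log(ngx.ERR, "failed to configure worker events: ", err)
end

-- each worker registers a handler; it only fires when *that* worker polls
ev.register(function(data)
  ngx.log(ngx.DEBUG, "this worker just noticed the change; rebuild its router here")
end, "crud", "routes")

-- one worker broadcasts a change; the others pick it up on their next poll tick,
-- so there is up to ~interval seconds of drift between workers
ev.post("crud", "routes", { operation = "create" })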

A node-wide router definition might solve some problems, but as discussed before raises other issues and concerns (memory limitations, deserialization, etc).

BTW, the router rebuild mutex would not solve this at all. The goal of that work is to avoid stampeding the database in high-concurrency traffic situations when a router rebuild is required, but it still relies on the same polling mechanism, so the race-y behavior seen here would likely still manifest.

jeremyjpj0916 commented 5 years ago

Uh oh. Well, that is unfortunate; I guess the best way to solve this would be an inter-worker IPC mechanism then. Maybe in the meantime there could be a patch in place until Kong/OpenResty can get there with the IPC: something in the background that compares what is in the router cache vs the db and invalidates the router if mismatches are detected, maybe on a configurable timer that reconciles once a minute or so (I hate this approach so don't go with this lol)? It does not sound like a bug that can be fixed by the 1.0 release then, but do keep us posted if anything comes out of it or an elegant alternative approach takes shape in the meantime.

Edit - Another idea could be worker processes reconciling services + routes against each other at a given interval on a per-node basis, rather than evaluating against the DB. Anything to help prevent the 404's from persisting would be my biggest concern until a better solution can be found.
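
Purely to illustrate that (admittedly ugly) stop-gap idea, a sketch of a per-worker timer that periodically drops the cached router version so each worker rebuilds on its next request (hypothetical, not our tooling and not proposed as the real fix; it assumes Kong's internal singletons.cache):

-- somewhere in init_worker, as a crude reconciliation loop
local RECONCILE_INTERVAL = 60  -- seconds; placeholder value

local ok, err = ngx.timer.every(RECONCILE_INTERVAL, function(premature)
  if premature then
    return
  end

  -- dropping "router:version" makes get_router() fetch a fresh version and
  -- rebuild this worker's router on the next proxied request
  singletons.cache:invalidate_local("router:version")
end)
if not ok then
  ngx.log(ngx.ERR, "failed to create reconcile timer: ", err)
end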

I think for now we will limit changes to our Kong environments in prod to once nightly and use a script we whipped up to ensure no 404's are present post change.

jeremyjpj0916 commented 5 years ago

Updates: I was talking on the Kong public gitter and mentioned that I had been unable to reproduce in dev with very few Kong Service+Route+Consumer+Plugin pairs. This held true: when I have fewer than say 10 total services and consumers and such, I could never reproduce the error. I decided to ramp things up a tad by adding 500 Service+Route pairs, where each route had the ACL plugin enabled (with the whitelist entry being the route id) and the OAuth2 plugin enabled. I also created a consumer for each route, with an acl group tied to the route id, and gave each consumer a pair of oauth2 and jwt creds for kicks (mimics how I handle all created consumers internally).

After getting those resources in place to put a little meat on Kong's bones, I then ran a 100 TPS load test against our health check endpoint. It became easily reproducible then: adding service+route pairs during the load test produced the 404's once more, so it certainly seems Kong in its current form has a breaking limit on the number of resources the current internals can handle before things go a bit haywire. Granted, in this short round of testing in my dev environment I was not able to "persist" the 404's like I have also occasionally seen in Stage (where they keep throwing 404's until a service+route rebuild gets manually enforced), so a special sort of timing/race condition must have to occur for the 404 to persist. Each green bar spike of 404's is when service/route pairs were created manually in dev.

(screenshot: devloadtesttonight)

p0pr0ck5 commented 5 years ago

This almost sounds like a race condition when a new DAO iterator runs in multiple concurrent request handlers, because each page iteration results in thread-yielding network I/O. The end result would be that the request handler that writes the final version of the router cache value has incorrectly assumed itself to have completed. I haven't been able to track down exactly where that might occur, but it doesn't sound out of the realm of possibility at this point. In a wonderful showcase of hat eating, I suspect that the router rebuild mutex in place in the release candidates actually would solve this, because only one instance of the iterator would be executing at a given time inside a worker.

@jeremyjpj0916 have you had a crack at testing this against 1.0 RC2/RC3 yet?

jeremyjpj0916 commented 5 years ago

😄 @p0pr0ck5 don't be giving me false hope now, hah. I was holding off on testing RC3 after your first assessment stated you did not believe the mutex to be a fix for this behavior, but since it sounds like you think it may help, I am pushing that testing forward to today, as I consider understanding this issue critical to our Kong deployments' stability (as our production will grow in resource/proxy count and start seeing all the problems stage currently presents pretty regularly). Will post back here by late afternoon/tonight with my results of that testing!

Edit - With rc3 throwing errors, I will await rc4's docker-kong git release and resume analysis then!

p0pr0ck5 commented 5 years ago

@jeremyjpj0916 FWIW the router mutex logic made its way into RC2, so if you want to test on RC2 that would suffice for context here.

jeremyjpj0916 commented 5 years ago

@p0pr0ck5 good point, I will give that a go in an hour or so actually.

Edit - nvrmind, I just remembered rc2 is where the migrations issue occurs: acl groups set to a route uuid (how we do things internally) break the migrations commands (thibaultcha patched that in the rc3 release). So I guess I'll wait on rc4 still 🎅 (if y'all want to fix up RC2 or 3 with newer patches I can test that as well, but it might be better to focus on rc4 unless you consider me testing this a priority before 1.0).

Edit Edit - well look at that, 1.0rc4 is here :D, Christmas came early. If y'all get it updated here I will test it asap: https://github.com/Kong/docker-kong

Docker Kong info below:

ENV KONG_VERSION 1.0.0rc4
ENV KONG_SHA256 b65c7c01a4d523b1687898c05793aae71645e8ba0b89d59343c157f4f88db7dd

jeremyjpj0916 commented 5 years ago

@thibaultcha @p0pr0ck5 Bad news folks, 1.0.0rc4 actually causes worse impact around this.

A 100 TPS load test against our healthcheck endpoint yielded hundreds of these in the logs on the node taking the load (and even after the load test finished, the errors continue printing on every subsequent call too). Note that every 4 seconds or so we added a route to an arbitrary service not currently under the load test, on the same node being load tested. We then checked the other datacenter that was not taking any traffic, and somehow it also ended up with these in its logs (albeit fewer of them; EDIT - but growing too now, post-loadtest):

2018/12/12 00:35:11 [error] 145#0: *48178 [lua] responses.lua:128: before(): no router to route request (reason: error attempting to acquire build_router lock: timeout), client: 10.***.**.*, server: kong, request: "GET /F5/status HTTP/1.1", host: "gateway-dev-core-***.*****.***"

Rather than 404's, the errors now manifest themselves in the logs as 500's (screenshot: newloadtestwhilemodifyingroutes).

Gave the route path a quick try from a terminal in the other data center on a separate node as well, as we noticed it went down, meaning it started failing its health checks too with 500's (we were not even load testing this specific node, nor do we have cluster-wide rate limiting as a global or anything) (screenshot: localhostcurlinotherdc).

I find the individual tx log interesting: why is TotalLatency reported as 5 seconds on the dot? Any relation to KONG_CASSANDRA_TIMEOUT being set to 5000 in our environments? Note all my original config info from the admin API dump in my original post still holds true for the 1.0.0rc4 testing.
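
(For what it's worth, that 5-seconds-on-the-dot figure does line up with how the runloop derives the router-lock wait timeout from the database timeout; excerpted below from the handler pasted later in this thread, so a request appears to wait the full cassandra_timeout before giving up on the lock:)

local timeout = 60
if singletons.configuration.database == "cassandra" then
  -- cassandra_timeout is defined in ms
  timeout = singletons.configuration.cassandra_timeout / 1000
end

-- each request waits up to `timeout` seconds for the lock before erroring out
local ok, err = build_router_semaphore:wait(timeout)
if not ok then
  return nil, "error attempting to acquire build_router lock: " .. err
end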

Findings are as follows: we start seeing 500's under load, and it didn't take long until those errors in stdout started printing by the hundreds, overwhelming not only the node taking the load test traffic in one datacenter but somehow also the node in a totally different datacenter that was taking no traffic besides 1 ping call every 3-5 seconds (remember, the /F5/status endpoint just uses the request-termination plugin set to 200 success), which was failing as well.

Then, still 10-20 minutes after the load test, both of my nodes in either dc are unable to serve traffic on any routes; all of them throw the:

{"message":"An unexpected error occurred"}

And every call to any route like the above seems to take 5 seconds to respond with that error.

And the STDOUT logs reveal the same no router to route request (reason: error attempting to acquire build_router lock: timeout) on each of those calls.

20 minutes later I decided to see if there was anything in the db blowing up; the locks table was empty at the time and seemed like a new table. Nothing major stood out to me at a glance.

If you need any further dumps or analysis let me know. I think this should be easily reproducible if you run a load test of 100 TPS against an endpoint and make some changes to routes on your test node. What really stumps me is how the independent node in the other dc, which was not even taking traffic, gets similarly wrecked, lol; some sort of introduced inter-node dependency??

Other things I just noticed, comparing the node taking the load test vs the node in the other dc just chilling:

2018/12/12 00:35:11 [error] 141#0 through 2018/12/12 00:35:11 [error] 146#0

vs

2018/12/12 01:33:46 [error] 119#0 and 2018/12/12 01:33:46 [error] 122#0

All 6 worker processes on the node under load are still throwing the error in stdout; on the other DC node that was not load tested, only 2 of the 6 are throwing the error.

bungle commented 5 years ago

I have seen this happening when running our test suite. I have a feeling that it might be paging related.

bungle commented 5 years ago

I will try something tomorrow and ping you back.

thibaultcha commented 5 years ago

When I was trying to reproduce this on the plane earlier, I hit a couple of other issues with the DAO that I am in the process of fixing. Unfortunately, traveling and other things are getting in my way these days.

Those issues that I found should be mitigated by the router mutex though.

That said, I was not able to replicate your observations Jeremy. But as I am about to board another 12h flight, who knows, maybe I will have another chance at it, or @bungle will!

Thanks for giving rc4 a spin.

jeremyjpj0916 commented 5 years ago

Looking forward to any improvements that can be made. I would love for Kong 1.0 to have the ability to be configured on the fly with no availability impact to active traffic (or persisted impact), and to do so with a few thousand proxies and consumers, ideally. If I have any more findings I will post them here, or if you have any lua file changes or config changes you want me to give a go, let me know!

I have a high degree of confidence yall will get it right with some tinkering 😸 .

On a positive note, I did notice that the C* client connections did not jump nearly as dramatically when router changes were made during load, nor did the C* read latency, comparing 0.14.1 against the newer 1.0.0rc4, so some improvements seem to be showing at a high level.

p0pr0ck5 commented 5 years ago

Leaving a very quick note here as it’s late and today I spent fire-fighting elsewhere. @jeremyjpj0916 thanks immensely for all your efforts in testing this. I see a bug in the router rebuild mutex handling that was introduced after the original change was committed. I will be digging into this tomorrow.

jeremyjpj0916 commented 5 years ago

No thanks necessary, I am overjoyed with the level of engagement you all have with the community, and it's really awesome seeing the team iterate and improve on the product! Cheers to 1.0 nearing release as well as Kong's growth into 2019! Let me know when it's ready to test again and I will give it a go.

bungle commented 5 years ago

I am not sure if this is another issue, but I will leave a comment here too:

Imagine you are rebuilding routes and the rebuilder is on page 3. Someone adds or deletes a route on page 1. The rebuilder should now stop immediately and start again (I guess it does not do that?). Otherwise the rebuilder itself produces a stale router, at least for the requests it has queued, or at the very least for the one request that is currently rebuilding the router (possibly per worker).
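
A contrived sketch of that window, assuming a paged iterator like db.routes:each(1000) (names borrowed from the handler pasted later in this thread; simplified, not the actual rebuild code):

-- worker A starts a rebuild and pages through the routes
local routes = {}
for route, err in db.routes:each(1000) do  -- page 1, page 2, page 3, ...
  if err then
    return nil, err
  end
  routes[#routes + 1] = route
  -- each page fetch yields for network I/O; while worker A is reading page 3,
  -- another request can insert or delete a route that belongs on page 1
end

-- worker A finishes with a snapshot that is already stale: the change landed on
-- a page it had already consumed, so the router built from `routes` omits it
-- until the next "router:version" invalidation is noticed
local router = Router.new(routes)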

bungle commented 5 years ago

@jeremyjpj0916 there is now PR #4084; we would greatly appreciate it if you could check what effect it has on this issue.

thibaultcha commented 5 years ago

#4084 looks good. We should also fall back to rebuilding the router upon semaphore timeout (which the PR will definitely help prevent, but it is still a good safety measure to have).

That said, the original issue was present in the latest stable (0.14.1), so unless the router rebuild semaphore circumvents the issue, it is likely still there... I disabled the router semaphore on 1.0rc3 and ran a number of tests, but was unable to reproduce the issue you are seeing @jeremyjpj0916 (producing a 404 - router miss - upon router rebuild when a correct Route should be matched). Can you share a script or something that creates the Routes to test this against? I get that the number of Routes should be in the hundreds, but that did not seem to trigger any issue on my end, so some specific steps must be missing, like the order in which they are created, the paths to match, etc.

jeremyjpj0916 commented 5 years ago

@bungle Just woke up; I am on PTO today and tomorrow, but I intend to spend some time this afternoon/evening hacking those changes into the rc4 release and giving it a go, as I am super interested in it. Thanks for putting this together!

EDIT: So I am about to test here in a bit after finishing my lunch, but essentially after I pull down rc4 I am doing this in my build path (that handler.lua is the one from your PR):

COPY patches/handler.lua /usr/local/share/lua/5.1/kong/runloop/handler.lua
RUN chmod 777 /usr/local/share/lua/5.1/kong/runloop/handler.lua

Will give it a go!

@thibaultcha

I disabled the router semaphore on 1.0rc3 and ran a number of tests, but was unable to reproduce the issue you are seeing @jeremyjpj0916 (producing a 404 - router miss - upon router rebuild when a correct Route should be matched).

In the newer release candidates it presented itself with 500-level errors, like my splunk log showed above.

I think I can whip up something in Python that will create them all for you, as well as something to run live during the load test. My current logic is Java, but it's a mess and not easy to break out.

Could the reason you are having trouble reproducing be that your Kong node and db are all just localhost with no replication? We have to use a replication factor of 3 so all nodes in a dc hold all data, because without it OAuth token generation has issues under load (we went through those discussions a while back). I am curious if there is a way to add simulated latency in the 20-30ms range for C* reads/writes (probably close to realistic for what I see under load; reads sometimes actually spike into the 100-800ms range on 0.14.1) and whether the issues in my real environments would then become apparent in your test env.

thibaultcha commented 5 years ago

In the newer release candidates it presented itself with 500 level errors like my splunk log showed above.

That is the other issue related to the router semaphore fixed by #4084. My goal was to reproduce the behaviour you were originally describing here, but on the latest 1.0 branch, on which I needed to disable the router semaphore to mimic 0.14.

I am running a 3-node local cluster, but will increase the RF to 3 and give it another shot. With a QUORUM consistency, I suppose.

jeremyjpj0916 commented 5 years ago

@thibaultcha ah I see what you mean now.

@bungle I just tried your patch on this version:

ENV KONG_VERSION 1.0.0rc4
ENV KONG_SHA256 b65c7c01a4d523b1687898c05793aae71645e8ba0b89d59343c157f4f88db7dd

I am seeing this error at runtime:

2018/12/13 20:56:48 [error] 22#0: init_by_lua error: /usr/local/share/lua/5.1/kong/runloop/handler.lua:14: module 'kong.api_router' not found:No LuaRocks module found for kong.api_router
    no field package.preload['kong.api_router']
    no file './kong/api_router.lua'
    no file './kong/api_router/init.lua'
    no file '/usr/local/openresty/site/lualib/kong/api_router.ljbc'
    no file '/usr/local/openresty/site/lualib/kong/api_router/init.ljbc'
    no file '/usr/local/openresty/lualib/kong/api_router.ljbc'
    no file '/usr/local/openresty/lualib/kong/api_router/init.ljbc'
    no file '/usr/local/openresty/site/lualib/kong/api_router.lua'
    no file '/usr/local/openresty/site/lualib/kong/api_router/init.lua'
    no file '/usr/local/openresty/lualib/kong/api_router.lua'
    no file '/usr/local/openresty/lualib/kong/api_router/init.lua'
    no file '/usr/local/openresty/luajit/share/luajit-2.1.0-beta3/kong/api_router.lua'
    no file '/usr/local/share/lua/5.1/kong/api_router.lua'
    no file '/usr/local/share/lua/5.1/kong/api_router/init.lua'
    no file '/usr/local/openresty/luajit/share/lua/5.1/kong/api_router.lua'
    no file '/usr/local/openresty/luajit/share/lua/5.1/kong/api_router/init.lua'
    no file '/.luarocks/share/lua/5.1/kong/api_router.lua'
    no file '/.luarocks/share/lua/5.1/kong/api_router/init.lua'
    no file '/usr/local/openresty/site/lualib/kong/api_router.so'
    no file '/usr/local/openresty/lualib/kong/api_router.so'
    no file './kong/api_router.so'
    no file '/usr/local/lib/lua/5.1/kong/api_router.so'
    no file '/usr/local/openresty/luajit/lib/lua/5.1/kong/api_router.so'
    no file '/usr/local/lib/lua/5.1/loadall.so'
    no file '/.luarocks/lib/lua/5.1/kong/api_router.so'
    no file '/usr/local/openresty/site/lualib/kong.so'
    no file '/usr/local/openresty/lualib/kong.so'
    no file './kong.so'
    no file '/usr/local/lib/lua/5.1/kong.so'
    no file '/usr/local/openresty/luajit/lib/lua/5.1/kong.so'
    no file '/usr/local/lib/lua/5.1/loadall.so'
    no file '/.luarocks/lib/lua/5.1/kong.so'
stack traceback:
    [C]: in function 'require'
    /usr/local/share/lua/5.1/kong/runloop/handler.lua:14: in main chunk
    [C]: in function 'require'
    /usr/local/share/lua/5.1/kong/init.lua:71: in main chunk
    [C]: in function 'require'
    init_by_lua:2: in main chunk
nginx: [error] init_by_lua error: /usr/local/share/lua/5.1/kong/runloop/handler.lua:14: module 'kong.api_router' not found:No LuaRocks module found for kong.api_router
    no field package.preload['kong.api_router']
    no file './kong/api_router.lua'
    no file './kong/api_router/init.lua'
    no file '/usr/local/openresty/site/lualib/kong/api_router.ljbc'
    no file '/usr/local/openresty/site/lualib/kong/api_router/init.ljbc'
    no file '/usr/local/openresty/lualib/kong/api_router.ljbc'
    no file '/usr/local/openresty/lualib/kong/api_router/init.ljbc'
    no file '/usr/local/openresty/site/lualib/kong/api_router.lua'
    no file '/usr/local/openresty/site/lualib/kong/api_router/init.lua'
    no file '/usr/local/openresty/lualib/kong/api_router.lua'
    no file '/usr/local/openresty/lualib/kong/api_router/init.lua'
    no file '/usr/local/openresty/luajit/share/luajit-2.1.0-beta3/kong/api_router.lua'
    no file '/usr/local/share/lua/5.1/kong/api_router.lua'
    no file '/usr/local/share/lua/5.1/kong/api_router/init.lua'
    no file '/usr/local/openresty/luajit/share/lua/5.1/kong/api_router.lua'
    no file '/usr/local/openresty/luajit/share/lua/5.1/kong/api_router/init.lua'
    no file '/.luarocks/share/lua/5.1/kong/api_router.lua'
    no file '/.luarocks/share/lua/5.1/kong/api_router/init.lua'
    no file '/usr/local/openresty/site/lualib/kong/api_router.so'
    no file '/usr/local/openresty/lualib/kong/api_router.so'
    no file './kong/api_router.so'
    no file '/usr/local/lib/lua/5.1/kong/api_router.so'
    no file '/usr/local/openresty/luajit/lib/lua/5.1/kong/api_router.so'
    no file '/usr/local/lib/lua/5.1/loadall.so'
    no file '/.luarocks/lib/lua/5.1/kong/api_router.so'
    no file '/usr/local/openresty/site/lualib/kong.so'
    no file '/usr/local/openresty/lualib/kong.so'
    no file './kong.so'
    no file '/usr/local/lib/lua/5.1/kong.so'
    no file '/usr/local/openresty/luajit/lib/lua/5.1/kong.so'
    no file '/usr/local/lib/lua/5.1/loadall.so'
    no file '/.luarocks/lib/lua/5.1/kong.so'
stack traceback:
    [C]: in function 'require'
    /usr/local/share/lua/5.1/kong/runloop/handler.lua:14: in main chunk
    [C]: in function 'require'
    /usr/local/share/lua/5.1/kong/init.lua:71: in main chunk
    [C]: in function 'require'
init_by_lua:2: in main chunk

Is the handler you patched still including the deprecated API resources? I have to test against rc4 for my dev environment.

EDIT - Ahh, yep, it seems it is as I glance at the code; I will just rid the patched handler of all the deprecated references in my local git and test.

thibaultcha commented 5 years ago

The patch was created on top of the next branch, which still has APIs and such, while rc4 is cut from the release branch, which has additional commits deleting APIs. So the patch must be appropriately tweaked for it to work in rc4; one would not be able to copy-paste the file as-is.

jeremyjpj0916 commented 5 years ago

I ran through most of the code quickly and gutted what I believe needed to go; this should be fine as long as the underlying method calls that dropped the trailing `, api)` argument still take their remaining parameters in the same way:

Jeremy hacked handler:

-- Kong runloop
--
-- This consists of local_events that need to
-- be ran at the very beginning and very end of the lua-nginx-module contexts.
-- It mainly carries information related to a request from one context to the next one,
-- through the `ngx.ctx` table.
--
-- In the `access_by_lua` phase, it is responsible for retrieving the route being proxied by
-- a consumer. Then it is responsible for loading the plugins to execute on this request.
local ck          = require "resty.cookie"
local meta        = require "kong.meta"
local utils       = require "kong.tools.utils"
local Router      = require "kong.router"
local reports     = require "kong.reports"
local balancer    = require "kong.runloop.balancer"
local mesh        = require "kong.runloop.mesh"
local constants   = require "kong.constants"
local semaphore   = require "ngx.semaphore"
local responses   = require "kong.tools.responses"
local singletons  = require "kong.singletons"
local certificate = require "kong.runloop.certificate"

local kong        = kong
local tostring    = tostring
local tonumber    = tonumber
local sub         = string.sub
local lower       = string.lower
local fmt         = string.format
local sort        = table.sort
local ngx         = ngx
local log         = ngx.log
local ngx_now     = ngx.now
local update_time = ngx.update_time
local subsystem   = ngx.config.subsystem
local unpack      = unpack

local ERR         = ngx.ERR
local DEBUG       = ngx.DEBUG

local CACHE_ROUTER_OPTS = { ttl = 0 }
local EMPTY_T = {}

local get_router, build_router
local server_header = meta._SERVER_TOKENS

local build_router_semaphore

local function get_now()
  update_time()
  return ngx_now() * 1000 -- time is kept in seconds with millisecond resolution.
end

do
  -- Given a protocol, return the subsystem that handles it
  local protocol_subsystem = {
    http = "http",
    https = "http",
    tcp = "stream",
    tls = "stream",
  }

  local router
  local router_version

  build_router = function(db, version)
    local routes, i = {}, 0

    for route, err in db.routes:each(1000) do
      if err then
        return nil, "could not load routes: " .. err
      end

      local service_pk = route.service

      if not service_pk then
        return nil, "route (" .. route.id .. ") is not associated with service"
      end

      local service, err = db.services:select(service_pk)
      if not service then
        return nil, "could not find service for route (" .. route.id .. "): " ..
                    err
      end

      local stype = protocol_subsystem[service.protocol]
      if subsystem == stype then
        local r = {
          route   = route,
          service = service,
        }

        if stype == "http" and route.hosts then
          -- TODO: headers should probably be moved to route
          r.headers = {
            host = route.hosts,
          }
        end

        i = i + 1
        routes[i] = r
      end
    end

    sort(routes, function(r1, r2)
      r1, r2 = r1.route, r2.route
      if r1.regex_priority == r2.regex_priority then
        return r1.created_at < r2.created_at
      end
      return r1.regex_priority > r2.regex_priority
    end)

    local err
    router, err = Router.new(routes)
    if not router then
      return nil, "could not create router: " .. err
    end

    if version then
      router_version = version
    end

    singletons.router = router

    return true
  end

  local function check_router_rebuild()
    -- we might not need to rebuild the router (if we were not
    -- the first request in this process to enter this code path)
    -- check again and rebuild only if necessary
    local version, err = singletons.cache:get("router:version",
                                              CACHE_ROUTER_OPTS,
                                              utils.uuid)
    if err then
      log(ngx.CRIT, "could not ensure router is up to date: ", err)
      return nil, err
    end

    if version == router_version then
      return true
    end

    -- router needs to be rebuilt in this worker
    log(DEBUG, "rebuilding router")

    local ok, err = build_router(singletons.db, version)
    if not ok then
      log(ngx.CRIT, "could not rebuild router: ", err)
      return nil, err
    end

    return true
  end

  get_router = function()
    local version, err = singletons.cache:get("router:version",
                                              CACHE_ROUTER_OPTS,
                                              utils.uuid)
    if err then
      log(ngx.CRIT, "could not ensure router is up to date: ", err)
      return nil, err
    end

    if version == router_version then
      return router
    end

    -- wrap router rebuilds in a per-worker mutex (via ngx.semaphore)
    -- this prevents dogpiling the database during rebuilds in
    -- high-concurrency traffic patterns
    -- requests that arrive on this process during a router rebuild will be
    -- queued. once the semaphore resource is acquired we re-check the
    -- router version again to prevent unnecessary subsequent rebuilds

    local timeout = 60
    if singletons.configuration.database == "cassandra" then
      -- cassandra_timeout is defined in ms
      timeout = singletons.configuration.cassandra_timeout / 1000

    elseif singletons.configuration.database == "postgres" then
      -- pg_timeout is defined in ms
      timeout = singletons.configuration.pg_timeout / 1000
    end

    -- acquire lock
    local ok, err = build_router_semaphore:wait(timeout)
    if not ok then
      return nil, "error attempting to acquire build_router lock: " .. err
    end

    local pok
    pok, ok, err = pcall(check_router_rebuild)

    -- release lock
    build_router_semaphore:post(1)

    if not pok then
      return nil, ok
    end

    if not ok then
      return nil, err
    end

    return router
  end
end

local function balancer_setup_stage1(ctx, scheme, host_type, host, port,
                                     service, route)
  local balancer_data = {
    scheme         = scheme,    -- scheme for balancer: http, https
    type           = host_type, -- type of 'host': ipv4, ipv6, name
    host           = host,      -- target host per `upstream_url`
    port           = port,      -- final target port
    try_count      = 0,         -- retry counter
    tries          = {},        -- stores info per try
    ssl_ctx        = kong.default_client_ssl_ctx, -- SSL_CTX* to use
    -- ip          = nil,       -- final target IP address
    -- balancer    = nil,       -- the balancer object, if any
    -- hostname    = nil,       -- hostname of the final target IP
    -- hash_cookie = nil,       -- if Upstream sets hash_on_cookie
  }

  -- TODO: this is probably not optimal
  do
    local retries = service.retries
    if retries then
      balancer_data.retries = retries

    else
      balancer_data.retries = 5
    end

    local connect_timeout = service.connect_timeout

    if connect_timeout then
      balancer_data.connect_timeout = connect_timeout

    else
      balancer_data.connect_timeout = 60000
    end

    local send_timeout = service.write_timeout 

    if send_timeout then
      balancer_data.send_timeout = send_timeout

    else
      balancer_data.send_timeout = 60000
    end

    local read_timeout = service.read_timeout 

    if read_timeout then
      balancer_data.read_timeout = read_timeout

    else
      balancer_data.read_timeout = 60000
    end
  end

  -- TODO: this needs to be removed when references to ctx.api are removed
  ctx.service          = service
  ctx.route            = route
  ctx.balancer_data    = balancer_data
  ctx.balancer_address = balancer_data -- for plugin backward compatibility
end

local function balancer_setup_stage2(ctx)
  local balancer_data = ctx.balancer_data

  do -- Check for KONG_ORIGINS override
    local origin_key = balancer_data.scheme .. "://" ..
                       utils.format_host(balancer_data)
    local origin = singletons.origins[origin_key]
    if origin then
      balancer_data.scheme = origin.scheme
      balancer_data.type = origin.type
      balancer_data.host = origin.host
      balancer_data.port = origin.port
    end
  end

  local ok, err, errcode = balancer.execute(balancer_data, ctx)
  if not ok and errcode == 500 then
    err = "failed the initial dns/balancer resolve for '" ..
          balancer_data.host .. "' with: " .. tostring(err)
  end

  return ok, err, errcode
end

-- in the table below the `before` and `after` is to indicate when they run:
-- before or after the plugins
return {
  build_router     = build_router,

  init_worker = {
    before = function()
      reports.init_worker()

      -- initialize local local_events hooks
      local db             = singletons.db
      local dao            = singletons.dao
      local cache          = singletons.cache
      local worker_events  = singletons.worker_events
      local cluster_events = singletons.cluster_events

      -- get a cache key using either the old or new DAO
      local function get_cache_key(data, entity)
        if data.schema.table then
          return dao[data.schema.table]:entity_cache_key(entity)
        else
          return db[data.schema.name]:cache_key(entity)
        end
      end

      -- events dispatcher

      worker_events.register(function(data)
        if not data.schema then
          log(ngx.ERR, "[events] missing schema in crud subscriber")
          return
        end

        if not data.entity then
          log(ngx.ERR, "[events] missing entity in crud subscriber")
          return
        end

        -- invalidate this entity anywhere it is cached if it has a
        -- caching key

        local cache_key = get_cache_key(data, data.entity)

        if cache_key then
          cache:invalidate(cache_key)
        end

        -- if we had an update, but the cache key was part of what was updated,
        -- we need to invalidate the previous entity as well

        if data.old_entity then
          cache_key = get_cache_key(data, data.old_entity)
          if cache_key then
            cache:invalidate(cache_key)
          end
        end

        if not data.operation then
          log(ngx.ERR, "[events] missing operation in crud subscriber")
          return
        end

        -- public worker events propagation

        local entity_channel           = data.schema.table or data.schema.name
        local entity_operation_channel = fmt("%s:%s", entity_channel,
                                             data.operation)

        -- crud:routes
        local _, err = worker_events.post_local("crud", entity_channel, data)
        if err then
          log(ngx.ERR, "[events] could not broadcast crud event: ", err)
          return
        end

        -- crud:routes:create
        _, err = worker_events.post_local("crud", entity_operation_channel, data)
        if err then
          log(ngx.ERR, "[events] could not broadcast crud event: ", err)
          return
        end
      end, "dao:crud")

      -- local events (same worker)

      worker_events.register(function()
        log(DEBUG, "[events] Route updated, invalidating router")
        cache:invalidate("router:version")
      end, "crud", "routes")

      worker_events.register(function(data)
        if data.operation ~= "create" and
           data.operation ~= "delete"
        then
          -- no need to rebuild the router if we just added a Service
          -- since no Route is pointing to that Service yet.
          -- ditto for deletion: if a Service if being deleted, it is
          -- only allowed because no Route is pointing to it anymore.
          log(DEBUG, "[events] Service updated, invalidating router")
          cache:invalidate("router:version")
        end
      end, "crud", "services")

      worker_events.register(function(data)
        -- assume an update doesnt also change the whole entity!
        if data.operation ~= "update" then
          log(DEBUG, "[events] Plugin updated, invalidating plugins map")
          cache:invalidate("plugins_map:version")
        end
      end, "crud", "plugins")

      -- SSL certs / SNIs invalidations

      worker_events.register(function(data)
        log(DEBUG, "[events] SNI updated, invalidating cached certificates")
        local sn = data.entity

        cache:invalidate("certificates:" .. sn.name)
      end, "crud", "snis")

      worker_events.register(function(data)
        log(DEBUG, "[events] SSL cert updated, invalidating cached certificates")
        local certificate = data.entity

        for sn, err in db.snis:each_for_certificate({ id = certificate.id }) do
          if err then
            log(ERR, "[events] could not find associated snis for certificate: ",
                     err)
            break
          end

          cache:invalidate("certificates:" .. sn.name)
        end
      end, "crud", "certificates")

      -- target updates

      -- worker_events local handler: event received from DAO
      worker_events.register(function(data)
        local operation = data.operation
        local target = data.entity
        -- => to worker_events node handler
        local ok, err = worker_events.post("balancer", "targets", {
          operation = data.operation,
          entity = data.entity,
        })
        if not ok then
          log(ERR, "failed broadcasting target ",
              operation, " to workers: ", err)
        end
        -- => to cluster_events handler
        local key = fmt("%s:%s", operation, target.upstream.id)
        ok, err = cluster_events:broadcast("balancer:targets", key)
        if not ok then
          log(ERR, "failed broadcasting target ", operation, " to cluster: ", err)
        end
      end, "crud", "targets")

      -- worker_events node handler
      worker_events.register(function(data)
        local operation = data.operation
        local target = data.entity

        -- => to balancer update
        balancer.on_target_event(operation, target)
      end, "balancer", "targets")

      -- cluster_events handler
      cluster_events:subscribe("balancer:targets", function(data)
        local operation, key = unpack(utils.split(data, ":"))
        -- => to worker_events node handler
        local ok, err = worker_events.post("balancer", "targets", {
          operation = operation,
          entity = {
            upstream = { id = key },
          }
        })
        if not ok then
          log(ERR, "failed broadcasting target ", operation, " to workers: ", err)
        end
      end)

      -- manual health updates
      cluster_events:subscribe("balancer:post_health", function(data)
        local pattern = "([^|]+)|([^|]+)|([^|]+)|([^|]+)|(.*)"
        local ip, port, health, id, name = data:match(pattern)
        port = tonumber(port)
        local upstream = { id = id, name = name }
        local ok, err = balancer.post_health(upstream, ip, port, health == "1")
        if not ok then
          log(ERR, "failed posting health of ", name, " to workers: ", err)
        end
      end)

      -- upstream updates

      -- worker_events local handler: event received from DAO
      worker_events.register(function(data)
        local operation = data.operation
        local upstream = data.entity
        -- => to worker_events node handler
        local ok, err = worker_events.post("balancer", "upstreams", {
          operation = data.operation,
          entity = data.entity,
        })
        if not ok then
          log(ERR, "failed broadcasting upstream ",
              operation, " to workers: ", err)
        end
        -- => to cluster_events handler
        local key = fmt("%s:%s:%s", operation, upstream.id, upstream.name)
        ok, err = cluster_events:broadcast("balancer:upstreams", key)
        if not ok then
          log(ERR, "failed broadcasting upstream ", operation, " to cluster: ", err)
        end
      end, "crud", "upstreams")

      -- worker_events node handler
      worker_events.register(function(data)
        local operation = data.operation
        local upstream = data.entity

        -- => to balancer update
        balancer.on_upstream_event(operation, upstream)
      end, "balancer", "upstreams")

      cluster_events:subscribe("balancer:upstreams", function(data)
        local operation, id, name = unpack(utils.split(data, ":"))
        -- => to worker_events node handler
        local ok, err = worker_events.post("balancer", "upstreams", {
          operation = operation,
          entity = {
            id = id,
            name = name,
          }
        })
        if not ok then
          log(ERR, "failed broadcasting upstream ", operation, " to workers: ", err)
        end
      end)

      -- initialize balancers for active healthchecks
      ngx.timer.at(0, function()
        balancer.init()
      end)

      do
        local err

        build_router_semaphore, err = semaphore.new()
        if err then
          log(ngx.CRIT, "failed to create build_router_semaphore: ", err)
        end

        build_router_semaphore:post(1)
      end
    end
  },
  certificate = {
    before = function(_)
      certificate.execute()
    end
  },
  rewrite = {
    before = function(ctx)
      ctx.KONG_REWRITE_START = get_now()
      mesh.rewrite()
    end,
    after = function(ctx)
      ctx.KONG_REWRITE_TIME = get_now() - ctx.KONG_REWRITE_START -- time spent in Kong's rewrite_by_lua
    end
  },
  preread = {
    before = function(ctx)
      local router, err = get_router()
      if not router then
        log(ERR, "no router to route connection (reason: " .. err .. ")")
        return ngx.exit(500)
      end

      local match_t = router.exec(ngx)
      if not match_t then
        log(ERR, "no Route found with those values")
        return ngx.exit(500)
      end

      local var = ngx.var

      local ssl_termination_ctx -- OpenSSL SSL_CTX to use for termination

      local ssl_preread_alpn_protocols = var.ssl_preread_alpn_protocols
      -- ssl_preread_alpn_protocols is a comma separated list
      -- see https://trac.nginx.org/nginx/ticket/1616
      if ssl_preread_alpn_protocols and
         ssl_preread_alpn_protocols:find(mesh.get_mesh_alpn(), 1, true) then
        -- Is probably an incoming service mesh connection
        -- terminate service-mesh Mutual TLS
        ssl_termination_ctx = mesh.mesh_server_ssl_ctx
        ctx.is_service_mesh_request = true
      else
        -- TODO: stream router should decide if TLS is terminated or not
        -- XXX: for now, use presence of SNI to terminate.
        local sni = var.ssl_preread_server_name
        if sni then
          ngx.log(ngx.DEBUG, "SNI: ", sni)

          local err
          ssl_termination_ctx, err = certificate.find_certificate(sni)
          if not ssl_termination_ctx then
            ngx.log(ngx.ERR, err)
            return ngx.exit(ngx.ERROR)
          end

          -- TODO Fake certificate phase?

          ngx.log(ngx.INFO, "attempting to terminate TLS")
        end
      end

      -- Terminate TLS
      if ssl_termination_ctx and not ngx.req.starttls(ssl_termination_ctx) then -- luacheck: ignore
        -- errors are logged by nginx core
        return ngx.exit(ngx.ERROR)
      end

      ctx.KONG_PREREAD_START = get_now()

      local route = match_t.route or EMPTY_T
      local service = match_t.service or EMPTY_T
      local upstream_url_t = match_t.upstream_url_t

      balancer_setup_stage1(ctx, match_t.upstream_scheme,
                            upstream_url_t.type,
                            upstream_url_t.host,
                            upstream_url_t.port,
                            service, route)
    end,
    after = function(ctx)
      local ok, err, errcode = balancer_setup_stage2(ctx)
      if not ok then
        return responses.send(errcode, err)
      end

      local now = get_now()

      -- time spent in Kong's preread_by_lua
      ctx.KONG_PREREAD_TIME     = now - ctx.KONG_PREREAD_START
      ctx.KONG_PREREAD_ENDED_AT = now
      -- time spent in Kong before sending the request to upstream
      -- ngx.req.start_time() is kept in seconds with millisecond resolution.
      ctx.KONG_PROXY_LATENCY   = now - ngx.req.start_time() * 1000
      ctx.KONG_PROXIED         = true
    end
  },
  access = {
    before = function(ctx)
      -- ensure routers are up-to-date
      local cache = singletons.cache

      -- router for Routes/Services

      local router, err = get_router()
      if not router then
        return responses.send_HTTP_INTERNAL_SERVER_ERROR(
          "no router to route request (reason: " .. err ..  ")")
      end

      -- routing request

      local var = ngx.var

      ctx.KONG_ACCESS_START = get_now()

      local match_t = router.exec(ngx)
      if not match_t then
        return responses.send_HTTP_NOT_FOUND("no route found with those values")
      end

      local route              = match_t.route or EMPTY_T
      local service            = match_t.service or EMPTY_T
      local upstream_url_t     = match_t.upstream_url_t

      local realip_remote_addr = var.realip_remote_addr
      local forwarded_proto
      local forwarded_host
      local forwarded_port

      -- X-Forwarded-* Headers Parsing
      --
      -- We could use $proxy_add_x_forwarded_for, but it does not work properly
      -- with the realip module. The realip module overrides $remote_addr and it
      -- is okay for us to use it in case no X-Forwarded-For header was present.
      -- But in case it was given, we will append the $realip_remote_addr that
      -- contains the IP that was originally in $remote_addr before realip
      -- module overrode that (aka the client that connected us).

      local trusted_ip = kong.ip.is_trusted(realip_remote_addr)
      if trusted_ip then
        forwarded_proto = var.http_x_forwarded_proto or var.scheme
        forwarded_host  = var.http_x_forwarded_host  or var.host
        forwarded_port  = var.http_x_forwarded_port  or var.server_port

      else
        forwarded_proto = var.scheme
        forwarded_host  = var.host
        forwarded_port  = var.server_port
      end

      local protocols = route.protocols
      if (protocols and protocols.https and not protocols.http and forwarded_proto ~= "https")
      then
        ngx.header["connection"] = "Upgrade"
        ngx.header["upgrade"]    = "TLS/1.2, HTTP/1.1"
        return responses.send(426, "Please use HTTPS protocol")
      end

      balancer_setup_stage1(ctx, match_t.upstream_scheme,
                            upstream_url_t.type,
                            upstream_url_t.host,
                            upstream_url_t.port,
                            service, route)

      ctx.router_matches = match_t.matches

      -- `uri` is the URI with which to call upstream, as returned by the
      --       router, which might have truncated it (`strip_uri`).
      -- `host` is the original header to be preserved if set.
      var.upstream_scheme = match_t.upstream_scheme -- COMPAT: pdk
      var.upstream_uri    = match_t.upstream_uri
      var.upstream_host   = match_t.upstream_host

      -- Keep-Alive and WebSocket Protocol Upgrade Headers
      if var.http_upgrade and lower(var.http_upgrade) == "websocket" then
        var.upstream_connection = "upgrade"
        var.upstream_upgrade    = "websocket"

      else
        var.upstream_connection = "keep-alive"
      end

      -- X-Forwarded-* Headers
      local http_x_forwarded_for = var.http_x_forwarded_for
      if http_x_forwarded_for then
        var.upstream_x_forwarded_for = http_x_forwarded_for .. ", " ..
                                       realip_remote_addr

      else
        var.upstream_x_forwarded_for = var.remote_addr
      end

      var.upstream_x_forwarded_proto = forwarded_proto
      var.upstream_x_forwarded_host  = forwarded_host
      var.upstream_x_forwarded_port  = forwarded_port
    end,
    -- Only executed if the `router` module found a route and allows nginx to proxy it.
    after = function(ctx)
      local var = ngx.var

      do
        -- Nginx's behavior when proxying a request with an empty querystring
        -- `/foo?` is to keep `$is_args` an empty string, hence effectively
        -- stripping the empty querystring.
        -- We overcome this behavior with our own logic, to preserve user
        -- desired semantics.
        local upstream_uri = var.upstream_uri

        if var.is_args == "?" or sub(var.request_uri, -1) == "?" then
          var.upstream_uri = upstream_uri .. "?" .. (var.args or "")
        end
      end

      local balancer_data = ctx.balancer_data
      balancer_data.scheme = var.upstream_scheme -- COMPAT: pdk

      local ok, err, errcode = balancer_setup_stage2(ctx)
      if not ok then
        return responses.send(errcode, err)
      end

      var.upstream_scheme = balancer_data.scheme

      do
        -- set the upstream host header if not `preserve_host`
        local upstream_host = var.upstream_host

        if not upstream_host or upstream_host == "" then
          upstream_host = balancer_data.hostname

          local upstream_scheme = var.upstream_scheme
          if upstream_scheme == "http"  and balancer_data.port ~= 80 or
             upstream_scheme == "https" and balancer_data.port ~= 443
          then
            upstream_host = upstream_host .. ":" .. balancer_data.port
          end

          var.upstream_host = upstream_host
        end
      end

      local now = get_now()

      -- time spent in Kong's access_by_lua
      ctx.KONG_ACCESS_TIME     = now - ctx.KONG_ACCESS_START
      ctx.KONG_ACCESS_ENDED_AT = now
      -- time spent in Kong before sending the request to upstream
      -- ngx.req.start_time() is kept in seconds with millisecond resolution.
      ctx.KONG_PROXY_LATENCY   = now - ngx.req.start_time() * 1000
      ctx.KONG_PROXIED         = true
    end
  },
  balancer = {
    before = function(ctx)
      local balancer_data = ctx.balancer_data
      local current_try = balancer_data.tries[balancer_data.try_count]
      current_try.balancer_start = get_now()
    end,
    after = function(ctx)
      local balancer_data = ctx.balancer_data
      local current_try = balancer_data.tries[balancer_data.try_count]

      -- record try-latency
      local try_latency = get_now() - current_try.balancer_start
      current_try.balancer_latency = try_latency

      -- record overall latency
      ctx.KONG_BALANCER_TIME = (ctx.KONG_BALANCER_TIME or 0) + try_latency
    end
  },
  header_filter = {
    before = function(ctx)
      local header = ngx.header

      if not ctx.KONG_PROXIED then
        return
      end

      local now = get_now()
      -- time spent waiting for a response from upstream
      ctx.KONG_WAITING_TIME             = now - ctx.KONG_ACCESS_ENDED_AT
      ctx.KONG_HEADER_FILTER_STARTED_AT = now

      local upstream_status_header = constants.HEADERS.UPSTREAM_STATUS
      if singletons.configuration.enabled_headers[upstream_status_header] then
        header[upstream_status_header] = tonumber(sub(ngx.var.upstream_status or "", -3))
        if not header[upstream_status_header] then
          log(ERR, "failed to set ", upstream_status_header, " header")
        end
      end

      local hash_cookie = ctx.balancer_data.hash_cookie
      if not hash_cookie then
        return
      end

      local cookie = ck:new()
      local ok, err = cookie:set(hash_cookie)

      if not ok then
        log(ngx.WARN, "failed to set the cookie for hash-based load balancing: ", err,
                      " (key=", hash_cookie.key,
                      ", path=", hash_cookie.path, ")")
      end
    end,
    after = function(ctx)
      local header = ngx.header

      if ctx.KONG_PROXIED then
        if singletons.configuration.enabled_headers[constants.HEADERS.UPSTREAM_LATENCY] then
          header[constants.HEADERS.UPSTREAM_LATENCY] = ctx.KONG_WAITING_TIME
        end

        if singletons.configuration.enabled_headers[constants.HEADERS.PROXY_LATENCY] then
          header[constants.HEADERS.PROXY_LATENCY] = ctx.KONG_PROXY_LATENCY
        end

        if singletons.configuration.enabled_headers[constants.HEADERS.VIA] then
          header[constants.HEADERS.VIA] = server_header
        end

      else
        if singletons.configuration.enabled_headers[constants.HEADERS.SERVER] then
          header[constants.HEADERS.SERVER] = server_header

        else
          header[constants.HEADERS.SERVER] = nil
        end
      end
    end
  },
  body_filter = {
    after = function(ctx)
      if not ngx.arg[2] then
        return
      end

      local now = get_now()
      ctx.KONG_BODY_FILTER_ENDED_AT = now

      if ctx.KONG_PROXIED then
        -- time spent receiving the response (header_filter + body_filter)
        -- we could use $upstream_response_time but we need to distinguish the waiting time
        -- from the receiving time in our logging plugins (especially ALF serializer).
        ctx.KONG_RECEIVE_TIME = now - ctx.KONG_HEADER_FILTER_STARTED_AT
      end
    end
  },
  log = {
    after = function(ctx)
      reports.log()

      if not ctx.KONG_PROXIED then
        return
      end

      -- If response was produced by an upstream (ie, not by a Kong plugin)
      -- Report HTTP status for health checks
      local balancer_data = ctx.balancer_data
      if balancer_data and balancer_data.balancer and balancer_data.ip then
        local ip, port = balancer_data.ip, balancer_data.port

        local status = ngx.status
        if status == 504 then
          balancer_data.balancer.report_timeout(ip, port)
        else
          balancer_data.balancer.report_http_status(ip, port, status)
        end
      end
    end
  }
}
jeremyjpj0916 commented 5 years ago

Aight, yep, seems with the above ^ dropped into Kong it can start up; we are in business folks. Time to test.

jeremyjpj0916 commented 5 years ago

@thibaultcha @bungle @p0pr0ck5

Initial results are looking quality here folks: no mutex errors thrown in STDOUT, and my load test @ 100 TPS shows no availability impact to the endpoint under load during the router adds on the fly!!!! Will continue testing edge cases, as well as what it's like calling other existing routes during this kind of load test, but this is likely the best possible outcome I could have hoped for at the moment. You guys rock.

Screenshot results @ the 100 TPS during router changes: newfirsttestsuccess

Edit - not sure on the cpu statement, revoking it.

jeremyjpj0916 commented 5 years ago

Okay, we actually ran a test with ~100 TPS to 2 route endpoints (~200 TPS total at this point), and I started modifying routes again on the fly, adding say 20-30 to an arbitrary service not related to the ones under load (at a rate of about 1 every 2 seconds). I was still able to reproduce some 500 errors, which showed up in about 3 of the 6 worker processes. But they only persisted during the window I was making router changes while load testing, whereas on the first rc4, before the @bungle patch, the errors continued to persist long after the load test was done:

2018/12/13 22:45:01 [error] 153#0: *78094 [lua] responses.lua:128: before(): no router to route request (reason: error attempting to acquire build_router lock: timeout), client: 10.*.*., server: kong, request: "GET /F5/status HTTP/1.1", host: "gateway-dev-**-*.***.com"

2018/12/13 22:45:01 [error] 153#0: *77722 [lua] responses.lua:128: before(): no router to route request (reason: error attempting to acquire build_router lock: timeout), client: 10.*.*., server: kong, request: "GET /api/demo/test HTTP/1.1", host: "gateway-dev-**-*.***.com"

We will try to whip up python scripts for building out the Kong env as well as simulating active changes under load, and attach them here.

Edit visual: newresults500errors

Notice where the TPS of 200s dips and throughput seems impacted (it didn't seem to be in my prior 100 TPS test)? That marks when the router changes started taking place, and where the throughput stabilizes is when I stopped making changes but continued the load testing. Then you can see the blip of 500s, which are all related to the build_router lock error. So it seems errors can still occur, and throughput is impacted, when routes are being modified. The throughput drop is not the most worrisome part for now, but the 500 level errors are still a concern from an SLA standpoint.

Edit - All the 500 level errors, much like those discussed earlier during rc testing, took a TotalLatency of exactly 5 seconds.

rsbrisci commented 5 years ago

@thibaultcha you had asked for a couple of scripts to generate routes/services and to update them to reproduce the failures

PythonCreateKongProxies.zip

There are two scripts in there. They are both pretty similar.

For example, to create 500 test proxies, I'd use this command:

python create_kong_proxies.py localhost:8001 500

To add 50 routes to a service once every 2 seconds to cause failures during an active loadtest, I'd use this (assuming in both cases that localhost:8001 is the root path to my admin API):

python update_proxy_routes.py localhost:8001  12345-my-service-id-54321 50
bungle commented 5 years ago

Yes 5 seconds:

pg_timeout = 5000
cassandra_timeout = 5000

It means that requests were stuck for 5 seconds waiting for the semaphore to be released while another light thread (another request) was trying to build a new router. The router build did not finish within 5 seconds, so no router was given back.

The above scenario happens on all workers on a node.
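
In rough terms the waiting path looks like this (a simplified sketch of the rc4 behavior, not the exact code; the timeout mirrors pg_timeout/cassandra_timeout):

-- simplified sketch: a request waiting on the router rebuild mutex
local timeout = singletons.configuration.cassandra_timeout / 1000  -- 5000 ms -> 5 s

local ok, err = build_router_semaphore:wait(timeout)
if not ok then
  -- after waiting the full 5 seconds the request gives up, which surfaces as
  -- "no router to route request (reason: error attempting to acquire
  -- build_router lock: timeout)" and a 500 to the client
  return nil, "error attempting to acquire build_router lock: " .. err
end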

bungle commented 5 years ago

If you are concerned about the 500s, the obvious workaround is to raise those timeouts, to say 10000. Yes, I don't like it either.

jeremyjpj0916 commented 5 years ago

@bungle Thanks for the clarification. So what is actually causing the semaphore to remain locked for 5 full seconds, or any best guess? My Cassandra database was showing reads in the 100-200ms range at the time, based on some configured alerts I see fire off under load. Is there a reason why pending requests don't just fall back on a stale router instead of having to wait on a rebuild to finish, which as shown can cause 500s if it's slow? I will increase the timeouts and see if that provides any relief in my testing for now. I am also pondering whether this issue would present more frequently at higher levels of concurrent traffic during changes; I assume so.

cassandralatency

I am just talking user-side, high-level right now, and I am not sure what has to occur for this to ever be realized when services/routes are added/edited/deleted. But something along these lines: I make an add or edit to a service or route, a global flag gets set saying a rebuild is in the works in the background, and all workers fall back to a cached stale router and use that until the flag clears. Then all workers leverage the up-to-date router once the queries have run and internal checks confirm the newer router build has completed successfully and is ready to assume the role of guiding proxy traffic correctly. If lots of requests come in changing services/routes during that time, the router rebuild stops, starts from scratch, and rebuilds again, all the while active proxy traffic continues to leverage the cached stale router object. This way, even if someone was hitting the admin API with 1000 TPS of service/route adds/edits/deletes, active proxy traffic takes zero impact during that window, because the stale router/service objects still contain all the cached entries needed to support existing services. (A rough sketch of the idea is below.)
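
Roughly, in sketch form (all names here are made up just to illustrate the idea; this is not Kong's actual internals):

-- hypothetical sketch: always serve the currently cached router, rebuild in the background
local current_router              -- last successfully built router, possibly stale
local rebuild_scheduled = false

local function rebuild_in_background()
  ngx.timer.at(0, function()
    -- assumes a build_router(db) helper that returns the new router object
    local new_router, err = build_router(singletons.db)
    rebuild_scheduled = false
    if new_router then
      current_router = new_router -- swap within this worker once the build succeeds
    else
      ngx.log(ngx.ERR, "router rebuild failed, keeping stale router: ", err)
    end
  end)
end

-- called from a CRUD event handler whenever routes/services change
local function on_route_change()
  if not rebuild_scheduled then
    rebuild_scheduled = true
    rebuild_in_background()
  end
end

-- called per request: never blocks and never 500s while a rebuild is running
local function get_router()
  return current_router
end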

I don't fully understand enough under the hood at the moment to make an accurate assessment of the blockers behind such an implementation (maybe it goes back to Robert saying a worker IPC would help, which OpenResty does not offer yet). Maybe granularity in the cache is the best path forward for the real fix, as opposed to the change-one-element (service/route) -> query/rebuild-all approach that currently occurs under the hood for services and routes. Just something to think about, as I would like to see Kong have the ability to scale to 10,000 or so proxies under 5,000+ TPS load without impact in the near future (while making configuration changes on the fly to reach such a number 😄). I am sure business folk and product owners care most about new features and the like, but the nerd in me finds core improvements that dramatically improve scale and stability, or reduce latency, to be much bigger achievements 😋.

Thanks for all the dialog and the quick turnaround on a patch for what seemed like a bug in rc4 that y'all caught; seems my rambling and screenshots around these issues have at least led to some realizations and improvements in the runloop from those in the know 👍.

Anyways, time for some sleep! -Jeremy

p0pr0ck5 commented 5 years ago

To shed some light on this:

Thanks for the clarification. So what is actually causing the semaphore to remain locked for 5 full seconds, or any best guess?

This is a function of the router rebuild:

https://github.com/Kong/kong/blob/next/kong/runloop/handler.lua#L108

The function logic looks like:

So from this we can see there are at least N queries to the database, where N is the number of Route objects (due to fetching each Route's Service directly, bypassing the cache). For very large C* environments with strict consistency requirements (in which each query takes several dozen milliseconds to complete), we can easily run over a 5 second timeout. BTW, the mutex lock timeout is defined based on the configurable cassandra_timeout value (https://github.com/Kong/kong/blob/next/kong.conf.default#L392).
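
To illustrate the pattern (a simplified sketch only, not the literal handler.lua code), each Route pulled from the paging iterator triggers its own Service lookup against the datastore:

-- simplified sketch of the per-Route Service fetch during a router rebuild
local routes = {}

for route, err in db.routes:each(1000) do              -- ~N/1000 + 1 paged queries
  if err then
    return nil, "could not load routes: " .. err
  end

  -- one additional datastore query per Route, bypassing the cache
  local service, serr = db.services:select(route.service)
  if not service then
    return nil, "could not find service for route " .. route.id .. ": " .. serr
  end

  routes[#routes + 1] = { route = route, service = service }
end
-- total: roughly N + (N / 1000 + 1) round trips to the datastore for N Routes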

bungle commented 5 years ago

@jeremyjpj0916 Would you try this one (on top of rc4): https://github.com/Kong/kong/commit/b996f64f2973baa7ea2bfeb5c248f2756c6170ff

It does not yet pass our test suite, as I think it is triggering some issue in the lua-resty-healthchecks library. But it passes almost all the tests. It is only (edit) a couple of files you need to replace or patch in rc4. Just to know if I am going in the right direction.

jeremyjpj0916 commented 5 years ago

@bungle I should be able to give it a go tomorrow. Looks like you left in a little extra here on the deprecated api resource? Was that intentional?

https://github.com/Kong/kong/blob/0ab98c5de605b1d33ee0384d96f254c078e6848a/kong/runloop/handler.lua#L685

And it has a few different changes compared to my hacked-up one above when I diffed them. Do you think the extra changes you added will help with what I still saw with my handler (the 500 errors)? Or are you just taking a stab at cleaning this up for the rc4 release candidate?

bungle commented 5 years ago

There are new measures to prevent 500s for sure, so that's why it is interesting to see. Also performance-wise. Edit: yes, that was intentional.

jeremyjpj0916 commented 5 years ago

Taking a look at this now, thanks for the further patches! Will update with testing results. I saw you also had a db patch put in place too; wondering if that can help some as well around crud events 😄.

Edit - @bungle Or should I give this: https://github.com/Kong/kong/commit/72200a78a7e305f480b36bf7bddc820279140563

a go on your https://github.com/Kong/kong/tree/feat/router-rebuild-revisited branch, as it's more up to date with your latest commits?

Edit Edit - Looks like there is some extra "version" param getting passed in the latest commit, but otherwise not much difference it seems, so I think either one would work at a glance.

jeremyjpj0916 commented 5 years ago

@bungle Well Mr. Bungle sir, you have indeed done it; right direction achieved. At 100 TPS, 200 TPS, and 450 TPS I was unable to produce 500 level errors from Kong with https://github.com/Kong/kong/commit/72200a78a7e305f480b36bf7bddc820279140563 under load with route changes (that HTTP status error is what we saw on the earlier RCs, whereas on 0.14.1 it revealed itself as status 404 route not found errors). Route changes ran at a rate of 2 per second for the 100 and 200 TPS tests, and I decided to go a bit gentler at 450 TPS, modifying our python script to change Kong at a rate of 1 per 10 seconds. The only behavior that occurs is that throughput drops momentarily on router changes, and CPU tends to climb an extra 30% or so under load while changes are occurring (but drops back after), but that is of little concern to me currently. I know we have been stressing CRUD around route resources, but I hope the same Bungle change magic holds true for service resource CRUD as well (we may validate that one in a later test next week as well @rsbrisci) 😄.

I also thought I would give y'all some numbers on CPU utilization, 0.14.1 vs the RC candidates in general. At 100 TPS on 0.14.1 we see about 0.36 CPU core utilization; at 100 TPS on the RC candidates it looks to be about 0.51, so some increase there, but I would definitely prefer the platform stability. With regard to memory utilization I did not see much change.

Anyways, just another testament to how quickly y'all investigate and drive solutions. What was the turnaround on the rc4 bug report to what seems to be a full Bungle fix? I think 2-3 days? In not very many places do you find such motivated engineers who can dig into the weeds and fix hard problems 👍. Great stuff, and I can take my PTO next week feeling good about 1.0 🎅. Christmas came early! Leaving this open for further discussion for a bit, then we can wrap up this git issue.

bungle commented 5 years ago

Thanks! Btw. there is this line in handler.lua inside get_router:

rebuild_router(router_wait)

You may try to change it to:

rebuild_router(0)

It will give better performance and not queue requests, but more requests get served with the old router until the new one is available. I would use 0 in production...

... You can actually comment that line out, and then router rebuilding doesn't affect requests at all. It is there as a safeguard. Making that value bigger means you get more consistency.

bungle commented 5 years ago

The increase in CPU is probably because now it also informs other workers to start building a new router. But not only that, it also informs other nodes to do it too. So new routers now get built behind the scenes (previously they got built when a request saw that it had a stale router).

jeremyjpj0916 commented 5 years ago

@bungle Interesting stuff. I suppose all this will be squeezed into 1.0 with proper checkouts? How will Kong be handling that setting in question that controls consistency levels of routes and services across worker processes? Is router_wait getting set to the same value as the C* timeout? (sorry, on my phone so I haven't taken a look at the code yet). It would be interesting to poll users and ask what's more important, perf or consistency across worker processes at a moment in time for service/route resources. I would lean towards perf over consistency IF there is the promise it will be eventually consistent within a short duration (say 30 seconds or less; maybe that depends on the # of services and routes). It does open a can of worms I suppose to allow eventual consistency in the implementation, but Cassandra does it well with degrees of settings. Kong could follow suit with an env variable like Kong_Consistency, or whatever name seems appropriate, where people pick strict or eventual, with descriptions of the benefits of both. I would probably pick eventual myself. But then again maybe y'all's direction is to reduce user-needed config options, though it could always be made optional with the default being strict. Something to think about.

thibaultcha commented 5 years ago

@bungle The patch sounds promising. Aside from a few issues to be discussed at PR review time, it also presents the same issues as the previous iteration regarding worker consistency, and that is a regression over the currently expected behavior.

Imagine 3 workers which are rebuilding their router. Worker 0 finishes before workers 1 and 2, and several requests come in at this point. Now depending on which worker handles a request (something that is up to the kernel), a user might observe some requests being successfully proxied and others being rejected, since that worker's router is not up-to-date yet. This is precisely the type of issue that baffled users for a long time before the new cache mechanism/router rebuild kicked in (no complaints since then). Because of timer scheduling, the new implementation is more likely to produce this type of behavior compared to the current implementation.

I could come up with a couple of ways to keep track of router rebuilds across workers (involving the shm, not events), and "flip" to the new router once all workers have a new router ready to kick in (and not before). Thus, at a given time, a request comes in, detects that all routers could be flipped, and from then on, all subsequent requests will either flip to the new router and use it, or use the new router (if the worker already flipped its router).
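
In rough terms, something like this (a sketch of the idea only; all names here are illustrative, not an actual implementation):

-- sketch: track per-version router readiness in a shared dict, flip only when all workers are ready
local shm          = ngx.shared.kong        -- a shared dict visible to every worker on the node
local worker_count = ngx.worker.count()

local current_router                        -- router this worker currently serves
local candidate_router                      -- freshly built router, not flipped yet

-- each worker calls this once it has built its candidate router for `version`
local function mark_ready(version, new_router)
  candidate_router = new_router
  shm:incr("router_ready:" .. version, 1, 0)
end

-- on each request: flip only once every worker has reported a built router
local function get_router(version)
  if candidate_router then
    local ready = shm:get("router_ready:" .. version) or 0
    if ready >= worker_count then
      current_router, candidate_router = candidate_router, nil
    end
  end
  return current_router                     -- until the flip, keep serving the old router
end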

If we don't come up with a solution for this, I am afraid we will be shipping a regression to our users and see those complaints come back. Having eventually consistent workers seems to utterly confuse our users.

I am also slightly concerned by @jeremyjpj0916's mentions of the CPU increase, which seems non-negligible. We can probably improve that as well.

@jeremyjpj0916 Thank you for giving all those patches a try! Many of those questions would probably be answered at PR review, depending on the final version of this patch. A rule of thumb we are enforcing now is that we try to avoid configuration overhead (the introduction of unnecessary configuration options that are obscure to the user). Nonetheless, such a deep change will not be included in 1.0. I pushed a smaller change that does not refactor the router rebuild at #4084, which resolves the 500 errors you observed here (due to router semaphore timeouts). A good understanding of the implications of this can be helpful.

@rsbrisci Thank you for the scripts!


Sorry for providing so little input on my end on this issue. I have been travelling for the past 2 weeks and am now OOO until January. Ultimately, I must say that I am not 100% confident that the original issue observed by @jeremyjpj0916 on 0.14.1 here is resolved.

In short:

p0pr0ck5 commented 5 years ago

I don't think the fallback mechanism introduced in #4098 will actually help alleviate the initial symptoms reported here. In cases where many requests queue up on a single worker and a mutex acquisition fails due to timeout, the net result is that the database stampede and the theorized race conditions in the paging iterator will still manifest, just after a timeout period where all requests are waiting. The best-effort mentality is well-intended and a good addition, but it does not solve the fundamental problem. We should be aiming to solve the underlying cause of router rebuilds taking multiple seconds.

I will highlight that the semaphore does not in fact "struggle" under load, it performs exactly as intended. The problem with 500 responses being sent back to the client is a result of a too-short timeout in conjunction with inefficient query patterns in the rebuild mechanism. Using the Cassandra/Postgres query timeout as the mutex timeout is not correct, as those timeouts should reflect the maximum time for one query to return, while the router rebuild requires N + (N / 1000 + 1) queries, where N is the number of Routes configured in the cluster (with 500 Routes, for example, that is roughly 501 round trips; at a few dozen milliseconds each, the rebuild alone easily exceeds a 5 second timeout). Reducing this query count is the appropriate solution here. Possibly we should consider a separate timeout, but exposing that to users is a nonstarter as @thibaultcha mentioned above, and hard coding it feels incorrect (which is why the current implementation exists, as it was a compromise of sorts).

It should be very straightforward to have the router build function pre-fetch Service objects in order to avoid N trips back to the DB. Whether each router rebuild should individually re-fetch the contents of the table, or whether it should be kept in memory, is an implementation detail discussion (a speed for memory trade off). I'm currently on mobile, but I'll be happy to propose a POC patch that implements this later tonight (around 12 hours or so).

[edit: typo]

jeremyjpj0916 commented 5 years ago

Right, so it sounds like the first patchwork will be the only one going in for 1.0, because it still maintains worker consistency, which is a primary Kong objective (noting that in our environment we were able to reproduce 500 level errors in that state under duress higher than 100 TPS, which was clarified to likely be because the time it takes to iterate over our 500+ service/route pairs exceeds the 5 second C* timeout limit).

It sounds like there may be a shot at getting the best of both worlds (consistency across worker processes AND no potential for failures under any circumstance) with @thibaultcha's suggestion of an shm to keep track of and leverage the new vs stale router during rebuilds. And it sounds like this would come in later, post 1.0, maybe as a minor update.

If I were to play devil's advocate on the "eventually consistent workers seems to utterly confuse our users" point, could the same not be said for what seems to be deemed the proper implementation, which is "I have 10,000 service and route resources Kong team, I added a route and tried to call it programmatically in 2 seconds, why did it throw a 404?" Then the appropriate Kong response would be, "well, the router rebuild is global and takes some time to query all those services and routes, and the worker processes had not yet agreed on the next stable router (by use of consensus through the shm I suppose)." There is always going to be something to clarify to people huh 😄.

I do think that whatever the case going forward, an implementation that does not impact pending requests and pre-existing service/route resources has to be achieved, regardless of how poorly the db (postgres/cassandra) behaves. I like the idea of optimizing how services are fetched and doing local processing on the Kong node over bombarding C* with tons of individual queries, one per service tied to each route (which feels slightly weird anyways because 1 service can have many routes, not 1 route many services, BUT routes are what actually expose the service resource, so at the same time I get it).

I may personally hold onto Bungle's full patchwork to run with the 1.0 release, because I will take a few seconds of consistency differences between worker processes any day over possible failures of any kind. This originating 404 issue was probably the second scare I have had with Kong; hoping to keep my sanity given how fast paced the project is while leveraging Kong internally. Essentially 2019 is the year Kong goes GA for us internally and shows those around us it can actually work; we have been in a beta of sorts with a few production services. I expect 5,000 non-prod proxies total by EOY 2019 (the environment where I see the 404s on 0.14.1 currently has about 700+ proxies). And close to say 1,000 production proxies, with traffic in the 2,500-3,000 TPS range. Hopefully smooth sailing eh (never is the case in tech).

Hope y'all are having a nice weekend, -Jeremy

thibaultcha commented 5 years ago

I don't think the fallback mechanism introduced in #4098 will actually help alleviate the initial symptoms reported here.

That is not what I claimed.

thibaultcha commented 5 years ago

I actually said the exact same thing in my comment above: that the underlying issue was not resolved, and I laid out clearly what each contribution solves and what other steps we could take.

p0pr0ck5 commented 5 years ago

@jeremyjpj0916 I've tossed up https://github.com/Kong/kong/pull/4101 as a POC/review PR. Would you be willing to give this patch a try in your stress testing environment? It's built against the next branch and won't apply cleanly to RC4, but it should be fairly straightforward to manually transpose (let me know if you need a hand with this).

jeremyjpj0916 commented 5 years ago

@p0pr0ck5 I will give it a go on Sunday if it drops into rc4 with minimal changes needed. By the looks of it this perf work should make the runloop more efficient in its db interactions, and maybe even keep the 500 errors away at various TPS with change rates to the router 😄. Fingers crossed!

bungle commented 5 years ago

@jeremyjpj0916 I consider my work now more or less done with #4102 (though waiting for feedback). I know @thibaultcha argued against adding a configuration parameter that is hard to describe, but I think it is really hard to have a one-size-fits-all strategy for this without going very much back to the drawing board and implementing more granular router updates.

So all in all I think PR #4102 covers a lot of what we have discussed:

  1. timers build the router recursively, and they ensure that the db state is not changed during the build
  2. all workers on a node run their own timer whenever routes or services are changed
  3. other members of the cluster are notified of changes and they also update all their workers (though cluster events are a bit slower)
  4. services are cached at the node level, so each worker can access the same services, and the service cache is pre-warmed on node startup

This provides 3 modes: 1. events timer only, 2. request timer and sync fallback, 3. request wait and synchronous fallback

mode 1 (the most performant):

router_timeout = -1

mode 2 (tries harder to update the router by creating a timer on request, or falling back to a synchronous rebuild on request; still very fast, no waiting unless it falls back):

router_timeout = 0

mode 3a (waits just 1 ms and then builds synchronously if the new router is not yet available):

router_timeout = 1

mode 3b (old behavior, waits at most 5 secs and then builds synchronously if the new router is not yet available):

router_timeout = 5000

All the modes have the event timer enabled, but modes 2 and 3 also do request-time rebuilding; it is a bit of luck which one gets to run it (request or event).

I would run either 0 or -1 in production. It never gives a 500 as it always returns some router. If you have time, you can try that branch with the different settings described above. You can apply it directly on top of 1.0.0rc4. I am not sure we can include it in the final 1.0.0 (probably not), but I hope we can include it shortly after, and there your feedback becomes really valuable!

Thanks.

jeremyjpj0916 commented 5 years ago

@bungle It's all great to me due to the flexibility, and as someone who somewhat understands the issue and your approach to solving it, but it does sound like you may have a hard time convincing others internally because:

  1. It does introduce a new configuration option, and users lacking background may choose the wrong config for their needs (the right default would be important here).

  2. It enables eventually consistent worker processes with mode 1 (probably the mode I would choose to run), which seems to be considered a regression because of the confusion users had around that behavior (but maybe with @p0pr0ck5's enhancement to how Kong pulls in service resources, the time it takes for workers to reach consistency would be pretty fast; it would be interesting to quantify how long it realistically takes for say 100/1,000/10,000 proxy resources to reach consistency across the workers in an eventually consistent Kong environment).

  3. Because maybe the best path forward is granularity in service/route CRUD. Obviously that would be a pretty big overhaul, but is it the magic bullet Kong needs to achieve the best scale/perf/stability when CRUD occurs around proxy resources (and, I suppose, to remove the need for new configs and the possibility of eventual consistency with worker processes)? I don't know exactly; maybe it would be a decent chunk of work for little gain. Although for how quickly you whip things up that's like 2-3 nights of pizza+beer+coding eh 😄?

Whatever becomes of your patch work, it's been awesome for us in solving the issues we see. I still struggle to understand how no one else has come across the issue, and that it's not easy to reproduce under realistic conditions while leveraging services and routes on 0.14.1. Maybe most of the community isn't on it yet (I still see posts from people on Kong versions 3-4 major iterations behind 0.14.1). Or maybe it's because not many run a multi-DC C* setup with strict consistency, which adds a little DB latency to the mix under load. Also, with how slowly many Enterprises move, I would think most enterprise customers are still on deprecated API resources, running large Kong nodes with many proxies 🐌.

I will give your new stuff a go though; do note I will be on PTO next week without laptop access, but I can vet it more once I am back in the saddle.

Thanks! -Jeremy

bungle commented 5 years ago

Yes, we will see. While Robert's prefetch does speed things up, I would guess my proposal's caching is one step forward from that, as the cache is shared with all the workers, invalidated automatically, and warmed on node start. Prefetch can also introduce small bugs, as the prefetched service may already be stale when looping over routes. And the approach I took is to update every worker in the cluster, regardless of whether those workers get requests, when we get a new router version event, so when they do get requests they already have an up-to-date router. But well...

jeremyjpj0916 commented 5 years ago

@p0pr0ck5 This is what I came up with for the cleaned-up handler, based on your next branch, that should work with rc4:

-- Kong runloop
--
-- This consists of local_events that need to
-- be ran at the very beginning and very end of the lua-nginx-module contexts.
-- It mainly carries information related to a request from one context to the next one,
-- through the `ngx.ctx` table.
--
-- In the `access_by_lua` phase, it is responsible for retrieving the route being proxied by
-- a consumer. Then it is responsible for loading the plugins to execute on this request.
local ck          = require "resty.cookie"
local meta        = require "kong.meta"
local utils       = require "kong.tools.utils"
local Router      = require "kong.router"
local reports     = require "kong.reports"
local balancer    = require "kong.runloop.balancer"
local mesh        = require "kong.runloop.mesh"
local constants   = require "kong.constants"
local semaphore   = require "ngx.semaphore"
local responses   = require "kong.tools.responses"
local singletons  = require "kong.singletons"
local certificate = require "kong.runloop.certificate"

local kong        = kong
local tostring    = tostring
local tonumber    = tonumber
local sub         = string.sub
local lower       = string.lower
local fmt         = string.format
local sort        = table.sort
local ngx         = ngx
local log         = ngx.log
local ngx_now     = ngx.now
local update_time = ngx.update_time
local subsystem   = ngx.config.subsystem
local unpack      = unpack

local ERR         = ngx.ERR
local WARN        = ngx.WARN
local DEBUG       = ngx.DEBUG

local CACHE_ROUTER_OPTS = { ttl = 0 }
local EMPTY_T = {}

local get_router, build_router
local server_header = meta._SERVER_TOKENS
local _set_check_router_rebuild

local build_router_semaphore

local function get_now()
  update_time()
  return ngx_now() * 1000 -- time is kept in seconds with millisecond resolution.
end

do
  -- Given a protocol, return the subsystem that handles it
  local protocol_subsystem = {
    http = "http",
    https = "http",
    tcp = "stream",
    tls = "stream",
  }

  local router
  local router_version

  local function gather_services(db)
    local services_t = {}

    for service, err in db.services:each(1000) do
      if err then
        return nil, "could not load services: " .. err
      end

      if not services_t[service.id] then
        services_t[service.id] = service
      end
    end

    return services_t
  end

  build_router = function(db, version)
    local routes, i = {}, 0

    -- trade space for time by pre-fetching services in an efficient
    -- manner (paging through them, using the same page size as
    -- as implemented by the Route pager below). this lets us build
    -- the Route <> Service relationship needed by the router with
    -- minimal database traffic
    local services, err = gather_services(db)
    if err then
      return nil, err
    end

    for route, err in db.routes:each(1000) do
      if err then
        return nil, "could not load routes: " .. err
      end

      local service_pk = route.service

      if not service_pk then
        return nil, "route (" .. route.id .. ") is not associated with service"
      end

      -- service_pk is a table of { id = <uuid> }
      local service = services[service_pk.id]
      if not service then
        return nil, "could not find service for route (" .. route.id .. "): " ..
                    err
      end

      local stype = protocol_subsystem[service.protocol]
      if subsystem == stype then
        local r = {
          route   = route,
          service = service,
        }

        if stype == "http" and route.hosts then
          -- TODO: headers should probably be moved to route
          r.headers = {
            host = route.hosts,
          }
        end

        i = i + 1
        routes[i] = r
      end
    end

    sort(routes, function(r1, r2)
      r1, r2 = r1.route, r2.route
      if r1.regex_priority == r2.regex_priority then
        return r1.created_at < r2.created_at
      end
      return r1.regex_priority > r2.regex_priority
    end)

    local err
    router, err = Router.new(routes)
    if not router then
      return nil, "could not create router: " .. err
    end

    if version then
      router_version = version
    end

    singletons.router = router

    return true
  end

  local function check_router_rebuild()
    -- we might not need to rebuild the router (if we were not
    -- the first request in this process to enter this code path)
    -- check again and rebuild only if necessary
    local version, err = singletons.cache:get("router:version",
                                              CACHE_ROUTER_OPTS,
                                              utils.uuid)
    if err then
      log(ngx.CRIT, "could not ensure router is up to date: ", err)
      return nil, err
    end

    if version == router_version then
      return true
    end

    -- router needs to be rebuilt in this worker
    log(DEBUG, "rebuilding router")

    local ok, err = build_router(singletons.db, version)
    if not ok then
      log(ngx.CRIT, "could not rebuild router: ", err)
      return nil, err
    end

    return true
  end

  -- for unit-testing purposes only
  _set_check_router_rebuild = function(f)
    check_router_rebuild = f
  end

  get_router = function()
    local version, err = singletons.cache:get("router:version",
                                              CACHE_ROUTER_OPTS,
                                              utils.uuid)
    if err then
      log(ngx.CRIT, "could not ensure router is up to date: ", err)
      return nil, err
    end

    if version == router_version then
      return router
    end

    -- wrap router rebuilds in a per-worker mutex (via ngx.semaphore)
    -- this prevents dogpiling the database during rebuilds in
    -- high-concurrency traffic patterns
    -- requests that arrive on this process during a router rebuild will be
    -- queued. once the semaphore resource is acquired we re-check the
    -- router version again to prevent unnecessary subsequent rebuilds

    local timeout = 60
    if singletons.configuration.database == "cassandra" then
      -- cassandra_timeout is defined in ms
      timeout = singletons.configuration.cassandra_timeout / 1000

    elseif singletons.configuration.database == "postgres" then
      -- pg_timeout is defined in ms
      timeout = singletons.configuration.pg_timeout / 1000
    end

    -- acquire lock
    local lok, err = build_router_semaphore:wait(timeout)
    if not lok then
      if err ~= "timeout" then
        return nil, "error attempting to acquire build_router lock: " .. err
      end

      log(WARN, "bypassing build_router lock: timeout")
    end

    local pok, ok, err = pcall(check_router_rebuild)

    if lok then
      -- release lock
      build_router_semaphore:post(1)
    end

    if not pok then
      return nil, ok
    end

    if not ok then
      return nil, err
    end

    return router
  end
end

local function balancer_setup_stage1(ctx, scheme, host_type, host, port,
                                     service, route)
  local balancer_data = {
    scheme         = scheme,    -- scheme for balancer: http, https
    type           = host_type, -- type of 'host': ipv4, ipv6, name
    host           = host,      -- target host per `upstream_url`
    port           = port,      -- final target port
    try_count      = 0,         -- retry counter
    tries          = {},        -- stores info per try
    ssl_ctx        = kong.default_client_ssl_ctx, -- SSL_CTX* to use
    -- ip          = nil,       -- final target IP address
    -- balancer    = nil,       -- the balancer object, if any
    -- hostname    = nil,       -- hostname of the final target IP
    -- hash_cookie = nil,       -- if Upstream sets hash_on_cookie
  }

  -- TODO: this is probably not optimal
  do
    local retries = service.retries
    if retries then
      balancer_data.retries = retries

    else
      balancer_data.retries = 5
    end

    local connect_timeout = service.connect_timeout 

    if connect_timeout then
      balancer_data.connect_timeout = connect_timeout

    else
      balancer_data.connect_timeout = 60000
    end

    local send_timeout = service.write_timeout 

    if send_timeout then
      balancer_data.send_timeout = send_timeout

    else
      balancer_data.send_timeout = 60000
    end

    local read_timeout = service.read_timeout 

    if read_timeout then
      balancer_data.read_timeout = read_timeout

    else
      balancer_data.read_timeout = 60000
    end
  end

  ctx.service          = service
  ctx.route            = route
  ctx.balancer_data    = balancer_data
  ctx.balancer_address = balancer_data -- for plugin backward compatibility
end

local function balancer_setup_stage2(ctx)
  local balancer_data = ctx.balancer_data

  do -- Check for KONG_ORIGINS override
    local origin_key = balancer_data.scheme .. "://" ..
                       utils.format_host(balancer_data)
    local origin = singletons.origins[origin_key]
    if origin then
      balancer_data.scheme = origin.scheme
      balancer_data.type = origin.type
      balancer_data.host = origin.host
      balancer_data.port = origin.port
    end
  end

  local ok, err, errcode = balancer.execute(balancer_data, ctx)
  if not ok and errcode == 500 then
    err = "failed the initial dns/balancer resolve for '" ..
          balancer_data.host .. "' with: " .. tostring(err)
  end

  return ok, err, errcode
end

-- in the table below the `before` and `after` is to indicate when they run:
-- before or after the plugins
return {
  build_router     = build_router,

  -- exported for unit-testing purposes only
  _set_check_router_rebuild = _set_check_router_rebuild,

  init_worker = {
    before = function()
      reports.init_worker()

      -- initialize local local_events hooks
      local db             = singletons.db
      local dao            = singletons.dao
      local cache          = singletons.cache
      local worker_events  = singletons.worker_events
      local cluster_events = singletons.cluster_events

      -- get a cache key using either the old or new DAO
      local function get_cache_key(data, entity)
        if data.schema.table then
          return dao[data.schema.table]:entity_cache_key(entity)
        else
          return db[data.schema.name]:cache_key(entity)
        end
      end

      -- events dispatcher

      worker_events.register(function(data)
        if not data.schema then
          log(ngx.ERR, "[events] missing schema in crud subscriber")
          return
        end

        if not data.entity then
          log(ngx.ERR, "[events] missing entity in crud subscriber")
          return
        end

        -- invalidate this entity anywhere it is cached if it has a
        -- caching key

        local cache_key = get_cache_key(data, data.entity)

        if cache_key then
          cache:invalidate(cache_key)
        end

        -- if we had an update, but the cache key was part of what was updated,
        -- we need to invalidate the previous entity as well

        if data.old_entity then
          cache_key = get_cache_key(data, data.old_entity)
          if cache_key then
            cache:invalidate(cache_key)
          end
        end

        if not data.operation then
          log(ngx.ERR, "[events] missing operation in crud subscriber")
          return
        end

        -- public worker events propagation

        local entity_channel           = data.schema.table or data.schema.name
        local entity_operation_channel = fmt("%s:%s", entity_channel,
                                             data.operation)

        -- crud:routes
        local _, err = worker_events.post_local("crud", entity_channel, data)
        if err then
          log(ngx.ERR, "[events] could not broadcast crud event: ", err)
          return
        end

        -- crud:routes:create
        _, err = worker_events.post_local("crud", entity_operation_channel, data)
        if err then
          log(ngx.ERR, "[events] could not broadcast crud event: ", err)
          return
        end
      end, "dao:crud")

      -- local events (same worker)

      worker_events.register(function()
        log(DEBUG, "[events] Route updated, invalidating router")
        cache:invalidate("router:version")
      end, "crud", "routes")

      worker_events.register(function(data)
        if data.operation ~= "create" and
           data.operation ~= "delete"
        then
          -- no need to rebuild the router if we just added a Service
          -- since no Route is pointing to that Service yet.
          -- ditto for deletion: if a Service if being deleted, it is
          -- only allowed because no Route is pointing to it anymore.
          log(DEBUG, "[events] Service updated, invalidating router")
          cache:invalidate("router:version")
        end
      end, "crud", "services")

      worker_events.register(function(data)
        -- assume an update doesnt also change the whole entity!
        if data.operation ~= "update" then
          log(DEBUG, "[events] Plugin updated, invalidating plugins map")
          cache:invalidate("plugins_map:version")
        end
      end, "crud", "plugins")

      -- SSL certs / SNIs invalidations

      worker_events.register(function(data)
        log(DEBUG, "[events] SNI updated, invalidating cached certificates")
        local sn = data.entity

        cache:invalidate("certificates:" .. sn.name)
      end, "crud", "snis")

      worker_events.register(function(data)
        log(DEBUG, "[events] SSL cert updated, invalidating cached certificates")
        local certificate = data.entity

        for sn, err in db.snis:each_for_certificate({ id = certificate.id }) do
          if err then
            log(ERR, "[events] could not find associated snis for certificate: ",
                     err)
            break
          end

          cache:invalidate("certificates:" .. sn.name)
        end
      end, "crud", "certificates")

      -- target updates

      -- worker_events local handler: event received from DAO
      worker_events.register(function(data)
        local operation = data.operation
        local target = data.entity
        -- => to worker_events node handler
        local ok, err = worker_events.post("balancer", "targets", {
          operation = data.operation,
          entity = data.entity,
        })
        if not ok then
          log(ERR, "failed broadcasting target ",
              operation, " to workers: ", err)
        end
        -- => to cluster_events handler
        local key = fmt("%s:%s", operation, target.upstream.id)
        ok, err = cluster_events:broadcast("balancer:targets", key)
        if not ok then
          log(ERR, "failed broadcasting target ", operation, " to cluster: ", err)
        end
      end, "crud", "targets")

      -- worker_events node handler
      worker_events.register(function(data)
        local operation = data.operation
        local target = data.entity

        -- => to balancer update
        balancer.on_target_event(operation, target)
      end, "balancer", "targets")

      -- cluster_events handler
      cluster_events:subscribe("balancer:targets", function(data)
        local operation, key = unpack(utils.split(data, ":"))
        -- => to worker_events node handler
        local ok, err = worker_events.post("balancer", "targets", {
          operation = operation,
          entity = {
            upstream = { id = key },
          }
        })
        if not ok then
          log(ERR, "failed broadcasting target ", operation, " to workers: ", err)
        end
      end)

      -- manual health updates
      cluster_events:subscribe("balancer:post_health", function(data)
        local pattern = "([^|]+)|([^|]+)|([^|]+)|([^|]+)|(.*)"
        local ip, port, health, id, name = data:match(pattern)
        port = tonumber(port)
        local upstream = { id = id, name = name }
        local ok, err = balancer.post_health(upstream, ip, port, health == "1")
        if not ok then
          log(ERR, "failed posting health of ", name, " to workers: ", err)
        end
      end)

      -- upstream updates

      -- worker_events local handler: event received from DAO
      worker_events.register(function(data)
        local operation = data.operation
        local upstream = data.entity
        -- => to worker_events node handler
        local ok, err = worker_events.post("balancer", "upstreams", {
          operation = data.operation,
          entity = data.entity,
        })
        if not ok then
          log(ERR, "failed broadcasting upstream ",
              operation, " to workers: ", err)
        end
        -- => to cluster_events handler
        local key = fmt("%s:%s:%s", operation, upstream.id, upstream.name)
        ok, err = cluster_events:broadcast("balancer:upstreams", key)
        if not ok then
          log(ERR, "failed broadcasting upstream ", operation, " to cluster: ", err)
        end
      end, "crud", "upstreams")

      -- worker_events node handler
      worker_events.register(function(data)
        local operation = data.operation
        local upstream = data.entity

        -- => to balancer update
        balancer.on_upstream_event(operation, upstream)
      end, "balancer", "upstreams")

      cluster_events:subscribe("balancer:upstreams", function(data)
        local operation, id, name = unpack(utils.split(data, ":"))
        -- => to worker_events node handler
        local ok, err = worker_events.post("balancer", "upstreams", {
          operation = operation,
          entity = {
            id = id,
            name = name,
          }
        })
        if not ok then
          log(ERR, "failed broadcasting upstream ", operation, " to workers: ", err)
        end
      end)

      -- initialize balancers for active healthchecks
      ngx.timer.at(0, function()
        balancer.init()
      end)

      do
        local err

        build_router_semaphore, err = semaphore.new()
        if err then
          log(ngx.CRIT, "failed to create build_router_semaphore: ", err)
        end

        build_router_semaphore:post(1)
      end
    end
  },
  certificate = {
    before = function(_)
      certificate.execute()
    end
  },
  rewrite = {
    before = function(ctx)
      ctx.KONG_REWRITE_START = get_now()
      mesh.rewrite()
    end,
    after = function(ctx)
      ctx.KONG_REWRITE_TIME = get_now() - ctx.KONG_REWRITE_START -- time spent in Kong's rewrite_by_lua
    end
  },
  preread = {
    before = function(ctx)
      local router, err = get_router()
      if not router then
        log(ERR, "no router to route connection (reason: " .. err .. ")")
        return ngx.exit(500)
      end

      local match_t = router.exec(ngx)
      if not match_t then
        log(ERR, "no Route found with those values")
        return ngx.exit(500)
      end

      local var = ngx.var

      local ssl_termination_ctx -- OpenSSL SSL_CTX to use for termination

      local ssl_preread_alpn_protocols = var.ssl_preread_alpn_protocols
      -- ssl_preread_alpn_protocols is a comma separated list
      -- see https://trac.nginx.org/nginx/ticket/1616
      if ssl_preread_alpn_protocols and
         ssl_preread_alpn_protocols:find(mesh.get_mesh_alpn(), 1, true) then
        -- Is probably an incoming service mesh connection
        -- terminate service-mesh Mutual TLS
        ssl_termination_ctx = mesh.mesh_server_ssl_ctx
        ctx.is_service_mesh_request = true
      else
        -- TODO: stream router should decide if TLS is terminated or not
        -- XXX: for now, use presence of SNI to terminate.
        local sni = var.ssl_preread_server_name
        if sni then
          ngx.log(ngx.DEBUG, "SNI: ", sni)

          local err
          ssl_termination_ctx, err = certificate.find_certificate(sni)
          if not ssl_termination_ctx then
            ngx.log(ngx.ERR, err)
            return ngx.exit(ngx.ERROR)
          end

          -- TODO Fake certificate phase?

          ngx.log(ngx.INFO, "attempting to terminate TLS")
        end
      end

      -- Terminate TLS
      if ssl_termination_ctx and not ngx.req.starttls(ssl_termination_ctx) then -- luacheck: ignore
        -- errors are logged by nginx core
        return ngx.exit(ngx.ERROR)
      end

      ctx.KONG_PREREAD_START = get_now()

      local route = match_t.route or EMPTY_T
      local service = match_t.service or EMPTY_T
      local upstream_url_t = match_t.upstream_url_t

      balancer_setup_stage1(ctx, match_t.upstream_scheme,
                            upstream_url_t.type,
                            upstream_url_t.host,
                            upstream_url_t.port,
                            service, route)
    end,
    after = function(ctx)
      local ok, err, errcode = balancer_setup_stage2(ctx)
      if not ok then
        return responses.send(errcode, err)
      end

      local now = get_now()

      -- time spent in Kong's preread_by_lua
      ctx.KONG_PREREAD_TIME     = now - ctx.KONG_PREREAD_START
      ctx.KONG_PREREAD_ENDED_AT = now
      -- time spent in Kong before sending the request to upstream
      -- ngx.req.start_time() is kept in seconds with millisecond resolution.
      ctx.KONG_PROXY_LATENCY   = now - ngx.req.start_time() * 1000
      ctx.KONG_PROXIED         = true
    end
  },
  access = {
    before = function(ctx)
      -- ensure routers are up-to-date
      local cache = singletons.cache

      -- router for Routes/Services

      local router, err = get_router()
      if not router then
        return responses.send_HTTP_INTERNAL_SERVER_ERROR(
          "no router to route request (reason: " .. err ..  ")")
      end

      -- routing request

      local var = ngx.var

      ctx.KONG_ACCESS_START = get_now()

      local match_t = router.exec(ngx)
      if not match_t then
        return responses.send_HTTP_NOT_FOUND("no route found with those values")
      end

      local route              = match_t.route or EMPTY_T
      local service            = match_t.service or EMPTY_T
      local upstream_url_t     = match_t.upstream_url_t

      local realip_remote_addr = var.realip_remote_addr
      local forwarded_proto
      local forwarded_host
      local forwarded_port

      -- X-Forwarded-* Headers Parsing
      --
      -- We could use $proxy_add_x_forwarded_for, but it does not work properly
      -- with the realip module. The realip module overrides $remote_addr and it
      -- is okay for us to use it in case no X-Forwarded-For header was present.
      -- But in case it was given, we will append the $realip_remote_addr that
      -- contains the IP that was originally in $remote_addr before realip
      -- module overrode that (aka the client that connected us).

      local trusted_ip = kong.ip.is_trusted(realip_remote_addr)
      if trusted_ip then
        forwarded_proto = var.http_x_forwarded_proto or var.scheme
        forwarded_host  = var.http_x_forwarded_host  or var.host
        forwarded_port  = var.http_x_forwarded_port  or var.server_port

      else
        forwarded_proto = var.scheme
        forwarded_host  = var.host
        forwarded_port  = var.server_port
      end

      local protocols = route.protocols
      if (protocols and protocols.https and not protocols.http and forwarded_proto ~= "https") then
        ngx.header["connection"] = "Upgrade"
        ngx.header["upgrade"]    = "TLS/1.2, HTTP/1.1"
        return responses.send(426, "Please use HTTPS protocol")
      end

      balancer_setup_stage1(ctx, match_t.upstream_scheme,
                            upstream_url_t.type,
                            upstream_url_t.host,
                            upstream_url_t.port,
                            service, route)

      ctx.router_matches = match_t.matches

      -- `uri` is the URI with which to call upstream, as returned by the
      --       router, which might have truncated it (`strip_uri`).
      -- `host` is the original header to be preserved if set.
      var.upstream_scheme = match_t.upstream_scheme -- COMPAT: pdk
      var.upstream_uri    = match_t.upstream_uri
      var.upstream_host   = match_t.upstream_host

      -- Keep-Alive and WebSocket Protocol Upgrade Headers
      if var.http_upgrade and lower(var.http_upgrade) == "websocket" then
        var.upstream_connection = "upgrade"
        var.upstream_upgrade    = "websocket"

      else
        var.upstream_connection = "keep-alive"
      end

      -- X-Forwarded-* Headers
      local http_x_forwarded_for = var.http_x_forwarded_for
      if http_x_forwarded_for then
        var.upstream_x_forwarded_for = http_x_forwarded_for .. ", " ..
                                       realip_remote_addr

      else
        var.upstream_x_forwarded_for = var.remote_addr
      end

      var.upstream_x_forwarded_proto = forwarded_proto
      var.upstream_x_forwarded_host  = forwarded_host
      var.upstream_x_forwarded_port  = forwarded_port
    end,
    -- Only executed if the `router` module found a route and allows nginx to proxy it.
    after = function(ctx)
      local var = ngx.var

      do
        -- Nginx's behavior when proxying a request with an empty querystring
        -- `/foo?` is to keep `$is_args` an empty string, hence effectively
        -- stripping the empty querystring.
        -- We overcome this behavior with our own logic, to preserve user
        -- desired semantics.
        local upstream_uri = var.upstream_uri

        if var.is_args == "?" or sub(var.request_uri, -1) == "?" then
          var.upstream_uri = upstream_uri .. "?" .. (var.args or "")
        end
      end

      local balancer_data = ctx.balancer_data
      balancer_data.scheme = var.upstream_scheme -- COMPAT: pdk

      local ok, err, errcode = balancer_setup_stage2(ctx)
      if not ok then
        return responses.send(errcode, err)
      end

      var.upstream_scheme = balancer_data.scheme

      do
        -- set the upstream host header if not `preserve_host`
        local upstream_host = var.upstream_host

        if not upstream_host or upstream_host == "" then
          upstream_host = balancer_data.hostname

          local upstream_scheme = var.upstream_scheme
          if upstream_scheme == "http"  and balancer_data.port ~= 80 or
             upstream_scheme == "https" and balancer_data.port ~= 443
          then
            upstream_host = upstream_host .. ":" .. balancer_data.port
          end

          var.upstream_host = upstream_host
        end
      end

      local now = get_now()

      -- time spent in Kong's access_by_lua
      ctx.KONG_ACCESS_TIME     = now - ctx.KONG_ACCESS_START
      ctx.KONG_ACCESS_ENDED_AT = now
      -- time spent in Kong before sending the request to upstream
      -- ngx.req.start_time() is kept in seconds with millisecond resolution.
      ctx.KONG_PROXY_LATENCY   = now - ngx.req.start_time() * 1000
      ctx.KONG_PROXIED         = true
    end
  },
  balancer = {
    before = function(ctx)
      local balancer_data = ctx.balancer_data
      local current_try = balancer_data.tries[balancer_data.try_count]
      current_try.balancer_start = get_now()
    end,
    after = function(ctx)
      local balancer_data = ctx.balancer_data
      local current_try = balancer_data.tries[balancer_data.try_count]

      -- record try-latency
      local try_latency = get_now() - current_try.balancer_start
      current_try.balancer_latency = try_latency

      -- record overall latency
      ctx.KONG_BALANCER_TIME = (ctx.KONG_BALANCER_TIME or 0) + try_latency
    end
  },
  header_filter = {
    before = function(ctx)
      local header = ngx.header

      if not ctx.KONG_PROXIED then
        return
      end

      local now = get_now()
      -- time spent waiting for a response from upstream
      ctx.KONG_WAITING_TIME             = now - ctx.KONG_ACCESS_ENDED_AT
      ctx.KONG_HEADER_FILTER_STARTED_AT = now

      local upstream_status_header = constants.HEADERS.UPSTREAM_STATUS
      if singletons.configuration.enabled_headers[upstream_status_header] then
        header[upstream_status_header] = tonumber(sub(ngx.var.upstream_status or "", -3))
        if not header[upstream_status_header] then
          log(ERR, "failed to set ", upstream_status_header, " header")
        end
      end

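      -- if cookie-based hashing is in use, make sure the hash cookie is set
      -- on the response so the client keeps hitting the same target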
      local hash_cookie = ctx.balancer_data.hash_cookie
      if not hash_cookie then
        return
      end

      local cookie = ck:new()
      local ok, err = cookie:set(hash_cookie)

      if not ok then
        log(ngx.WARN, "failed to set the cookie for hash-based load balancing: ", err,
                      " (key=", hash_cookie.key,
                      ", path=", hash_cookie.path, ")")
      end
    end,
    after = function(ctx)
      local header = ngx.header

      if ctx.KONG_PROXIED then
        if singletons.configuration.enabled_headers[constants.HEADERS.UPSTREAM_LATENCY] then
          header[constants.HEADERS.UPSTREAM_LATENCY] = ctx.KONG_WAITING_TIME
        end

        if singletons.configuration.enabled_headers[constants.HEADERS.PROXY_LATENCY] then
          header[constants.HEADERS.PROXY_LATENCY] = ctx.KONG_PROXY_LATENCY
        end

        if singletons.configuration.enabled_headers[constants.HEADERS.VIA] then
          header[constants.HEADERS.VIA] = server_header
        end

      else
        if singletons.configuration.enabled_headers[constants.HEADERS.SERVER] then
          header[constants.HEADERS.SERVER] = server_header

        else
          header[constants.HEADERS.SERVER] = nil
        end
      end
    end
  },
  body_filter = {
    after = function(ctx)
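      -- ngx.arg[2] is the eof flag: it is only true for the last chunk of the
      -- response body, so the timings below are recorded exactly once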
      if not ngx.arg[2] then
        return
      end

      local now = get_now()
      ctx.KONG_BODY_FILTER_ENDED_AT = now

      if ctx.KONG_PROXIED then
        -- time spent receiving the response (header_filter + body_filter)
        -- we could use $upstream_response_time but we need to distinguish the waiting time
        -- from the receiving time in our logging plugins (especially ALF serializer).
        ctx.KONG_RECEIVE_TIME = now - ctx.KONG_HEADER_FILTER_STARTED_AT
      end
    end
  },
  log = {
    after = function(ctx)
      reports.log()

      if not ctx.KONG_PROXIED then
        return
      end

      -- If response was produced by an upstream (ie, not by a Kong plugin)
      -- Report HTTP status for health checks
      local balancer_data = ctx.balancer_data
      if balancer_data and balancer_data.balancer and balancer_data.ip then
        local ip, port = balancer_data.ip, balancer_data.port

        local status = ngx.status
        if status == 504 then
          balancer_data.balancer.report_timeout(ip, port)
        else
          balancer_data.balancer.report_http_status(ip, port, status)
        end
      end
    end
  }
}

Will try dropping in just this changed handler, give it a quick stress test, and edit this comment with the results.
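
One thing that may help while stress testing: since the error rate looks like exactly one of the six workers is affected, temporarily stamping the worker PID onto every response would let the 404s be correlated to a specific worker. A minimal sketch of that idea (our own debugging tweak, not part of the handler above), dropped into the header_filter.after hook:

      -- DEBUG ONLY (hypothetical, remove after testing): expose which nginx
      -- worker served the response so 404s can be tied to a single worker
      ngx.header["X-Kong-Worker-PID"] = ngx.worker.pid()

If every 404 carries the same PID while the 200s come back from all PIDs, that would back up the theory that a single worker is holding a stale router.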