Kong / kong

🦍 The Cloud-Native API Gateway and AI Gateway.

https://konghq.com/install/

Apache License 2.0

39.38k stars 4.83k forks source link

Kong 1.0 "5K resource challenge/Plugin+Router map build" findings/errors(Pending Bungle PR fixes this) #4194

Closed jeremyjpj0916 closed 5 years ago

jeremyjpj0916 commented 5 years ago

Summary

Made it my weekend mission to vet Kong 1.0 as hard as I could in our real sandbox environment. Many awesome PRs have brought great improvements based on earlier reports. I decided to crank it up to a final test, I build out my Sandbox Kong to:

5k service/route pairs. Each with oauth2 + acl with a whitelist entry (so 2 plugins per route). 5k consumers each with an acl whitelist group added and a pair of oauth2/jwt creds

Additional PRs I included in my Kong 1.0 patched version: 4102 4179 4178 4177 4182

I did so with a little script in a loop calling the admin API from a remote client. Of the resources being built out I would say maybe 5 times total a few 500 level error responses occurred directly from the Admin API calls or from my lb health check pings during the execution of the admin api calls, figured I would list some of those here if they bring any insight or room for improvements:

First error came from my standard 5 second health-check ping calls against the gateway while under resource creation load:

2019/01/13 21:07:05 [crit] 35#0: *1415187 [lua] init.lua:639: rewrite(): could not ensure plugins map is up to date: failed to rebuild plugins map: [cassandra] could not execute page query: [Server error] java.lang.NullPointerException, client: 10.xxx.x.x, server: kong, request: "GET /F5/status HTTP/1.1", host: "gateway-dev-core-dc2.company.com"

Any way to add retry/hardening logic around this, strange seeing a null pointer exception being thrown(from C itself considering is a logged java error?). Easy to say well C caused it not Kong's problem, but is it possible Kong could safely and effectively ascend to attempt to remediate low level db query failures between client->server?

Another one looked like this:

2019/01/13 22:15:06 [error] 38#0: *1518092 [kong] init.lua:86 could not execute page query: [Server error] java.lang.NullPointerException, client: 10.xxx.xx.x, server: kong_admin, request: "GET /acls?offset=EKEtFbKFb0ThvMhuC02YnisA8H%2F%2F%2Bb8A HTTP/1.1", host: "kong:8001"

Worker-event failed to post due to no memory and another timed out? Is the first an exhaustion of the Lua VM memory? My node as a whole is only using 1.2GB of 5GB mem so seems a bit odd to me, any potential for internal remediation around failed worker-events or ones that may have timed out, I don't see any kong configurations available to the conf/env variables that lets us configure the timeout on worker event data(or what is the seemingly baked in wait for worker event data)?:

2019/01/13 22:50:10 [error] 37#0: *1569611 [lua] events.lua:254: post(): worker-events: failed posting event "healthy" by "lua-resty-healthcheck [test_upstream]"; no memory, context: ngx.timer
2019/01/13 22:50:10 [error] 34#0: *1569615 [lua] events.lua:345: post(): worker-events: dropping event; waiting for event data timed out, id: 34317, client: 127.0.0.1, server: kong_admin, request: "PUT /routes/76bb1716-14d0-46fc-a4d3-9b6441e4fa6f HTTP/1.1", host: "localhost:8001"
2019/01/13 22:50:10 [error] 38#0: *1569619 [lua] events.lua:345: poll(): worker-events: dropping event; waiting for event data timed out, id: 34317, context: ngx.timer
2019/01/13 22:50:10 [error] 36#0: *1569621 [lua] events.lua:345: poll(): worker-events: dropping event; waiting for event data timed out, id: 34317, context: ngx.timer
2019/01/13 22:50:10 [error] 35#0: *1569623 [lua] events.lua:345: poll(): worker-events: dropping event; waiting for event data timed out, id: 34317, context: ngx.timer
2019/01/13 22:50:11 [error] 33#0: *1569632 [lua] events.lua:345: poll(): worker-events: dropping event; waiting for event data timed out, id: 34317, context: ngx.timer
2019/01/13 22:50:11 [error] 37#0: *1569634 [lua] events.lua:345: poll(): worker-events: dropping event; waiting for event data timed out, id: 34317, context: ngx.timer

Another interesting error, seems almost like my C* node failed the ssl handshake one time?:

2019/01/14 00:16:07 [crit] 36#0: *1704447 [lua] init.lua:520: ssl_certificate(): could not ensure plugins map is up to date: failed to rebuild plugins map: [cassandra] could not execute page query: [Server error] java.lang.NullPointerException, context: ssl_certificate_by_lua*, client: 10.xxx.x.x, server: 0.0.0.0:8443
2019/01/14 00:16:07 [crit] 36#0: *1704446 SSL_do_handshake() failed (SSL: error:1417A179:SSL routines:tls_post_process_client_hello:cert cb error) while SSL handshaking, client: 10.xxx.x.x, server: 0.0.0.0:8443

Overall I have been impressed with Kong under this 5k stress challenge. I slightly assumed all sort of timers and rebuilding logic would fail due to resource growth but I am not seeing that behavior with the PR's applied and also leveraging -1 on the router perf pr(so I suppose I am partly leveraging async router building). Proxying still happens quickly and my chaos test that kept adding and deleting a service/route pair while the node was under load worked fine.

Feel free to close this whenever. I just thought to relay what behavior I saw. Not too shabby by any means! I think core Kong is getting really close to hitting all my check boxes in recent weeks 👍 . 5k resources with good perf against a real db cluster w network latency should meet 99% of the worlds needs around serving as a gateway imo. Maybe one day in a year or two I will give the 10k challenge a spin 😆.

Additional Details & Logs

Kong version 1.0 with pending/merged perf PRs

bungle commented 5 years ago

Worker-event failed to post due to no memory

This certainly refers to this: https://github.com/Kong/kong/blob/master/kong/templates/nginx_kong.lua#L35

Maybe we'll need to adjust that a bit(?) (or find a way not to adjust it)

bungle commented 5 years ago

I think I could write a similar proposal to plugin_map_wrapper as I have done with that "router rebuild revisited". I check if I have time for that. Just a proof-of-concept type of pr. We need to discuss this internally what we want to do with those unmerged proposals, and this kind of feedback is really valuable for that. It will most probably endup in 1.2 or possibly 1.1.

bungle commented 5 years ago

@jeremyjpj0916 not sure if this has any effects, probably less so that the router rebuild thing, but still I made it to plugins map too: https://github.com/Kong/kong/pull/4202

It can possibly fix the:

2019/01/13 21:07:05 [crit] 35#0: *1415187 [lua] init.lua:639: rewrite(): could not ensure plugins map is up to date: failed to rebuild plugins map: [cassandra] could not execute page query: [Server error] java.lang.NullPointerException, client: 10.xxx.x.x, server: kong, request: "GET /F5/status HTTP/1.1", host: "gateway-dev-core-dc2.company.com"

The patch is very similar to router rebuild timer thing, so you might want to try:

router_timeout = -1
plugins_map_timeout = -1

Or start Kong with KONG_ROUTER_TIMEOUT=-1 KONG_PLUGINS_MAP_TIMEOUT=-1 kong start.

The #4202 includes router rebuild timer too as it is based on top of that.

jeremyjpj0916 commented 5 years ago

@bungle If I get some time here next few weeks I will give the "5k challenge" another go from an empty keyspace. Will let you know how the results go and if it help alleviate the "failed to rebuild plugins map" errors. Realistically 5k proxies/consumers/acls is awhile away for us, but I rather get a head start and try to point out all these problems much ahead of time so Kong can potentially harden implementations as we grow. Personally I think your PRs around the router rebuilding(and now plugins) have been an elegant solution that solves the near term issue until Kong intends to dedicate engineering time to potentially get more granular around its resource rebuilds and harden the core up further. Excited to see how yall progress the product through 2019! Thx.

jeremyjpj0916 commented 5 years ago

Also without the PR you did above I did try deleting a ton of stuff pretty quickly via the admin api(4k of my 5k resources each). that seemed to be able to produce TONS of those failed to build plugin errors like so(probably because of how fast I am deleting things with 5k resources, but when you write programatic code it would be great if Kong can handle it gracefully updating the plugins fully in background when it has the soonest available opportunity safely (and probably your PR will help!)):

2019/01/17 05:47:04 [crit] 34#0: *1724704 [lua] init.lua:639: rewrite(): could not ensure plugins map is up to date: failed to rebuild plugins map: [cassandra] could not execute page query: [Server error] java.lang.NullPointerException, client: 10.131.0.1, server: kong, request: "DELETE /admin-api/routes/0a07b0fb-80d7-491e-82aa-ef2c38cae035 HTTP/1.1", host: "gateway-dev-core-elr.optum.com"

jeremyjpj0916 commented 5 years ago

@bungle I decided to try to give your router rebuild and plugins rebuild a go on 1.0.2 with -1 hardcoded for both timeout settings. Its starting to get hard to follow the PR's #4202 #4102 against the master branch now so if you ever get a moment, or to potentially combine the PRs and rebase on top of current 1.0.2 it would help me ensure I made the right changes if it would not take to much effort. I just don't wanna screw up taking your fairly complicated PR's to prod incorrectly 😆, I feel like I did do it correctly though .

My current files are as such being : handler.txt init.txt

Going to run that through the ringer on 1.0.2 and see what results we get here.

Note of the deviations in above files are that I continued to use the singletons reference in the handler, no assert(db.close()) in init and the above includes hishamhm's fix for PUT plugins not getting built into the plugins map fix for 1.0.2.

Other things I am pondering, are since both plugins and routes are global maps that require full rebuilds why not make the 2 configurable options

router_timeout
plugins_map_timeout

One field? I believe most who leverage Kong add plugins alongside routes and services so they scale up in a similar manner for the most part(an would it really hurt on plugin map rebuilds if they didn't use plugins as often as routes? I would not think so). Maybe like

global_resources_map_timeout (too long I don't like this name really. but something of that nature)?

I still believe that I can't safely run Kong and modify on the fly under any decent throughput without leveraging your PRs that allow timer based changes that have fallback router/plugins map for C* deploys that have full replication (to support the oauth2 plugin). Would you concur or think there is still uncertainty from the investigations that took place in #4055? Debate raging on internally on if it should go into Kong or that there is a better way lol?

Will follow up with results of my testing later this weekend.

thibaultcha commented 5 years ago

@jeremyjpj0916 With 1.0.1 and 1.0.2, plus other internal priorities being worked on these days, we have delayed reviews of the router PRs until 1.1.0, for which our current timetable currently points at the second half of February. We’ll look at them soon :)

jeremyjpj0916 commented 5 years ago

@thibaultcha Thanks for the insight. I will provide some feedback based on how things go with my testing of it in our dev/stage/production environments as well. Its certainly a tricky business to make Kong work for every type of deploy and use case out there(let alone any app that has live configuration changes on the fly under load), but so far I have not been let down 😄 . Looking forward to second half of February then, keep rocking.

bungle commented 5 years ago

@jeremyjpj0916,

yes, I also thought about those two config options, and I agree that as a whole they are for discussion (and real pr should combine them). I don't love adding those config parameters, but couldn't figure out another way (the current implementation use that db timeout which we could use here, but that feels wrong). Maybe something intelligent and auto adjusting.

Though I think we can take short term goals and long term ones separately. My biased view on my PRs is that they are just evolutionary. I will continue rebasing them on latest next branch, thanks for mentioning that they are now slighly conflicting.

jeremyjpj0916 commented 5 years ago

Did some chaos testing and analysis, My current sandbox has about 1,700 service/route pairs and about 3,400 plugins(2 per route/service). Local 3 node C* cluster per dc with full replication factor over TLS.

            - name: KONG_CASSANDRA_PORT
              value: '9042'
            - name: KONG_CASSANDRA_KEYSPACE
              value: kong_dev
            - name: KONG_CASSANDRA_TIMEOUT
              value: '5000'
            - name: KONG_CASSANDRA_SSL
              value: 'on'
            - name: KONG_CASSANDRA_SSL_VERIFY
              value: 'on'
            - name: KONG_CASSANDRA_USERNAME
              value: ****
            - name: KONG_CASSANDRA_PASSWORD
              value: ****
            - name: KONG_CASSANDRA_CONSISTENCY
              value: LOCAL_QUORUM
            - name: KONG_CASSANDRA_LB_POLICY
              value: RequestDCAwareRoundRobin
            - name: KONG_CASSANDRA_LOCAL_DATACENTER
              value: DC2
            - name: KONG_DB_UPDATE_FREQUENCY
              value: '5'
            - name: KONG_DB_UPDATE_PROPAGATION
              value: '5'
            - name: KONG_DB_CACHE_TTL
              value: '0'
            - name: KONG_DNS_HOSTSFILE
              value: /etc/hosts
            - name: KONG_DNS_STALE_TTL
              value: '4'
            - name: KONG_DNS_NOT_FOUND_TTL
              value: '30'
            - name: KONG_DNS_ERROR_TTL
              value: '1'
            - name: KONG_DNS_NO_SYNC
              value: 'off'
            - name: KONG_LUA_SSL_TRUSTED_CERTIFICATE
              value: /usr/local/kong/ssl/kongcert.pem
            - name: KONG_LUA_SSL_VERIFY_DEPTH
              value: '3'
            - name: KONG_LUA_SOCKET_POOL_SIZE
              value: '30'
            - name: KONG_CASSANDRA_CONTACT_POINTS
              value: >-
                server00466,server00467,server00468,server00469,server00470,server00471

When under load hitting an existing route, what I did was around adding a new service/route pair and then deleting both every 5 seconds then waiting another 5 seconds and adding it back. This caused some minimal 404 impact but nothing in stdout indicating router rebuild issues which I found strange:

Sadly when I threw another wrench into the chaos testing it got a bit worse, the adding of the ACL plugin to the route that keeps getting added and deleted then problems begin to arise more heavily. The current logic handler I commented above is the one in reference to error logs:

2019/01/19 23:41:45 [crit] 35#0: *668449 [lua] handler.lua:395: could not asynchronously rebuild plugins map: could not load plugins: [cassandra] could not execute page query: [Server error] java.lang.NullPointerException, context: ngx.timer, client: 127.0.0.1, server: 0.0.0.0:8001
--
  | 2019/01/19 23:42:18 [crit] 35#0: *677001 [lua] handler.lua:395: could not asynchronously rebuild plugins map: could not load plugins: [cassandra] could not execute page query: [Server error] java.lang.NullPointerException, context: ngx.timer, client: 127.0.0.1, server: 0.0.0.0:8001
  | 2019/01/19 23:42:29 [crit] 37#0: *680020 [lua] handler.lua:395: could not asynchronously rebuild plugins map: could not load plugins: [cassandra] could not execute page query: [Server error] java.lang.NullPointerException, context: ngx.timer, client: 127.0.0.1, server: 0.0.0.0:8001
  | 2019/01/19 23:42:40 [crit] 34#0: *682922 [lua] handler.lua:395: could not asynchronously rebuild plugins map: could not load plugins: [cassandra] could not execute page query: [Server error] java.lang.NullPointerException, context: ngx.timer, client: 127.0.0.1, server: 0.0.0.0:8001
  | 2019/01/19 23:42:45 [crit] 38#0: *684343 [lua] handler.lua:395: could not asynchronously rebuild plugins map: could not load plugins: [cassandra] could not execute page query: [Server error] java.lang.NullPointerException, context: ngx.timer

Code that is hitting:

    local pok, ok, err = pcall(build_plugins_map, version, true)
    if not pok or not ok then
      log(CRIT, "could not asynchronously rebuild plugins map: ", ok or err)
    end

I could start correlating these error outputs above to 404 errors being thrown too on proxy calls.

6 worker processes, I was load testing an existing route at 120 TPS. Guess even -1 async setting did not stop the problem occurring. I thought the existing PRs fall back on an old router/plugins map too so the 404's were surprising to me.

At a high level I think there has to be a mechanism in place to ensure Kong stability for existing routes/plugins vs newly added resources no matter the time to execute/instability Kong and the datastore may have with one another and at any resource scale. I fear the existing proposals don't accomplish this based on current testing I did this weekend, more so when plugin changes come into play because plugins count will most likely be much higher than route count IMO. To do granular auth on the gateway you need an acl + oauth2/jwt/hmac/key/etc per route so 2 plugins to a route(sometimes more plugins if you want granular rate limiting, custom plugins ect.). So plugins map becomes more of a problem even more so than route rebuilding. I think moving away from the design of CRUD 1 element rebuild all is the only answer for a solid core foundation to the Kong app personally, but will await February findings, if yall have anything else you want me to vet or give a go I will be happy to(I may give a bit of solutioning a go as well in the handler too. Watching yalls PRs has given me a little confidence poking around core and some techniques to go off.)

bungle commented 5 years ago

Yes, this is an error that we could try to retry. Or you could set the timeout (of these two new config params) to 0 or 1 instead of -1 for a request time fallback. With async 404 is expected for short period of time. It'll be interesting to find out why this error happens in a first place (I think it is really bad sing for Cassanda to throw Null Pointer Exception and this should be reported to them too, to get a more descriptive errors. Really feels a bug in Cassandra or its driver, as even when called wrongly, it should not null pointer.

jeremyjpj0916 commented 5 years ago

Right, I am up at night here thinking this is more than just a Kong problem honestly too, these NPE's from C are strange and I can't say I am well versed in C, I will check all my C* node logs to see if I see anything to go off of there too(Wondering if yall have seen these sorts of errors in the wild too?). I will give the 0/1 a go to see if any changes. I am scaling down my Kong env in sandbox too to find the exact "breaking point" Interestingly enough as I was deleting more routes/plugins off my node the async plugins map was failing to rebuild but as my resources count got lower and lower those logs eventually stopped popping up. So in current design there must be some sort of breaking point too and I intend to find where that is within our own little sandbox. Plus I have MLK day off so plenty of time to play on this 😆 .

Edit I may be onto something here doods, Cassandra is somehow getting rekt:

WARN  [ReadStage-3] 2019-01-20 00:23:13,165 ReadCommand.java:555 - Read 1000 live rows and 2013 tombstone cells for query SELECT * FROM kong_dev.plugins WHERE token(id) > token(08f058e9-e20
3-4d7c-8217-697817f8427a) LIMIT 1000 (see tombstone_warn_threshold)
WARN  [ReadStage-1] 2019-01-20 00:23:13,259 ReadCommand.java:555 - Read 666 live rows and 2300 tombstone cells for query SELECT * FROM kong_dev.routes WHERE partition = routes AND id > 9762
ef09-d21b-4f48-98aa-c12d8ded2cf6 LIMIT 1000 (see tombstone_warn_threshold)
ERROR [Native-Transport-Requests-6] 2019-01-20 00:23:13,325 QueryMessage.java:129 - Unexpected error during query
java.lang.NullPointerException: null
ERROR [Native-Transport-Requests-6] 2019-01-20 00:23:13,325 ErrorMessage.java:384 - Unexpected exception during request
java.lang.NullPointerException: null
WARN  [ReadStage-8] 2019-01-20 00:23:13,343 ReadCommand.java:555 - Read 666 live rows and 2300 tombstone cells for query SELECT * FROM kong_dev.routes WHERE partition = routes AND id > 9762
ef09-d21b-4f48-98aa-c12d8ded2cf6 LIMIT 1000 (see tombstone_warn_threshold)
WARN  [ReadStage-8] 2019-01-20 00:23:13,344 ReadCommand.java:555 - Read 666 live rows and 2300 tombstone cells for query SELECT * FROM kong_dev.routes WHERE partition = routes AND id > 9762
ef09-d21b-4f48-98aa-c12d8ded2cf6 LIMIT 1000 (see tombstone_warn_threshold)
WARN  [ReadStage-1] 2019-01-20 00:23:13,491 ReadCommand.java:555 - Read 1000 live rows and 3265 tombstone cells for query SELECT * FROM kong_dev.routes WHERE partition = routes LIMIT 1000 (
see tombstone_warn_threshold)
WARN  [ReadStage-2] 2019-01-20 00:23:13,594 ReadCommand.java:555 - Read 665 live rows and 2299 tombstone cells for query SELECT * FROM kong_dev.routes WHERE partition = routes AND id > 977d
745e-edb2-41a6-9906-046d8a7b327f LIMIT 1000 (see tombstone_warn_threshold)

Anyone got friends at Apache that can make Cassandra better lol... I wonder IF this error with C* was fixed then maybe your pending PRs would be fully effective. Cassandra version 3.11.2 is what we have been running. I wonder if there is a later iteration that fixes this: "ERROR [Native-Transport-Requests-6] 2019-01-20 00:23:13,325 QueryMessage.java:129 - Unexpected error during query java.lang.NullPointerException: null"

And then I stumble on this: https://jira.apache.org/jira/browse/CASSANDRA-14444 , I could try 3.11.3 minor version upgrade lol...

bungle commented 5 years ago

I will provide you rebased and a more robust version of this on monday. E.g. a one that retries with -1 on errors. We could also ignore error if the plugin map or router version has changed. Perhaps a limited number of tries with some sleep between the tries to overcome that NPE (which is perhaps a connectivity issue or a problem with page offset if a subsequent page query goes to different node).

jeremyjpj0916 commented 5 years ago

I don't want my dialog and rambles take away from your work priorities mate, you have helped a ton as is around this problem. I am just trying to give as much detail on this github issue thread as I can for what I am seeing so when yall do look to perf the core in Feb you have ammo and real world metrics/errors(because ultimately we have the same wants here, best Kong core implementation possible). Based on what I think currently, your impl would work(probably set the timeouts to 0 like you said because 404's on existing configs would be bad even if brief) if C was not occasionally throwing NPE on Kong client queries, the stance to re-query on a a specific failed query NPE may help as they seem to happen somewhat randomly(and maybe baked in Kong db retry is a nice hardening feature), but I think fixing an underlying C problem is on us not Kong. All our C nodes are very vanilla install. I will talk to some folk internally who know C a bit better and see what they think of those C* logs I gave above.

bungle commented 5 years ago

Yes no problem, it is going to be a small change, that doesn't really affect my priorities.

About 404.

With -1 when new route is added, it will not really be there until async timer finishes building a new one, and workers get this through worker events. In case of cluster there is also cluster events to be considered, which work based on polling (by defaut 5 secs). The bad thing about this is that if there is an error, a new one will not be build until there are changes to routes or plugins. So that's why I try to make more robust.

0 setting, the router and plugin map can still lead to 404 as the only difference is that now we check router and plugin map version also on request, but 0 still means that a new ones are built async, request can start a rebuilding timer. So this can add tiny bits to request. Good thing is that it kinda has the autorebuild, so each new request can be a start of new router or plugins map build. But as it will not make request to wait for a new one, it will use old until a new ones are ready. So can still lead to 404. Personally I think that's a good compromise. 404s should go away in a few seconds (depending how long it takes to build a new, it should not take a long or there is a bigger problem with db).

Now values greater than 0 will make sync rebuild during request. So everything that -1 and 0 does, happens here still, but if the request cannot get latest router after waiting at max that value amount of time, the request will build a router and plugins map on its own. The request will be stuck during that period and so it adds latency to request. You could set this to 1 so it means that request will just wait 1 ms for background timer rebuilds, and after that either use finished timer router or start to build a new one. This can of course make db situation worse as there can be now many many concurrent rebuilds accross the cluster. Yes this on the other hand ensures you a more up to date router and plugins map, but there is a price to be paid for such.

So -1 = only background, no impact on request, request use old until new ones are available

0 = same as above but if there wasa problem in creating background timer or background timer errored with e.g. NPE, new timer can also be created on request time (at most 1 timer per router and 1 timer per plugins map will run on each worker).

then > 0 = request will build new router sync, even when it cannot acquire a lock to build one exclusively per worker. We still run background timers as in 0, so it there is a great chance the router or plugins map doesn't need to be rebuilt during request.

Greater the number, greater the effect on request and possibly more load on db.

jeremyjpj0916 commented 5 years ago

I think I may have not clarified the chaos test actions I took to produce the the 404 behavior so let me clarify a bit:

Had the 1,700 existing routes. one of the was /api/loadtest. This is the endpoint I loadtested @ 120 TPS against a single Kong node.

The chaos testing I did is creating/deleting with 5 second waits in between a new service/route(/api/chaostest) pair with a plugin(ACL was my choice) on it every time, not ever invoking it to see if it existed, but seeing if the impact of a new "proxy" resource impacted older existing proxy resources("/api/loadtest"). The old route is the one that threw the 404s during the chaos under load("/api/loadtest").

To take a step back for a minute and make sure I understand the original issue I will restate the original problem(dating back to 0.14.1 too):

Kong's current design is to rebuild the global router/plugins map during the middle of the context of a given proxy transaction. This is problematic because

A.) Kong impacts an existing proxy request holding it up for the "build" to complete, likely causing it to error out or have a major delay if Kong has a decent number of routes/plugins to build as it takes time.

B.)Maybe a bigger problems that seem to occur are most the workers too unable to rebuild properly while under load(I don't know for sure if load is relevant for why it fails(I personally think so), could likely also be due to the C* errors I posted earlier as well?), where after the change is made many workers end up out of sync with one another. Some workers throwing 404's not found on a multitude of older routes. The current way to bring it all back right is to redeploy Kong and let it start back up pulling in all resources, at that point no 404's present itself until you start Create Update Delete on routes/plugins again.

I thought your PRs goals were to enable the ability to:

Rebuild based on timer and not in flight on a given proxy transaction when it opportunistically can when set to a certain *_timout value -1,0...
Leverage an old router / plugins maps that while the rebuild was in progress(through a mutex or flag) to ensure existing traffic through the gateway is not impacted by addition of newer resources.

Adding onto 2. ^ I would say also that if we see an NPE or db error during rebuild, continuing to leverage an old router/plugins map is also a necessity to avoid impact to the stability of Kong(I think this mission critical to stability, we can't let one bad db query in a route map/plugin map call ruin tons of gateway proxy calls).

Based on me seeing the 404's(and my explanation of what I was doing to produce the 404s) I would say the 2nd point is not functioning as expected(or not implemented, maybe I was confused on your goals with your PR).

bungle commented 5 years ago

@jeremyjpj0916,

I think you understood goals correct. -1 (has no impact on request and leverages old router and plugins map), 0 (has minimal impact on request and leverages old router and plugins map, only starts a timer if needed). Positive values are very close to current 1.0.2 behavior (but there is still the background timer and those worker events and cluster events — which ever picks it first on a worker: a) background timer or b) request).

The problem with -1 and 0 causing 404 is possible the issue with NPE. Background timer did not restart or retry it on such errors, so the new router or plugin maps was not built at all, or maybe it was on some workers that successfully did it, but maybe not all. And new build would only start when there are more modifications to routes/services and plugins. Also important thing to note is that not all workers are at same state exactly at same time even within same node with -1 and 0, and more so with cluster.

So what I did this morning is:

I rebased #4102 on top of very latest next
I rebased #4202 on top of the #4102

That was not all, I made rebuilding a lot more robust, to handle any kind of errors in timers. How I did it?

On build_router and build_plugins_map on errors we check if version is still correct, if not we just ignore error and try again.
If the version was still current, but there was error, we try at most 4 more times (the code can error in total 5 times). And before we try again, we do a small sleep, the sleep is count like this: 0.01 try try, e.g. on first error it will sleep 0.01 secs before trying again, one 5th try it will sleep 250 ms before trying again.

Also on timers I now call kong.db:connect and kong.db:setkeepalive(on errors the code also tries to switch the connection before trying again). You reported that this helped a lot in init, so I guess it will help a lot in this case too.

You can just grab this branch: https://github.com/Kong/kong/tree/feat/plugins_map_revisited

And try it (it has both router and plugins maps work on it).

jeremyjpj0916 commented 5 years ago

@bungle You have given me something to poke at today, will give it a go here on my day off 👍, thx . Does it not make sense to add a failsafe that if Kong still fails after max retries/db connection switches for workers to continue leveraging old router and plugins map for stability purposes(and go through trying to rebuild when another Create/Update/Delete occurs)?

Edit: thinking about it I guess doing a failsafe could also cause some workers to run out of date old router /plugins map while workers that succeeded are on a new one. This would be problematic, but not as catastrophic as workers failing to route older established routes which was current behavior(imo old working proxies are more important than newly created proxies in a production environment).

bungle commented 5 years ago

So yes, we do have failsafe. If we cannot get new router to build, the old stable one gets used. That 5 times try was just something as a limit. As I think if you cannot build one with 5 tries, it can tell a problem elsewhere (database down)? Maybe after 5 tries we can create another timer that runs after 1 minute. But I think you can use 0 for that too.

jeremyjpj0916 commented 5 years ago

Awesome gotcha, yeah if it can fall back on old router / plugins map during failures and retry failures that would be ideal. Fingers crossed this works! Tomorrow I will be discussing the C* errors we see internally too, NPEs should never occur with any application, Cassandra included but maybe even with those errors your new retry/db connection logic will overcome those sporadic NPEs. Thanks again, I will report back with findings later today on just testing the new logic!

bungle commented 5 years ago

@jeremyjpj0916 yes, at first check if C errors are from Driver (I guess it is from Cassandra itself, as we are using @thibaultcha's driver that has no Java in it) or C. I do have a idea, e.g. it could be that for some reason, the invalid offset may cause this. But I am not 100% sure.

And I made final touches to those #4102 and #4202 (the last one has both).

The main change was these two lines:

cache:invalidate("router:version")
-- and
cache:invalidate("plugins_map:version")

I changed to:

cache:invalidate_local("router:version")
-- and
cache:invalidate_local("plugins_map:version")

Should be even better now, and I think I can get Travis green on these.

jeremyjpj0916 commented 5 years ago

Just got back from lunch out. Trying all this stuff right now, the day I can configure a new proxy while under load without breaking our stage and now prod will be a beautiful sight 😆 . Potentially C*'s fault and certainly Kong core has room for stability improvements as you have been tackling but I feel like I chose the right gateway to leverage with how diligent you have been trying to remedy the core around issues we see. Should have results in an hour or 2!

jeremyjpj0916 commented 5 years ago

Will update as I dig into things:

Testing the gateway with minimal resources now maybe 10 total service/route pairs, doing my chaos test of adding and removing a service/route with 5 seconds wait(I am not calling the new route) while loadtesting the older existing service/route pair /api/loadtest path @ 120 TPS. This gets thrown in the logs pretty constantly:

Currently testing setting "0" for timeout as recommended(as opposed to -1).

Dropping in the "/feat/plugins_map_revisited" kong/init.ua + kong/runloop/handler.lua changes hardcoded to 0 on handler.

2019/01/21 20:11:23 [notice] 37#0: *37327 [lua] handler.lua:137: lock_router(): timeout attempting to acquire router lock, client: 10.131.0.1, server: kong, request: "POST /api/loadtest HTTP/1.1", host: "gateway-dev-core-dc2.company.com"

As far as expected HTTP statuses and successful tx's no issues just yet, seems solid:

Edit - Upped the stress by now having the chaos test also add a ACL + OAUTH2 plugin to the route(what we normally do when setting up a new proxy anyways). Doing so now reveals the timeout on acquiring plugins lock as well as router lock:

2019/01/21 20:29:09 [notice] 34#0: *68858 [lua] handler.lua:152: lock_plugins_map(): timeout attempting to acquire plugins map lock, client: 10.130.4.1, server: kong, request: "POST /api/loadtest HTTP/1.1", host: "gateway-dev-core-dc2.company.com"

2019/01/21 20:29:09 [notice] 34#0: *68858 [lua] handler.lua:137: lock_router(): timeout attempting to acquire router lock, client: 10.130.4.1, server: kong, request: "POST /api/loadtest HTTP/1.1", host: "gateway-dev-core-dc2.company.com"

These are getting thrown out super frequently to stdout constantly(70 per second maybe?).

Update - Even with these getting thrown out constantly Kong seems to be tanking away, no errors produced yet just because of the timeout locks to my existing route under load while a new service/route/plugins pairs are generated against gateway(last 5 mins of my calls):

Interestingly enough even after stopping the load test the lock timeouts still get thrown but not as frequently (just from my general ping endpoint and from the chaos test script posting to admin api every 5 seconds or so):

2019/01/21 20:41:32 [notice] 33#0: *264064 [lua] handler.lua:152: lock_plugins_map(): timeout attempting to acquire plugins map lock, context: ssl_certificate_by_lua*, client: 10.129.2.1, server: 0.0.0.0:8443

 2019/01/21 20:41:32 [notice] 33#0: *264066 [lua] handler.lua:152: lock_plugins_map(): timeout attempting to acquire plugins map lock, context: ssl_certificate_by_lua*, client: 10.130.4.1, server: 0.0.0.0:8443

 2019/01/21 20:41:32 [notice] 34#0: *264068 [lua] handler.lua:152: lock_plugins_map(): timeout attempting to acquire plugins map lock, context: ssl_certificate_by_lua*, client: 10.130.4.1, server: 0.0.0.0:8443

 2019/01/21 20:41:47 [notice] 33#0: *264384 [lua] handler.lua:152: lock_plugins_map(): timeout attempting to acquire plugins map lock, context: ssl_certificate_by_lua*, client: 10.128.2.1, server: 0.0.0.0:8443

 2019/01/21 20:41:48 [notice] 34#0: *264406 [lua] handler.lua:152: lock_plugins_map(): timeout attempting to acquire plugins map lock, context: ssl_certificate_by_lua*, client: 10.129.2.1, server: 0.0.0.0:8443

2019/01/21 20:41:48 [notice] 36#0: *264425 [lua] handler.lua:152: lock_plugins_map(): timeout attempting to acquire plugins map lock, context: ssl_certificate_by_lua*, client: 10.131.0.1, server: 0.0.0.0:8443

2019/01/21 20:41:48 [notice] 33#0: *264428 [lua] handler.lua:152: lock_plugins_map(): timeout attempting to acquire plugins map lock, context: ssl_certificate_by_lua*, client: 10.131.0.1, server: 0.0.0.0:8443
 
2019/01/21 20:41:48 [notice] 34#0: *264432 [lua] handler.lua:152: lock_plugins_map(): timeout attempting to acquire plugins map lock, client: 10.128.4.1, server: kong, request: "DELETE /admin-api/services/86907135-bcb3-407a-9d85-6919ac2ea383 HTTP/1.1", host: "gateway-dev-core-dc2.company.com"
 
2019/01/21 20:41:55 [notice] 37#0: *264565 [lua] handler.lua:152: lock_plugins_map(): timeout attempting to acquire plugins map lock, client: 10.129.2.1, server: kong, request: "GET /F5/status HTTP/1.1", host: "gateway-dev-core-dc2.company.com"

jeremyjpj0916 commented 5 years ago

Ok, next test after adding 500 service/route/plugins/consumers on the gateway(note, adding service/route/plugins/consumers and the creds threw the lock notice for routes/plugins basically every call but no admin api calls failed, all ended up succeeding to 500 resources). Then will add the chaos test calls and loadtest my existing /api/loadtest route:

Findings are identical, no problems handling changes with 500 service/route pairs with plugins while loadtesting an older route @ 120 TPS. Just more of the timeout locks thrown constantly.

Next up 1,500 resources then chaos testing around it. will update this post with info

bungle commented 5 years ago

@jeremyjpj0916,

Great and detailed reports as always from you. All log entries have been "notices", so not a lot to worry about. I'll adjust logging. Acquiring lock with 0 as timeout is expected to timeout a lot on heavy load. You can now also try -1 as it is a lot more robust now. It will silent the notices too.

jeremyjpj0916 commented 5 years ago

@bungle I try my best, not the brightest on core Kong coding yet, so I try to provide as much clarity as I can in a more QA kinda way 😄 . Looking good on the 0 setting and 1,500 resources with chaos testing too. I am going to change back to -1 now as well as you mentioned because its very chatty in my stdout seeing all those notices I can't follow to see if any true errors are being reported, I assume if I hit your error retry logic something will make the logs and I want to be sure to capture that if it occurs to log in this github issue.

Edit - yeah on -1 at 1,500 resources no errors now and my older existing route is doing its thing with all 200 successes under load! Maybe your last PR was the magic we needed. I really am starting to wonder if the db keepalive and tweaks you did made the difference hmm....

Next test up: 3,500 resources, then finally 5k.

Edit - As I am building out the routes/services/consumers/auths all the way up to 3,500 I can see in the local DC some decent CPU getting leveraged hah:

Edit - Updates on chaos testing at 3,500 routes/services/consumers and such. NO ISSUES 🚀 !!! Even using the admin API to create all of those resources I have seen not a single error yet either(where before when setting up the 5K resources it failed a few times). The bunglemania is real, Kong core might be at that sweet spot of stability around change! Giving 5k a go next!!!

jeremyjpj0916 commented 5 years ago

Okay now at 5k resources we have our results, running at -1 config while under a chaos test adding route+service pair with acl/oauth2 plugin on route and load testing my existing proxy route:

Basically the 404 issue has disappeared all together even at 5k resources, the admin api worked flawlessly all the way to 5k resources(I have never had that happen in prior attempts). And the only strange thing is NGINX is saying a small % of these transactions almost exactly .25% on the dot are Layer 4 connection timeout errors 504(with retries set to 0 on the service too):

2019/01/22 04:28:20 [error] 38#0: *1133561 upstream timed out (110: Operation timed out) while connecting to upstream, client: 10.129.2.1, server: kong, request: "POST /api/loadtest HTTP/1.1", upstream: "http://server00493:9393/ ", host: "gateway-dev-core-dc2.company.com"

You can see how it stopped on the 2 second mark with a connection timeout(what we had our service configured to timeout on for connection):

As soon as I stopped the chaos test of route/service/plugin creation and deleting(while continuing to maintain my loadtest), then the 504 reported L4 timeouts ceased immediately (how strange is this? Nginx genuinely reporting L4 timeouts kinda sporadically a few every 5-15 seconds but we know it correlates directly with when I execute the chaos test, maybe right when the post/deletes of the admin API calls are kicking off?, one to mull over a bit, its definitely not a real L4 connection error on no fault of the upstream imo... This did not occur during my 3,500 resources test while under chaos changing so maybe something at larger scale does it?)

I think it may be good to know at the core what changed from past attempt to now for internal thoughts since this last one really did it for me running on C* cluster, lets review:

Retries on failure(If you log anything error status on retries I never failed then, because I watched stdout the whole time this test was running). So we know ultimately this was not what improved behavior. I think its a good feature to have though so I would recommend leaving some variant of it in(if in fact I should have seen retry errors if failures occurred)
Local invalidations on the plugin/router maps, which I know less about the discrepancy or reasoning behind it but documenting here nonetheless 👍-
```
cache:invalidate_local("router:version")
-- and
cache:invalidate_local("plugins_map:version")
```
"Also on timers I now call kong.db:connect and kong.db:setkeepalive(on errors the code also tries to switch the connection before trying again). You reported that this helped a lot in init, so I guess it will help a lot in this case too."

If I left anything off feel free to clarify, for some reason my gut is telling me again the thing that probably caused the most improvement was point 3, but I can't be for certain without stripping away your changes 1 by 1 until the problems start back up.

And IF 3 was the reason for the better stability, does something about how Kong C client hops from node to node on paging these intensive queries potentially cause issue by nature(which I imagine 3 improves it because it causes keepalive to a specific C node its using and leverages it for the full paging query "set" till completion?), I know you also mentioned skepticism around offsets being a culprit too. At least these are some of the thoughts my mind is taking wondering how we got here 😃 .

But over all I am wowed, @bungle you did it man, a fully stable router/plugin rebuild up to 5,000 fully generated proxies/consumers. Pair that with your Kong startup time improvements and you have really been killing it on making Kong core work for the masses, these are the kinda improvements I think make major milestones and ultimately make the product a success, down in the trenches fixing hard problems. I think its important to get core right before moving horizontally with tons of new features/integration patterns or it may get harder to fix things like this in future due to coupling and integrations. Luckily most of Kong is pluggable so hopefully that issue won't befall Kong but the point still holds true generally for dev imo.

Leaving this open for now for further discussion but feel free to close when desired. I am going to close my other issue #4055 now too since this one kinda morphed into what that discussion once was.

Edit - Just programatically deleted all 5k resources using the Admin API without a problem too, the point outs I can give to improvements have no end haha. I can say another interesting point is CPU during my programmatic resource creation got really high over the extended minutes it took to build all those resources out, like taking no actual proxy traffic but just making those admin api calls in a loop from a remote host to make all 5k resources my inbound network said traffic was at 10,000kbps( I don't believe it lol), and CPU peaked up to 2 cores. As soon as the creation was done CPU dropped like a hammer back to like .02 CPU utilized lol. Now programmatic deletions of everything with admin api don't seem to be growing the CPU much at all.

bungle commented 5 years ago

@jeremyjpj0916,

That's wonderful results. I agree with you, that maintaining database connection on rebuilding is probably the biggest improvement. I have had this in my mind for sometime, especially after getting your feedback. I feel that with Cassandra the paging is only reliable when you page through the same Cassandra node. I would say that NPE's can be something that you execute another page on different node that is not yet consistent with the node the first page was executed, or perhaps it is already ahead of time of that first node. So this leads to this NPE. I think Cassandra should give something else than NPE, e.g. invalid offset or something as NPE is usually a cause of assumption gone wrong (the assumption doesn't hold on bad input -> external input is not correctly verified).

So yes, when you said that adding :connect to init made me think about this more, and then I thought, well, if that is that good on init, it could perhaps lead to great progress on rebuilding router and plugins map too. And it seems like it does.

The 504 and CPU load may seem to have some correlation. The thing here is that changes with these PR's get propagated to all workers and all cluster nodes. Which means that If you have 5 nodes, with 5 workers on each, there will be about 25 router or plugins map rebuilds running at the same time (of course if the numbers are higher, the situation is worse, but in that case you probably have many cassandra nodes, some exclusive to some nodes). In many cases the configs (routes, services and plugins) change quite rarely, but if they do change, say every half a second, this can cause some pressure. Is that a real-world scenario, I am not sure. The next improvement could be that there is only one worker rebuilding on the node, and new router is given through shm to others. But I think that's something I don't want to tackle just yet as it makes these PRs rather big (they already are quite big).

What we have learned from this conversation, can also be good in some other places in core. E.g. timers that make more than one DB query, can benefit from :connect. Queries that are paging through the database can benefit from :connect (one time I had a patch that kept the connection alive through the whole request cycle, but that went too far, and can drain the connection pool).

If these patches are not going to get in, I think it can only mean that someone builds something even better (probably a bigger change, e.g. to more granularity, though that can be challenging to make it consistent at the same time — miss one event, error on another, and you probably are in inconsistent state).

BTW. we are currently making first steps on declarative configs, which will lead to db-less mode (it's going to have some limitations and benefits).

jeremyjpj0916 commented 5 years ago

@bungle agree on basically all points here. Let me clarify one thing in what I saw. The heavy CPU saturation was when creating our all 5k resources in a loop that it reached 2+ CPU(Out of 3 allotted). And that was all that I was doing during that window, no load testing or anything else, I suspect something around the locks and rebuilds kind of builds up when you are constantly heavily adding to Kong from a client against the admin api building out lots of resources programatically, because as soon as the program finished building them all that built up CPU % that was climbing getting eaten up dropped to nothing.

Now onto the 504's I don't really recall heavy CPU when running the chaos create 1 delete 1 route/service/plugin while running the loadtest @ 120 TPS against /api/loadtest with 5k resources that produced the small 504 %'s(that I did not see occurring at 3.5k resources under load with chaos), which makes me think maybe something in flight with 5k+ resources causes a blip of sorts sometimes in the worker processes based on the set service connect timeout and proxying, hard to say.

Regardless very promising results, hats off to you taking the time and having the passion to improve the product in such a big way. This is going to our Stage and Prod very shortly as its in dire need.

Thanks!

bungle commented 5 years ago

Yes, heavy creation on such entities causes:

local worker events to be send to all workers on node that made the change
cluster event to be sent to other nodes, which will cause a worker events to be sent on all workers on receiving cluster node

Now making a lot of that will pile up the events (cluster and worker). I have some ideas for that (perhaps not utilizing worker events or cluster events on this).

jeremyjpj0916 commented 5 years ago

Been running in Stage for a bit now, there may be something in this newer design that eats and holds onto mem somewhat, not sure what took place under the hood on these jumps(new resource etc, something related to CPU increase on the first and second jump for sure) but you can see it sort of step ladders upward. Just something to think about if yall do intend to take it to production:

Also per worker GC numbers:

I have noticed "normal" is generally 50,000.00 KB for me or less when GC is functioning properly without mem leaks.

Can add this in your template to verify locally:

      location /gc {
        content_by_lua_block {
            ngx.say(string.format("Worker %d: GC size: %.3f KB", ngx.var.pid, collectgarbage("count")))
        }
      }

If I get any more details around what situation causes the mem jumps I will let yah know!

Edit - Started a loadtest with no changing of resources and so far no mem increases just under plain load so thats good.

Edit - Just gave creating a service/route/and a few plugins and deleting them a go, no jump like that.... mystery as to what does it then.. may not even be related to this PR potentially.

bungle commented 5 years ago

@jeremyjpj0916 I can push another commit (perhaps next week) to these so that they will utilize recurring timer for rebuilding rather than events. It will be more like polling, e.g. every second we check if router or plugins_map version has changed. That at least makes piling of the events go away. I'll ping you when I have pushed that.

jeremyjpj0916 commented 5 years ago

@bungle this is your masterpiece so if you believe that is the best strategy I am all for it. If polling is not an intense check and will not cause impact every second sounds fine to me. So far running this in prod and stage about the last 10+ days no issues with the current design as is besides maybe some slow mem growth possibly(but does not seem tied to network throughput increases, nor could I tie it to resource creation churn either, a mystery to me where the growth comes from but its not enough in a week to bother me at all):

Prod metrics on a given host this past week(we do rolling weekly redeploys Monday mornings:

bungle commented 5 years ago

The idea of polling is to just check the version of router/plugins map (so polling is just about checking one variable). Something we do currently per request. So that is what it is "polling" — not like making db request every second. Then if it is not current, it will start a rebuild. The main advantage of this is that this does not pile up events.

jeremyjpj0916 commented 5 years ago

@bungle Sounds like it will scale better with more nodes etc. too than the event build up probably. I am up for testing it in our sandbox through to prod if you get it out in a week or so.I would then roll it up with our Prod deploy of 1.0.3 in 2 weeks or so we are planning for(some internal backlog things we are working on right now delaying that 1.0.3 upgrade).

Thx!

bungle commented 5 years ago

@jeremyjpj0916

Both PR's are now rebased to latest next branch. In addition I did the change that I was speaking. E.g. now there is recurring timer (every second) on each worker that checks router:version and plugins_map:version, and if needed starts a rebuild.

4102 contains router only

4202 contains router and plugins maps (this is what you want)

Let me know if there are any benefits with that.

jeremyjpj0916 commented 5 years ago

@bungle I will review the code and give it a go sometime over the next week or two and get back to you(just got back from vacation last few days and have a little catching up to do). Thanks for continuing to improve/progress your implementations around the original reported problems in this thread 👍 .

jeremyjpj0916 commented 5 years ago

@bungle Okay I gave 1.03 a go today with your router/plugin map rebuild via polling and am happy to report similar stability is achieved much like the cluster event rebuild method you chose prior.

Tested at 220 TPS against an existing proxy endpoint(to make sure proxying was stable) while creating 5k new consumers + services + routes and plugins/acls and such associated much like a common gateway setup would have. Note I also still ran with the -1 setting on the router/plugin map timeouts.

One interesting error I saw 3-4 times while attempting to view consumers in the Konga UI was this error occasionally, only happened 4-5 times, I assume likely due to that theory you have around paging against multiple c* nodes(this occurred after 1000+ consumer entries existed):

2019/02/18 05:47:14 [error] 33#0: *1866033 [kong] init.lua:92 could not execute page query: [Server error] java.lang.NullPointerException, client: 10.xxx.xx.x, server: kong_admin, request: "GET /consumers?offset=EHEnlhiuOklMmea5HBERjgUA8H%2F%2F9wMA HTTP/1.1", host: "kong:8001"
2019/02/18 06:18:23 [error] 33#0: *3209162 [kong] init.lua:92 could not execute page query: [Server error] java.lang.NullPointerException, client: 10.xxx.xx.x, server: kong_admin, request: "GET /consumers?offset=EO6MK05a5Ei7oI%2BeSkyxHEoA8H%2F%2F%2B08A HTTP/1.1", host: "kong:8001"

Other things you may find interesting: As the cluster grew from 0 to 5K resources avg latency started around 70ms for me barebones and worked its way up to about 82ms @ 5k for me, not too big a deal really(12ms due to 5k resource growth, I imagine cause of cache size increases or something).

The growth I see on CPU utilization when constantly making resources is interesting. Consistently climbed in CPU growth till I reached my 5k resources. As soon as my remote admin API calls were finished it instantly dropped back down to normal levels. I think I have mentioned this before though.

Also neat watching the Kong Pod in the other DC ramp up with CPU utilization as well, means its staying in sync with the Kong pod in the current DC I am acting against. All good stuff.

I also took time to load test while flushing/deleting all 5k resources as well. Went smoothly!

bungle commented 5 years ago

Consistently climbed in CPU growth till I reached my 5k resources.

I think that is expected as the rebuilding takes longer and there is possibility that multiple workers do it at same time, than with lesser resources. I guess that explains the difference. It must be quite rare in real-world to rebuids happen that often.

bungle commented 5 years ago

2019/02/18 05:47:14 [error] 33#0: *1866033 [kong] init.lua:92 could not execute page query: [Server error] java.lang.NullPointerException

Here I think we are now hitting the same problem as before in router / plugins maps rebuilding. But this time at admin api.

@jeremyjpj0916, does all your Cassandra nodes run the exactly the same version?

@thibaultcha not sure if there is anything we can do in a client: https://www.mail-archive.com/user@cassandra.apache.org/msg45381.html

jeremyjpj0916 commented 5 years ago

@bungle agree on first point, under no real world scenario should Kong be constantly changing/resource growing as a gateway realistically on a second by second basis.

All our Cassandra nodes run: Cassandra 3.11.2

The previous fix in router/plugin map rebuild was to do keepalive to the C* host the specific query is being executing against correct? Would it make sense for the admin API calls to take the same approach?

bungle commented 5 years ago

The previous fix in router/plugin map rebuild was to do keepalive to the C* host the specific query is being executing against correct? Would it make sense for the admin API calls to take the same approach?

Yes, this was during the rebuild, but now admin api calls are two separate requests. That can endup in any node in cluster, in any worker in node. In cases where admin api actually executes multiple statements, it would make sense to have the same connection available through the whole phase. But I checked that /consumers call specifically. So there is no easy solution here. I would perhaps look into our Cassandra driver that can we do something there (e.g. a link I posted above). And then perhaps Cassandra (I guess they have fixed it already, but I think not in 3.11.2).

jeremyjpj0916 commented 5 years ago

@bungle Makes sense, if you see any indication that makes you think a later version of C* has this fixed that Kong supports I will certainly upgrade to it 👍 , I googled around the paging issue and never saw any specific changelog mentioning a problem like that was fixed in a specific version.

jeremyjpj0916 commented 5 years ago

Also preliminary, but it seems moving to the polling technique that was released over event driven seems to have my pods running cleaner on mem consumption(lower levels of memory consumed reported + growth seems slower or stabilizing). Note I upgraded to 1.03 from 1.02 at the same time this polling change was made so can't say 100% for sure exactly what it was but I assume the polling that improved it somehow.

bungle commented 5 years ago

@jeremyjpj0916,

Here is the latest PR, rebased today to our next branch: #4488. I hope we can include it in 1.2.0. If you have time, can you check about it. It contains some new tricks to make it even more performant (like the added dns cache prewarming).

jeremyjpj0916 commented 5 years ago

Still too early to drop it as an early alpha RC tarball for us to pull and test with from bintray eh? Our whole deployment model usually depends on that, but I can drop a number of 1.2 overriding files into say the 1.1* release candidate and test that way likely. Also looks like you had to drop the DNS cache prewarming, that was a neat idea, did it cache all DNS record lookups from service resources at the init of Kong to avoid needing to DNS query on the first hot path calls? Would be neat to see if it helps in a live environment, if I can find the snippit you added that does the dns cache bit ill drop it in manually to see if it actually helps any(even if it fails testsuite at the moment).

bungle commented 5 years ago

Yes, on this line: https://github.com/Kong/kong/blob/339aa124957c78b38e2152fe9bb4f3c01115d2f0/kong/runloop/handler.lua#L153

you can add:

kong.dns.toip(service.host)

It is just some of our tests that don't like it. We will figure that out and the festure will return I guess.

Edit: Need to think if it is better to slow down the worker startup a bit because of this (to make it warmer), or do DNS cache warmup in background.

jeremyjpj0916 commented 5 years ago

Very cool, yeah a simple one liner to make all the queries needed to build it out. I suppose my first concern around that without knowing much about how DNS nameservers behave is would there be circumstances where doing this in a prewarm manner might cause Kong to be blacklisted as a threat from nameservers temporarily for querying that many that quickly, or rate limited if say you have 5000 unique service hosts. If Kong's DNS lib does some error logging around that it that would help teams know if they are safe to run with that optimization or hitting walls that could cause problems briefly once Kong starts accepting proxy traffic(if blacklisted for example for a certain window). Not sure how common that would be, but other than that seems great because even we have seen hot path DNS querying at the p95 level to take 100ms+ occasionally with blips so if this works as well as it should in theory it would save on some early call latency. Would Kong report DNS lookup latency as a Kong latency(in some phase) or as upstream proxy latency? Guess I would have to do a bit of digging to see where that goes on 😄.

We are doing some fun investigations internally because we do have plenty of p99 spikes with regards to what we see around latency in general. As much as 900ms with even a few occurs of 5 second spikes and actually reportedly occurring at the gateway layer if what http-log dishes out is accurate. The avg is much lower, like 20-40ms generally. Always something to investigate in gateway land!

bungle commented 5 years ago

@jeremyjpj0916 the dns prewarming is now enabled on PR, but it is now done asynchronously, to make worker accept requests faster (even with a small possibility of request coming in with non-prewarmed upstream host).

bungle commented 5 years ago

I added also prewarming of dns on service creation and update. Plus I made a change to execute only plugins that do have a custom function, e.g. if your plugins doesn't implement :access, it will not be executed on :access phase ~(though, not sure what our test suite thinks about it)~ (vs. currently we execute it always as the BasePlugin has the default function.