Kong / kong

🦍 The Cloud-Native API Gateway and AI Gateway.
https://konghq.com/install/#kong-community
Apache License 2.0

Kong 2.1.3 peaking CPU Utilization on minimal ingress traffic #6331

Closed jeremyjpj0916 closed 4 years ago

jeremyjpj0916 commented 4 years ago

Summary

Both of our Stage environments, across multiple regions, are maxing out 6/6 CPU with minimal traffic. Here is one for reference:

Stage 2.1.3:

image

Added new ENV configs:

            - name: KONG_UPSTREAM_KEEPALIVE_IDLE_TIMEOUT
              value: '30'
            - name: KONG_UPSTREAM_KEEPALIVE_MAX_REQUESTS
              value: '50000'
            - name: KONG_UPSTREAM_KEEPALIVE_POOL_SIZE
              value: '400'
            - name: KONG_WORKER_CONSISTENCY
              value: eventual
            - name: KONG_WORKER_STATE_UPDATE_FREQUENCY
              value: '5'
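
For reference, a quick way to confirm such overrides were actually picked up by a running node (an illustrative check, using the Admin API on port 8001 seen later in this thread; the root endpoint reports the node's effective configuration):

# Show the effective keepalive/worker settings of the running node
curl -s localhost:8001/ | grep -oE '"(upstream_keepalive_idle_timeout|upstream_keepalive_max_requests|upstream_keepalive_pool_size|worker_consistency|worker_state_update_frequency)":[^,]*'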

At first I thought maybe it was due to TLS handshaking and poor connection management with the new dynamic keepalive concept, but I don't think it's that: a netstat showed me only 50-70 TIME_WAIT connections, so I think we are reusing TLS connections appropriately. Even if we were not, the traffic levels in Stage are too low to warrant this kind of jump to maxed-out CPU over just extra handshakes.
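
For reference, the kind of state check described here can be done from inside the pod with something like the following (illustrative, not the exact command that was run):

# Tally TCP connections by state (ESTABLISHED, TIME_WAIT, ...) to gauge
# whether upstream TLS connections are being reused rather than re-opened
netstat -ant | awk 'NR > 2 { states[$6]++ } END { for (s in states) print s, states[s] }'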

Balancer section shows:

image

And logging output says:

nginx: [warn] load balancing method redefined in /usr/local/kong/nginx.conf:109

Which makes me think the dynamic upstream pooling is doing its thing. The old directive timeouts still being present in there worry me, though: do they just do nothing since I have the KONG_UPSTREAM_* env variables set?

Edit - The memory utilization seems suspicious as well. Let me check the Lua GC stats a bit.

      location /gc {
        content_by_lua_block {
            -- collectgarbage("count") reports the memory in KB currently
            -- held by the Lua GC of whichever worker serves this request
            ngx.say(string.format("Worker %d: GC size: %.3f KB", ngx.var.pid, collectgarbage("count")))
        }
      }

All right, some GC stats here from a few quick curls back to back:

/ $ curl localhost:8001/gc
Worker 16357: GC size: 228216.283 KB
/ $ curl localhost:8001/gc
Worker 16356: GC size: 275980.139 KB
/ $ curl localhost:8001/gc
Worker 16353: GC size: 370928.317 KB
/ $ curl localhost:8001/gc
Worker 16352: GC size: 297587.145 KB
/ $ curl localhost:8001/gc
Worker 16353: GC size: 370971.472 KB
/ $ curl localhost:8001/gc
Worker 16357: GC size: 230060.206 KB
/ $ curl localhost:8001/gc
Worker 16356: GC size: 276916.122 KB
/ $ curl localhost:8001/gc
Worker 16356: GC size: 276934.148 KB
/ $ curl localhost:8001/gc
Worker 16357: GC size: 230323.086 KB
/ $ curl localhost:8001/gc
Worker 16353: GC size: 371010.115 KB
/ $ curl localhost:8001/gc
Worker 16356: GC size: 276998.841 KB
/ $ curl localhost:8001/gc
Worker 16356: GC size: 277019.045 KB
/ $ curl localhost:8001/gc
Worker 16354: GC size: 195916.395 KB

Prod 2.0.5:

image

Both environments run the same plugins and such.

Edit - adding GC output here too. Not actually a huge difference (although Prod does way more traffic in general), certainly not enough to explain the heavy memory consumption seen in Stage:

/ $ curl localhost:8001/gc
Worker 89: GC size: 136691.558 KB
/ $ curl localhost:8001/gc
Worker 89: GC size: 137774.161 KB
/ $ curl localhost:8001/gc
Worker 90: GC size: 150929.955 KB
/ $ curl localhost:8001/gc
Worker 89: GC size: 138182.485 KB
/ $ curl localhost:8001/gc
Worker 86: GC size: 122922.341 KB
/ $ curl localhost:8001/gc
Worker 90: GC size: 152281.531 KB
/ $ curl localhost:8001/gc
Worker 86: GC size: 123804.118 KB
/ $ curl localhost:8001/gc
Worker 90: GC size: 154103.277 KB
/ $ curl localhost:8001/gc
Worker 88: GC size: 97121.837 KB
/ $ curl localhost:8001/gc
Worker 90: GC size: 172375.127 KB
/ $ curl localhost:8001/gc
Worker 87: GC size: 167126.220 KB
/ $ curl localhost:8001/gc
Worker 86: GC size: 145722.125 KB
/ $ curl localhost:8001/gc
Worker 90: GC size: 174195.557 KB
/ $ curl localhost:8001/gc

No error logs stand out much in the 2.1.3 environments for this. It seems like a hot-path inefficiency, or some global mapper rebuilds being done constantly in a hot path, something of that nature. It has that feel anyway.

Will update this issue as I get more info or any insights. If Kong has any ideas for dropping in heavier debugging, let me know and I will try it in STG too.

Additional Details & Logs

jeremyjpj0916 commented 4 years ago

Just did a rolling redeploy of these peaked STG pods. Going to watch the CPU/MEM growth closely today too.

hishamhm commented 4 years ago

"CPU 100%" reminded me of this other issue, I wonder if it's the same:

https://github.com/Kong/kong/issues/6178

hishamhm commented 4 years ago

@jeremyjpj0916 do you happen to see anything like this in the logs, as was reported on that other issue?

events.lua:155: post_event(): worker-events: could not write to shm after 6 tries (no memory)
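
A simple way to check for that message (illustrative; the log path assumes the default Kong prefix layout, and containerized setups may send it to stderr instead):

# Search the Kong error log for the worker-events shm exhaustion warning
grep "could not write to shm" /usr/local/kong/logs/error.log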

jeremyjpj0916 commented 4 years ago

Nothing like that. Oddly enough, I did a rolling redeploy of all network zones on 2.1.3 and am watching again right now, and things have calmed down everywhere. A few examples below:

image

image

image

image

image

So right now I'm not seeing it; everything is behaving as expected. But earlier today, traffic was all the way down while CPU and MEM were sky high. Will see if something triggers it again over the next 5-6 days. We will certainly hold off on Prod 2.1.3 for a bit until we understand this better.

jeremyjpj0916 commented 4 years ago

Okay, it seems to have done it again: at 19:37 I saw it at 3.x CPU utilized, and by 20:27 it was at 5.95:

image

image

image

Odd that some are dropping after the spike too, close to 2:30 AM EST. Likely related to this in the logs:

image

I've seen this nil pointer error with the rate limiting plugin a few times in the logs too:

image

Other network zone, 2 different DCs tied to the same DB cluster:

image

image

Hmm, what happened around 20:00 EST (8 PM)? It started a bit before, lasted a bit after, caused such a huge spike, and then subsequently dropped off.

Did see this in logs too:

image

So the server logs are in UTC, which is 5 hours ahead... that puts it close to the CPU spike drop-off, eh... hmm.

Some envs are getting pegged at 100% and staying there, while some get up there and then come back down but throw some graceful shutdowns or something... hmm, super strange.

jeremyjpj0916 commented 4 years ago

Wish I had caught it, but at one point I saw a flash in top where the worker processes all spiked to a crazy CPU utilization, like 233% each or something. Also, we run with 6 worker processes, and here I see 6 running and about 6 "dead" ones too on a Kong node, running top from a pod container shell in OpenShift:

image

Not every env with the spikes shows the 12 (6 running and 6 not) either; it seems some show just 6 worker processes too. Well, I think they all eventually show 12 after a while, likely when the workers are signal 17'ing.
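
For anyone trying to tell the lingering workers apart from the fresh ones, something like this shows each worker's start time and state (a sketch, assuming a procps-style ps is available in the container):

# Old workers that are still draining after a reload show an older start
# time, and nginx retitles them "worker process is shutting down"
ps -eo pid,ppid,lstart,stat,args | grep '[n]ginx: worker'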

2.0.5 environments only show 6 worker processes via top consistently:

image

jeremyjpj0916 commented 4 years ago

It seems that when the 2.1.3 nodes are pegged at 100% utilization, I am seeing 12 worker processes via top instead of 6.

Thinking beyond just Kong's runtime, in case something else could trigger it: we run a daily cron on both 2.0.5 and 2.1.3 that updates a CRL reference and reloads Kong:

# Refresh CRL for MTLS
0 0 0 * * * * wget ${CRL_DOWNLOAD_URL} -O /usr/local/kong/stargate-mutual-trust.crl.pem && kong reload -p "/usr/local/kong" --nginx-conf "/usr/local/kong/optum_nginx.template"

Which would be daily at midnight UTC, eh? Let's ensure my nodes are on UTC, which is usually the default.

/ $ date
Sun Sep 13 23:16:46 UTC 2020

UTC being 5 hours ahead of EST, and my OpenShift pics above from the spikes being in EST, that's 24:00 - 5 = 19:00 (7 PM EST)... I mean, the spikes seem to start curving up close to 19:00 but don't peak until after... I wonder if somehow kong reload on the 2.1.x series is the cause. Let me try manually executing the reload with this and watching utilization:

kong reload -p "/usr/local/kong" --nginx-conf "/usr/local/kong/optum_nginx.template"

Also for notes: multi_execute.sh just ensures that when starting the container we have supercronic running alongside Kong. We do this with 2.0.5 too, so it can't be part of the problem, but you can see it in the top output above:

#!/bin/bash

#Execute both applications in parallel
/usr/local/kong/openresty/nginx/sbin/nginx -c '/usr/local/kong/nginx.conf' -p '/usr/local/kong/' &
P1=$!
supercronic /usr/local/crons/stargate_crons &
P2=$!
#Block until both background processes exit so the container stays up
wait $P1 $P2

And our Dockerfile ends with a:

CMD /usr/local/kong/multi_execute.sh

jeremyjpj0916 commented 4 years ago

Okay, after running my kong reload manually I saw it say the reload was successful, and in top I see 12 worker processes:

image

Output at time of the manual reload:

image

Watching CPU/MEM now to see if we start skyrocketing.

jeremyjpj0916 commented 4 years ago

YESSS THAT'S IT ^^^^ ... after the reload my pod just goes berserk on CPU!!! Found the issue.

image

Now the question is why... 2.0.5 has no issues reloading a running node daily every night, so why does 2.1.3? The 6 zombie worker processes seem to be part of it; checking all my 2.0.5 nodes in Prod, none exhibit lingering worker processes after the reload we run. Up to the Kong think tank to take it from here and figure out what the culprit might be.

cc @hishamhm .

jeremyjpj0916 commented 4 years ago

Okay, then it seemed to stay like that (12 worker processes present) with pegged CPU for 12 minutes actually... THEN I see this in the logs:

image

And it was around this time that top showed the worker processes going back from 12 to 6 again. But that's basically 12 full minutes of 100% CPU utilization after the reload for us.
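
A crude way to measure how long the old workers hang around after a reload (purely illustrative, same procps-style ps assumption as above):

# Print a timestamped nginx worker count every 30 seconds; after a clean
# reload this should settle back to the configured 6 fairly quickly
while true; do
  echo "$(date -u +%H:%M:%S) workers=$(ps -eo args | grep -c '[n]ginx: worker')"
  sleep 30
done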

The pod metrics reflect this too:

image

The remaining question I have is: why do some pods spike to 100% CPU utilization and then never come back down, with the old worker processes only going away after an indeterminable, seemingly endless amount of time?

jeremyjpj0916 commented 4 years ago

I also took the time to find the moment where a Prod 2.0.5 kong reload was being executed; see as follows:

time="2020-09-14T00:00:00Z" level=info msg="2020-09-14 00:00:00 (647 MB/s) - '/usr/local/kong/stargate-mutual-trust.crl.pem' saved [3192/3192]" channel=stderr iteration=0 job.command="wget ${CRL_DOWNLOAD_URL} -O /usr/local/kong/stargate-mutual-trust.crl.pem && kong reload -p \"/usr/local/kong\" --nginx-conf \"/usr/local/kong/optum_nginx.template\"" job.position=2 job.schedule="0 0 0 * * * *"
time="2020-09-14T00:00:00Z" level=info channel=stderr iteration=0 job.command="wget ${CRL_DOWNLOAD_URL} -O /usr/local/kong/stargate-mutual-trust.crl.pem && kong reload -p \"/usr/local/kong\" --nginx-conf \"/usr/local/kong/optum_nginx.template\"" job.position=2 job.schedule="0 0 0 * * * *"
2020/09/14 00:00:00 [notice] 45#0: signal 1 (SIGHUP) received from 87686, reconfiguring
2020/09/14 00:00:00 [notice] 45#0: reconfiguring
time="2020-09-14T00:00:00Z" level=info msg="Kong reloaded" channel=stdout iteration=0 job.command="wget ${CRL_DOWNLOAD_URL} -O /usr/local/kong/stargate-mutual-trust.crl.pem && kong reload -p \"/usr/local/kong\" --nginx-conf \"/usr/local/kong/optum_nginx.template\"" job.position=2 job.schedule="0 0 0 * * * *"
time="2020-09-14T00:00:00Z" level=info msg="job succeeded" iteration=0 job.command="wget ${CRL_DOWNLOAD_URL} -O /usr/local/kong/stargate-mutual-trust.crl.pem && kong reload -p \"/usr/local/kong\" --nginx-conf \"/usr/local/kong/optum_nginx.template\"" job.position=2 job.schedule="0 0 0 * * * *"
2020/09/14 00:00:00 [notice] 45#0: ModSecurity-nginx v1.0.1 (rules loaded inline/local/remote: 1/948/0)
2020/09/14 00:00:00 [crit] 92#0: *20535450 [lua] init.lua:298: [cluster_events] no 'at' in shm, polling events from: 1600041585.839, context: ngx.timer
2020/09/14 00:00:02 [notice] 45#0: using the "epoll" event method
2020/09/14 00:00:02 [notice] 45#0: start worker processes
2020/09/14 00:00:02 [notice] 45#0: start worker process 87739
2020/09/14 00:00:02 [notice] 45#0: start worker process 87740
2020/09/14 00:00:02 [notice] 45#0: start worker process 87741
2020/09/14 00:00:02 [notice] 45#0: start worker process 87742
2020/09/14 00:00:02 [notice] 45#0: start worker process 87743
2020/09/14 00:00:02 [notice] 45#0: start worker process 87744
2020/09/14 00:00:02 [warn] 87739#0: *20536225 [lua] socket.lua:159: tcp(): no support for cosockets in this context, falling back to LuaSocket, context: init_worker_by_lua*
2020/09/14 00:00:02 [warn] 87741#0: *20536230 [lua] socket.lua:159: tcp(): no support for cosockets in this context, falling back to LuaSocket, context: init_worker_by_lua*
2020/09/14 00:00:02 [warn] 87742#0: *20536232 [lua] socket.lua:159: tcp(): no support for cosockets in this context, falling back to LuaSocket, context: init_worker_by_lua*
2020/09/14 00:00:02 [warn] 87743#0: *20536233 [lua] socket.lua:159: tcp(): no support for cosockets in this context, falling back to LuaSocket, context: init_worker_by_lua*
2020/09/14 00:00:02 [warn] 87740#0: *20536231 [lua] socket.lua:159: tcp(): no support for cosockets in this context, falling back to LuaSocket, context: init_worker_by_lua*
2020/09/14 00:00:02 [warn] 87744#0: *20536234 [lua] socket.lua:159: tcp(): no support for cosockets in this context, falling back to LuaSocket, context: init_worker_by_lua*
2020/09/14 00:00:02 [warn] 87741#0: *20536230 [lua] globalpatches.lua:55: sleep(): executing a blocking 'sleep' (0.001 seconds), context: init_worker_by_lua*
2020/09/14 00:00:02 [warn] 87743#0: *20536233 [lua] globalpatches.lua:55: sleep(): executing a blocking 'sleep' (0.001 seconds), context: init_worker_by_lua*
2020/09/14 00:00:02 [notice] 87739#0: *20536225 [lua] cache_warmup.lua:46: cache_warmup_single_entity(): Preloading 'services' into the core_cache..., context: init_worker_by_lua*
2020/09/14 00:00:02 [warn] 87739#0: *20536225 [lua] socket.lua:159: tcp(): no support for cosockets in this context, falling back to LuaSocket, context: init_worker_by_lua*
2020/09/14 00:00:02 [warn] 87741#0: *20536230 [lua] globalpatches.lua:55: sleep(): executing a blocking 'sleep' (0.002 seconds), context: init_worker_by_lua*
2020/09/14 00:00:02 [notice] 87#0: gracefully shutting down
2020/09/14 00:00:02 [notice] 88#0: gracefully shutting down
2020/09/14 00:00:02 [notice] 89#0: gracefully shutting down
2020/09/14 00:00:02 [notice] 92#0: gracefully shutting down
2020/09/14 00:00:02 [notice] 91#0: gracefully shutting down
2020/09/14 00:00:02 [notice] 45#0: signal 17 (SIGCHLD) received from 87711
2020/09/14 00:00:02 [notice] 90#0: gracefully shutting down
2020/09/14 00:00:02 [notice] 87739#0: *20536225 [lua] cache_warmup.lua:85: cache_warmup_single_entity(): finished preloading 'services' into the core_cache (in 269ms), context: init_worker_by_lua*
2020/09/14 00:00:02 [notice] 87739#0: *20536225 [lua] cache_warmup.lua:46: cache_warmup_single_entity(): Preloading 'plugins' into the core_cache..., context: init_worker_by_lua*
2020/09/14 00:00:02 [warn] 87739#0: *20536225 [lua] socket.lua:159: tcp(): no support for cosockets in this context, falling back to LuaSocket, context: init_worker_by_lua*
2020/09/14 00:00:02 [notice] 89#0: exiting
2020/09/14 00:00:03 [warn] 87739#0: *20536225 [lua] socket.lua:159: tcp(): no support for cosockets in this context, falling back to LuaSocket, context: init_worker_by_lua*
2020/09/14 00:00:03 [notice] 90#0: exiting
2020/09/14 00:00:02 [notice] 89#0: exit
2020/09/14 00:00:03 [notice] 45#0: signal 17 (SIGCHLD) received from 89
2020/09/14 00:00:03 [notice] 45#0: worker process 89 exited with code 0
2020/09/14 00:00:03 [notice] 88#0: exiting
2020/09/14 00:00:03 [warn] 87739#0: *20536225 [lua] socket.lua:159: tcp(): no support for cosockets in this context, falling back to LuaSocket, context: init_worker_by_lua*
2020/09/14 00:00:03 [notice] 92#0: exiting
2020/09/14 00:00:03 [notice] 91#0: exiting
2020/09/14 00:00:03 [notice] 90#0: exit
2020/09/14 00:00:03 [notice] 45#0: signal 17 (SIGCHLD) received from 90
2020/09/14 00:00:03 [notice] 45#0: worker process 90 exited with code 0
2020/09/14 00:00:03 [notice] 88#0: exit
2020/09/14 00:00:03 [notice] 45#0: signal 17 (SIGCHLD) received from 88
2020/09/14 00:00:03 [notice] 45#0: worker process 88 exited with code 0
2020/09/14 00:00:03 [notice] 87739#0: *20536225 [lua] cache_warmup.lua:85: cache_warmup_single_entity(): finished preloading 'plugins' into the core_cache (in 389ms), context: init_worker_by_lua*
2020/09/14 00:00:03 [notice] 87739#0: *20536225 [lua] cache_warmup.lua:46: cache_warmup_single_entity(): Preloading 'consumers' into the cache..., context: init_worker_by_lua*
2020/09/14 00:00:03 [warn] 87739#0: *20536225 [lua] socket.lua:159: tcp(): no support for cosockets in this context, falling back to LuaSocket, context: init_worker_by_lua*
2020/09/14 00:00:03 [warn] 87739#0: *20536225 [lua] socket.lua:159: tcp(): no support for cosockets in this context, falling back to LuaSocket, context: init_worker_by_lua*
2020/09/14 00:00:03 [notice] 91#0: exit
2020/09/14 00:00:03 [notice] 45#0: signal 17 (SIGCHLD) received from 91
2020/09/14 00:00:03 [notice] 45#0: worker process 91 exited with code 0
2020/09/14 00:00:03 [notice] 92#0: exit
2020/09/14 00:00:03 [notice] 45#0: signal 17 (SIGCHLD) received from 92
2020/09/14 00:00:03 [notice] 45#0: worker process 92 exited with code 0
2020/09/14 00:00:03 [warn] 87739#0: *20536225 [lua] socket.lua:159: tcp(): no support for cosockets in this context, falling back to LuaSocket, context: init_worker_by_lua*
2020/09/14 00:00:03 [notice] 87739#0: *20536225 [lua] cache_warmup.lua:85: cache_warmup_single_entity(): finished preloading 'consumers' into the cache (in 104ms), context: init_worker_by_lua*
2020/09/14 00:00:03 [notice] 45#0: signal 29 (SIGIO) received
2020/09/14 00:00:03 [notice] 87739#0: *20536748 [lua] cache_warmup.lua:25: warming up DNS entries ..., context: ngx.timer
2020/09/14 00:00:05 [notice] 87#0: exiting
2020/09/14 00:00:05 [notice] 87#0: exit
2020/09/14 00:00:05 [notice] 45#0: signal 17 (SIGCHLD) received from 87
2020/09/14 00:00:05 [notice] 45#0: worker process 87 exited with code 0
2020/09/14 00:00:05 [notice] 45#0: signal 29 (SIGIO) received

See how the new worker processes came up and the old ones gracefully began shutting down at the same moment in time? Yeah, 2.1.3 does not do that for some reason.
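
For context, kong reload signals the nginx master with SIGHUP (visible as "signal 1 (SIGHUP) received ... reconfiguring" above), which is what should produce exactly this pairing of "start worker processes" and "gracefully shutting down". The equivalent manual signal (path assumes the standard Kong prefix layout) would be roughly:

# SIGHUP makes the master re-read its config, fork new workers, and ask
# the old workers to finish in-flight requests and exit
kill -HUP "$(cat /usr/local/kong/pids/nginx.pid)"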

I think it's also worth noting that OpenResty could be at play here as a culprit as well:

Kong 2.0.5: OpenResty: 1.15.8.2

Kong 2.1.3: OpenResty: 1.15.8.3

And both use Nginx 1.15.8, so I don't think the web server itself is to blame here.

So maybe a potential way to reproduce this would be to run a Kong node taking active traffic (well, ours do anyway) and then have it hot reload in place with a template, etc. Unsure if my specific plugins/services/routes matter. Actually, I can reproduce it in our dev environment running 2.1.3 too, which only has about 30 proxies plus the various global plugins we run, so it's not necessarily the size and scale of the Kong environment playing a role either. Also, dev basically only gets health checks and hourly functional testing automation, so traffic there is super minimal; starting to think traffic really doesn't matter either.
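
A rough reproduction sketch along those lines (assuming the default proxy port 8000; the reload command and template path are the same ones used above):

# Keep light traffic flowing at the proxy while issuing an in-place reload,
# then check whether the old workers linger and CPU climbs
while true; do curl -s -o /dev/null http://localhost:8000/; sleep 0.1; done &
TRAFFIC=$!
kong reload -p "/usr/local/kong" --nginx-conf "/usr/local/kong/optum_nginx.template"
sleep 60
top -b -n 1 | grep 'nginx: worker'
kill $TRAFFIC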

hishamhm commented 4 years ago

Thanks for all the investigation! We'll try to reproduce it on our end. I guess stress-testing reload on a Kong Cassandra cluster will be a good starting point, though it's not even clear whether the choice of DB matters.

jeremyjpj0916 commented 4 years ago

I would speculate that the DB does not matter, but I did not spin up a Postgres node to confirm that.

jeremyjpj0916 commented 4 years ago

Just tested and confirmed: https://github.com/Kong/kong/pull/6300 fixes this issue for Kong nodes running in DB mode too. Thanks @locao!