Kong / kong

🦍 The Cloud-Native API Gateway and AI Gateway.
https://konghq.com/install/#kong-community
Apache License 2.0
39.02k stars 4.79k forks source link

graceful reload: kong reload on a machine with reasonable traffic and a go plugin leads to 404 and DOS #8531

Closed diefans closed 1 year ago

diefans commented 2 years ago

Is there an existing issue for this?

Kong version ($ kong version)

2.7.1

Current Behavior

When just reloading kong while the host receives reasonable traffic passed through a go plugin, all responses after reload have a status==404 rendering the service useless.

The error log shows:

kong_1               | 192.168.33.1 - - [10/Mar/2022:17:34:35 +0000] "GET /auth/version HTTP/1.1" 404 396 "-" "-"
kong_1               | 2022/03/10 17:34:35 [error] 3442#0: *1453547 [kong] pb_rpc.lua:365 [bm-authx] closed, client: 192.168.33.1, server: kong, request: "GET /auth/version HTTP/1.1", host: "example.com"
kong_1               | 192.168.33.1 - - [10/Mar/2022:17:34:35 +0000] "GET /auth/version HTTP/1.1" 404 396 "-" "-"
kong_1               | 2022/03/10 17:34:35 [error] 3442#0: *1453556 [kong] pb_rpc.lua:365 [bm-authx] closed, client: 192.168.33.1, server: kong, request: "GET /auth/version HTTP/1.1", host: "example.com"
kong_1               | 192.168.33.1 - - [10/Mar/2022:17:34:35 +0000] "GET /auth/version HTTP/1.1" 404 396 "-" "-"
kong_1               | 2022/03/10 17:34:35 [error] 3442#0: *1453548 [kong] pb_rpc.lua:365 [bm-authx] closed, client: 192.168.33.1, server: kong, request: "GET /auth/version HTTP/1.1", host: "example.com"

Expected Behavior

kong reloads gracefully without notice to a client.

Steps To Reproduce

1. involve go plugin route in a loadtest
2. reload kong e.g. `kong reload`

Anything else?

We believe, this is related to how kong is managing recreating its workers and the go plugin process, e.g. binding new workers to the old plugin process and stopping it the latter.

diefans commented 2 years ago

It seems I am able to "fix" the big 404 issue by resetting the plugin instance:

diff --git a/kong/runloop/plugin_servers/pb_rpc.lua b/kong/runloop/plugin_servers/pb_rpc.lua
index 38f3aafe7..ae4531cdd 100644
--- a/kong/runloop/plugin_servers/pb_rpc.lua
+++ b/kong/runloop/plugin_servers/pb_rpc.lua
@@ -396,7 +396,7 @@ function Rpc:handle_event(plugin_name, conf, phase)
   if not res then
     kong.log.err(err)

-    if string.match(err:lower(), "no plugin instance") then
+    if string.match(err:lower(), "no plugin instance") or err:lower() == "closed" then
       self.reset_instance(plugin_name, conf)
       return self:handle_event(plugin_name, conf, phase)
     end

But still the reload is not graceful - at least only 3 requests fail due to connection closed, probably related to those 3 workers affected:

20002:kong_1                | 2022/03/14 16:36:24 [error] 1195#0: *21628 [kong] pb_rpc.lua:397 [bm-authx] closed, client: 192.168.33.1, server: kong, request: "GET /auth/version HTTP/1.1",host: "example.com"
20003:kong_1                | 2022/03/14 16:36:24 [error] 1195#0: *21630 [kong] pb_rpc.lua:397 [bm-authx] closed, client: 192.168.33.1, server: kong, request: "GET /auth/version HTTP/1.1",host: "example.com"
20005:kong_1                | 2022/03/14 16:36:24 [error] 1195#0: *21624 [kong] pb_rpc.lua:397 [bm-authx] closed, client: 192.168.33.1, server: kong, request: "GET /auth/version HTTP/1.1",host: "example.com"
❯ cargo run --release --example goose -- --host https://example.com/auth/version -t 10 --no-reset-metrics
    Finished release [optimized] target(s) in 0.14s
     Running `target/release/examples/goose --host 'https://example.com/auth/version' -t 10 --no-reset-metrics`
16:36:23 [WARN] "/auth/version": error sending request for url (https://example.com/auth/version): connection closed before message completed
16:36:23 [WARN] "/auth/version": error sending request for url (https://example.com/auth/version): connection closed before message completed
16:36:23 [WARN] "/auth/version": error sending request for url (https://example.com/auth/version): connection closed before message completed
All 8 users hatched.

 === PER TASK METRICS ===
 ------------------------------------------------------------------------------
 Name                     |   # times run |        # fails |   task/s |  fail/s
 ------------------------------------------------------------------------------
 1: LoadtestTasks         |
   1:                     |             8 |         0 (0%) |     0.80 |    0.00
   2:                     |        68,858 |         0 (0%) |     6886 |    0.00
 -------------------------+---------------+----------------+----------+--------
 Aggregated               |        68,866 |         0 (0%) |     6887 |    0.00
 ------------------------------------------------------------------------------
 Name                     |    Avg (ms) |        Min |         Max |     Median
 ------------------------------------------------------------------------------
 1: LoadtestTasks         |
   1:                     |       11.12 |          7 |          14 |         10
   2:                     |        1.13 |          1 |         105 |          1
 -------------------------+-------------+------------+-------------+-----------
 Aggregated               |        1.13 |          1 |         105 |          1

 === PER REQUEST METRICS ===
 ------------------------------------------------------------------------------
 Name                     |        # reqs |        # fails |    req/s |  fail/s
 ------------------------------------------------------------------------------
 GET /auth/version        |        68,858 |         3 (0%) |     6886 |    0.30
 ------------------------------------------------------------------------------
 Name                     |    Avg (ms) |        Min |         Max |     Median
 ------------------------------------------------------------------------------
 GET /auth/version        |        1.11 |          1 |         105 |          1
 ------------------------------------------------------------------------------
 Slowest page load within specified percentile of requests (in ms):
 ------------------------------------------------------------------------------
 Name                     |    50% |    75% |    98% |    99% |  99.9% | 99.99%
 ------------------------------------------------------------------------------
 GET /auth/version        |      1 |      1 |      4 |      5 |     11 |     34

 === ERRORS ===
 ------------------------------------------------------------------------------
 Count       | Error
 ------------------------------------------------------------------------------
 3             GET /auth/version: error sending request /auth/version: connection closed before message completed
 ------------------------------------------------------------------------------
 ------------------------------------------------------------------------------
 Users: 8
 Target host: https://example.com/auth/version
 Starting: 2022-03-14 17:36:17 - 2022-03-14 17:36:25 (duration: 00:00:08)
 Running:  2022-03-14 17:36:25 - 2022-03-14 17:36:35 (duration: 00:00:10)
 Stopping: 2022-03-14 17:36:35 - 2022-03-14 17:36:35 (duration: 00:00:00)

 goose v0.15.2
 ------------------------------------------------------------------------------
HairyMike commented 2 years ago

I think we're hitting this on kong version 2.8.0. Slightly different line number:

2022/03/30 01:04:30 [error] 26036#0: *1644465 [kong] pb_rpc.lua:397 [rtoken] closed, client: 172.16.80.250, server: kong, request: "GET ***redacted*** HTTP/1.1", host: "20.89.27.110"

https://github.com/Kong/kong/blob/master/kong/runloop/plugin_servers/pb_rpc.lua#L397

diefans commented 2 years ago

...still testing....

kong reached the following state, after continuous reloading under load, with the above patch

Mem: 30873812K used, 1917048K free, 222636K shrd, 2792924K buff, 11316512K cached
CPU:   5% usr   1% sys   0% nic  93% idle   0% io   0% irq   0% sirq
Load average: 0.65 2.36 3.38 5/2240 1959
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
 1477     1 kong     S     711m   2%   4   0% /go-plugins/bm-authx -kong-prefix /usr/local/ko
 1504     1 kong     S     419m   1%   2   0% nginx: worker process
 1370     1 kong     S     711m   2%   5   0% /go-plugins/bm-authx -kong-prefix /usr/local/ko
 1231     1 kong     S     711m   2%   2   0% /go-plugins/bm-authx -kong-prefix /usr/local/ko
 1421     1 kong     S     711m   2%   3   0% /go-plugins/bm-authx -kong-prefix /usr/local/ko
 1509  1499 kong     S     711m   2%   5   0% /go-plugins/bm-authx -kong-prefix /usr/local/ko
 1499     1 kong     S     450m   1%   6   0% nginx: worker process
 1501     1 kong     S     438m   1%   1   0% nginx: worker process
 1505     1 kong     S     419m   1%   4   0% nginx: worker process
 1502     1 kong     S     419m   1%   6   0% nginx: worker process
 1500     1 kong     S     419m   1%   2   0% nginx: worker process
 1506     1 kong     S     419m   1%   4   0% nginx: worker process
 1503     1 kong     S     419m   1%   2   0% nginx: worker process
    1     0 kong     S     381m   1%   1   0% nginx: master process /usr/local/openresty/ngin
 1189     0 kong     S     2592   0%   5   0% bash
 1520  1189 kong     R     1600   0%   0   0% top

So there are now a couple of plugin processes hanging around... all but one lost their worker process.

diefans commented 2 years ago

I created https://github.com/bettermarks/kong_8531 to easy reproduce this issue.

Also interesting - JavaScript plugins fail even worse, while Python plugins seem to repair itself and nearly do not mess up.

The 404 from the issue title is a by-product of the plugin being skipped/omitted due to its failure, where the request reached the upstream in a unintended way. So this issue is related also to https://github.com/Kong/kong/issues/8546.

mayocream commented 1 year ago

Fixed by https://github.com/Kong/kong/pull/8923

chithanh12 commented 1 year ago

I think we're hitting this on kong version 2.8.0. Slightly different line number:

2022/03/30 01:04:30 [error] 26036#0: *1644465 [kong] pb_rpc.lua:397 [rtoken] closed, client: 172.16.80.250, server: kong, request: "GET ***redacted*** HTTP/1.1", host: "20.89.27.110"

https://github.com/Kong/kong/blob/master/kong/runloop/plugin_servers/pb_rpc.lua#L397

I got the same issue but don't know how to fixed it. I use the latest version of go pdk. How did you solve it?