Kong / kong

🦍 The Cloud-Native API Gateway and AI Gateway.
https://konghq.com/install/#kong-community
Apache License 2.0

Plugin_iterator fail #6521

Closed: miketwenty1 closed this 3 years ago

miketwenty1 commented 3 years ago

Summary

After some time we were no longer able to spin up a new Kong node as part of a cluster (v2.1.4); we get the error below. So far, wiping the DB is the only thing that has let new nodes spin up after we encountered the error. We would like to know what to do if we get this error in the future.

nginx: [error] init_by_lua error: /usr/local/share/lua/5.1/kong/runloop/plugins_iterator.lua:378: table index is nil
stack traceback:
    /usr/local/share/lua/5.1/kong/runloop/plugins_iterator.lua:378: in function 'new'
    /usr/local/share/lua/5.1/kong/runloop/handler.lua:488: in function 'build_plugins_iterator'
    /usr/local/share/lua/5.1/kong/init.lua:497: in function 'init'
    init_by_lua:3: in main chunk

  Run with --v (verbose) or --vv (debug) for more details

Steps To Reproduce

  1. kong start -c /etc/kong/kong.conf
  2. error (above)
  3. profit

Additional Details & Logs

Edit: Wanted to let you know that trying to add a Kong node running 2.2.0 resulted in the same error. Not sure what in the database got corrupted, or how, in relation to this plugins iterator index issue. We did have about 50 or so JWT plugin uses and 1 Prometheus plugin use. Hope that helps.

bungle commented 3 years ago

Just a note here.

In the plugins iterator we have this code:

local ws_id = workspaces.get_workspace_id() or kong.default_workspace
local ws = {
  [ws_id] = new_ws_data()
}

The error happens in the init phase, which means that workspaces.get_workspace_id() will return nil:

function workspaces.get_workspace_id()
  local r = base.get_request()
  if not r then
    return nil
  end

  return ngx.ctx.workspace or kong.default_workspace
end

as there is no request.

This is the code that should find the default workspace, and it is where the problem starts:

local default_ws = db.workspaces:select_by_name("default")
kong.default_workspace = default_ws and default_ws.id

local ok, err = runloop.build_plugins_iterator("init")

For some reason, selecting the default workspace fails for @miketwenty1.
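
If you can query the datastore directly, something along these lines (a rough Postgres-flavoured sketch against the workspaces table, not the exact query Kong issues; adapt for cqlsh if you are on Cassandra) would show whether the row that select_by_name("default") depends on exists under exactly that name:

-- rough diagnostic sketch: is there a workspace whose trimmed name is exactly 'default'?
SELECT id, name, length(name) AS name_len
FROM workspaces
WHERE trim(name) = 'default';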

@miketwenty1, are you running postgres or cassandra?

miketwenty1 commented 3 years ago

@bungle postgres v11.4 through AWS RDS.

bungle commented 3 years ago

@miketwenty1, can you confirm that the new nodes connected to exactly the same database? Did you try restarting an old node?

miketwenty1 commented 3 years ago

@bungle yes, same DB. We had multiple clusters, each with its own Postgres DB, and we got the same issue in each cluster environment. Nodes that were already set up weren't getting an error. I never attempted to restart a specific node due to our architecture / autoscaling setup. I imagine killing the Kong process and starting it again on a node that was working might have resulted in the same error, but I'm unsure. My intent wasn't to stop/hurt the nodes that were working.

Edit: It may be worth mentioning that this Kong setup was originally spun up on 1.5.0 and we used the migration commands to go to 2.1.4, which worked; running these commands again output something along the lines of "nothing left to migrate, migration already done". Not sure if this is related. We didn't update the old Kong nodes in place; we cycled them out for 2.1.4 nodes. The DB remained the same. But also during this time we implemented the JWT plugin and the Prometheus plugin. So, there are several variables that may have caused this.

bungle commented 3 years ago

@miketwenty1, did you do an in-place update? Could it be that some of the old Kong code, or old Kong dependencies like LuaRocks modules, were still lying around? Could that explain it?

miketwenty1 commented 3 years ago

@bungle I may have been unclear, or I'm not sure what you mean by an "in-place" update. The Kong nodes were running when we ran the DB migration.

We didn't do anything special with dependencies. We used https://github.com/Kong/kong-terraform-aws to initially deploy Kong at 1.5.0.

We brought new nodes running 2.1.4 into the cluster as we cycled away the old 1.5.0 nodes. The 2.1.4 nodes failed; wiping the database fixed our issue, and the 2.1.4 nodes rebuilt the backend tables and such (if my memory serves me correctly). Essentially the DB migration didn't work the way we did it. We were trying to do zero downtime.

bungle commented 3 years ago

@miketwenty1,

Ok, so you started fresh nodes "side-by-side" and did not update the running nodes. This is better.

Do you think it would be beneficial to still try to track this down? Could it perhaps be this: 1) you migrated, but that didn't (fully?) work; 2) you started the node and got the error?

It could be a default workspace creation issue, but we have never seen that with pg. Perhaps some other migration failed on pg. But as you said, the migrations said it was done. I'm just thinking of a strategy for how to reproduce this.
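
If it helps, one cheap check from the database side is the schema_meta table that Kong's migrations record into (the same table dumped later in this thread); a rough sketch to see whether anything is still marked pending:

-- rough sketch: migration state per subsystem; non-null pending entries would point to an unfinished migration
SELECT subsystem, last_executed, pending
FROM schema_meta
ORDER BY subsystem;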

miketwenty1 commented 3 years ago

I brought up this issue for the benefit of Kong and to try to help; it's up to you whether you think it warrants being tracked down. My main concern is that this may happen again if we switch to another version of Kong.

Also note that at the time I believe we had 3 separate Kong clusters in different environments; all suffered the same issue, and all had the same fix (wiping the DB and starting over).

bungle commented 3 years ago

@miketwenty1,

I know, and that's why we are trying to figure out what is going on. But so far we have not been able to reproduce it, and we are running out of ideas. All new information is greatly appreciated. E.g. do you still have those environments or databases somewhere?

And what you did was:

  1. started a new version of kong on new node
  2. ran migrations up with the new version
  3. started new replacement nodes with a new version
  4. stopped the old nodes
  5. ran migrations finish with the new version

And what was the old version and what was the new one?

miketwenty1 commented 3 years ago

We didn't do step 4 (maybe this is why it failed for us?). We were attempting zero downtime and wanted the new node (v2.1.4) to work before taking down all of the older nodes (v1.5.0).

version 1.5.0 -> 2.1.4

Edit: Q: Do you still have those environments or databases somewhere? A: No, we don't have any of the old stuff from v1.5.0.

javierguerragiraldez commented 3 years ago

If I get this correctly, you managed to solve your problem and posted this issue to help other users. Since we weren't able to reproduce it, I'm closing this now. Feel free to reopen if you think my reading is wrong or if the problem persists.

jeremyjpj0916 commented 3 years ago

Weird... I just did a day-old DB rollback test (C* rollback to yesterday's data, no Kong version change or anything), and now on startup I get:

2021/09/29 21:07:22 [error] 45#0: init_by_lua error: ...luarocks/share/lua/5.1/kong/runloop/plugins_iterator.lua:388: table index is nil
stack traceback:
    ...luarocks/share/lua/5.1/kong/runloop/plugins_iterator.lua:388: in function 'new'
    ...cal/kong/luarocks/share/lua/5.1/kong/runloop/handler.lua:511: in function 'build_plugins_iterator'
    /usr/local/kong/luarocks/share/lua/5.1/kong/init.lua:519: in function 'init'
    init_by_lua:3: in main chunk
nginx: [error] init_by_lua error: ...luarocks/share/lua/5.1/kong/runloop/plugins_iterator.lua:388: table index is nil
stack traceback:
    ...luarocks/share/lua/5.1/kong/runloop/plugins_iterator.lua:388: in function 'new'
    ...cal/kong/luarocks/share/lua/5.1/kong/runloop/handler.lua:511: in function 'build_plugins_iterator'
    /usr/local/kong/luarocks/share/lua/5.1/kong/init.lua:519: in function 'init'
    init_by_lua:3: in main chunk

I checked whether migrations up or finish would do anything further, but they reported all was done...

/ $ kong version --vv
2021/09/29 21:11:16 [verbose] Kong: 2.4.1
2021/09/29 21:11:16 [debug] ngx_lua: 10019
2021/09/29 21:11:16 [debug] nginx: 1019003
2021/09/29 21:11:16 [debug] Lua: LuaJIT 2.1.0-beta3

schema meta:

kongdba@cqlsh> use kong_dev;
kongdba@cqlsh:kong_dev> select * from schema_meta;

 key         | subsystem             | executed                                                                                                                                                                                                                                          | last_executed  | pending
-------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+---------
 schema_meta |                   acl |                                                                                                                                                            {'000_base_acl', '001_14_to_15', '002_130_to_140', '003_200_to_210', '004_212_to_213'} | 004_212_to_213 |    null
 schema_meta |            basic-auth |                                                                                                                                                                                                           {'000_base_basic_auth', '001_14_to_15'} |   001_14_to_15 |    null
 schema_meta |                  core | {'000_base', '001_14_to_15', '002_15_to_1', '003_100_to_110', '004_110_to_120', '005_120_to_130', '006_130_to_140', '007_140_to_150', '008_150_to_200', '009_200_to_210', '010_210_to_211', '011_212_to_213', '012_213_to_220', '013_220_to_230'} | 013_220_to_230 |    null
 schema_meta |             hmac-auth |                                                                                                                                                                                                            {'000_base_hmac_auth', '001_14_to_15'} |   001_14_to_15 |    null
 schema_meta |                   jwt |                                                                                                                                                                              {'000_base_jwt', '001_14_to_15', '002_130_to_140', '003_200_to_210'} | 003_200_to_210 |    null
 schema_meta |              key-auth |                                                                                                                                                                         {'000_base_key_auth', '001_14_to_15', '002_130_to_140', '003_200_to_210'} | 003_200_to_210 |    null
 schema_meta |                oauth2 |                                                                                                                                         {'000_base_oauth2', '001_14_to_15', '002_15_to_10', '003_130_to_140', '004_200_to_210', '005_210_to_211'} | 005_210_to_211 |    null
 schema_meta |         rate-limiting |                                                                                                                                                     {'000_base_rate_limiting', '001_14_to_15', '002_15_to_10', '003_10_to_112', '004_200_to_210'} | 004_200_to_210 |    null
 schema_meta | response-ratelimiting |                                                                                                                                                                               {'000_base_response_rate_limiting', '001_14_to_15', '002_15_to_10'} |   002_15_to_10 |    null

And the workspaces table:

kongdba@cqlsh:kong_dev> select * from workspaces;

 id                                   | comment | config | created_at                      | meta | name
--------------------------------------+---------+--------+---------------------------------+------+---------
 374dc773-9bf3-482e-921f-9cd6b5ffe15d |    null |   null | 2020-08-14 05:28:21.000000+0000 | null | default

Example plugin:

 id                                   | api_id | cache_key                                                                                | config                                                                                                                                                                                                                                                         | consumer_id | created_at                      | enabled | name | protocols                          | route_id                             | run_on | service_id | tags | ws_id
--------------------------------------+--------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+---------------------------------+---------+------+------------------------------------+--------------------------------------+--------+------------+------+--------------------------------------
 718c6469-712b-4d72-8d02-ab88fcb32501 |   null | plugins:jwt:9db9377f-c67a-4a6d-ab56-2e2dda0ffd73::::374dc773-9bf3-482e-921f-9cd6b5ffe15d | {"secret_is_base64":false,"anonymous":"f6b0f623-8ce8-4b66-9fab-f3cb153ec444","key_claim_name":"iss","claims_to_verify":["exp"],"maximum_expiration":3600,"uri_param_names":["jwt"],"cookie_names":[],"run_on_preflight":true,"header_names":["authorization"]} |        null | 2021-08-23 14:37:15.000000+0000 |    True |  jwt | {'grpc', 'grpcs', 'http', 'https'} | 9db9377f-c67a-4a6d-ab56-2e2dda0ffd73 |   null |       null | null | 374dc773-9bf3-482e-921f-9cd6b5ffe15d

ws_id's are in sync.

jeremyjpj0916 commented 3 years ago

Ugh... https://github.com/Kong/kong/blob/2.4.1/kong/runloop/plugins_iterator.lua#L388 more of that not being able to derive the default workspace crap... oh boy wonder why this time.

Maybe related to similar workspace-id things from the past: https://github.com/Kong/kong/issues/6403 ... hmm, ghosts again.

jeremyjpj0916 commented 3 years ago

I added some debugging too:

2.4.1 runloop/plugins_iterator.lua

function PluginsIterator.new(version)
  if not version then
    error("version must be given", 2)
  end

  loaded_plugins = loaded_plugins or get_loaded_plugins()
  enabled_plugins = enabled_plugins or kong.configuration.loaded_plugins

  kong.log.err("Kong kong.default_workspace: " .. tostring(kong.default_workspace))
  kong.log.err("Kong workspaces.get_workspace_id(): " .. tostring(workspaces.get_workspace_id()))

  local ws_id = workspaces.get_workspace_id() or kong.default_workspace
  kong.log.err("Kong ws_id variable: " .. tostring(ws_id))
  local ws = {
    [ws_id] = new_ws_data()
  }

Output from the print statements, before the prior "table index is nil" stack trace occurs:

nginx: [error] [kong] plugins_iterator.lua:386 Kong kong.default_workspace: nil
2021/09/30 08:02:13 [error] 45#0: [kong] plugins_iterator.lua:387 Kong workspaces.get_workspace_id(): nil
nginx: [error] [kong] plugins_iterator.lua:387 Kong workspaces.get_workspace_id(): nil
2021/09/30 08:02:13 [error] 45#0: [kong] plugins_iterator.lua:390 Kong ws_id variable: nil
nginx: [error] [kong] plugins_iterator.lua:390 Kong ws_id variable: nil

The WS ID is set on all the associated C* cluster nodes. I ran a keyspace repair as well, but that didn't help either. We may want to make a totally separate issue from the original one if we don't have a quick idea. I swear we hit this kind of stuff before and you tweaked the WS ID lookup... why is it happening again, and only after I did a data restoration of the existing keyspace from a backup taken 2 days ago? I tried a backup from yesterday as well, same result. So odd...
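
Since select * shows the row but the startup path goes through db.workspaces:select_by_name("default"), it may also be worth querying by the exact name on each node; a rough cqlsh sketch (keyspace kong_dev as in the dumps above, ALLOW FILTERING only to make the ad-hoc query legal):

SELECT id, name FROM kong_dev.workspaces WHERE name = 'default' ALLOW FILTERING;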

@bungle any ideas here?

jeremyjpj0916 commented 3 years ago

And with that same logging code deployed into another network zone of the dev region, where we haven't done a DB restore from backup, the output on startup of course looks like this:

2021/09/30 08:01:03 [error] 45#0: [kong] plugins_iterator.lua:386 Kong kong.default_workspace: 7e175ef7-c50d-49d1-a648-0091156e2b4c
nginx: [error] [kong] plugins_iterator.lua:386 Kong kong.default_workspace: 7e175ef7-c50d-49d1-a648-0091156e2b4c
2021/09/30 08:01:03 [error] 45#0: [kong] plugins_iterator.lua:387 Kong workspaces.get_workspace_id(): nil
nginx: [error] [kong] plugins_iterator.lua:387 Kong workspaces.get_workspace_id(): nil
2021/09/30 08:01:03 [error] 45#0: [kong] plugins_iterator.lua:390 Kong ws_id variable: 7e175ef7-c50d-49d1-a648-0091156e2b4c
nginx: [error] [kong] plugins_iterator.lua:390 Kong ws_id variable: 7e175ef7-c50d-49d1-a648-0091156e2b4c

Which is fine, since the default workspace is found even though the workspaces.get_workspace_id() call returns nil. Why, in the region that had a DB backup restore, is Kong not seeing the ws_id data correctly, even though select * from workspaces loads fine on each node?

the-erwin commented 1 year ago

Hi,

While migrating our datastore from Cassandra to Postgres I encountered the same error:

2023/06/30 07:46:00 [error] 1#0: init_by_lua error: /usr/local/share/lua/5.1/kong/runloop/plugins_iterator.lua:458: table index is nil
stack traceback:
    /usr/local/share/lua/5.1/kong/runloop/plugins_iterator.lua:458: in function 'new'
    /usr/local/share/lua/5.1/kong/runloop/handler.lua:676: in function 'build_plugins_iterator'
    /usr/local/share/lua/5.1/kong/init.lua:643: in function 'init'
    init_by_lua:3: in main chunk

In my case I had exported the data from the workspaces table in Cassandra and imported it into Postgres; the culprit was spaces surrounding "default" in the name column of the workspaces table.

Postgres data causing Kong to fail to start (note the whitespace around the name):

kong=# select * from kong_int.workspaces;
 id                                   | name      | comment | created_at             | ...
 8380c691-0e23-40ed-863c-8573e8c61261 |  default  | null    | 2021-05-04 16:32:59+00 | ...

Postgres data letting Kong start correctly:

kong=# select * from kong_int.workspaces;
 id                                   | name    | comment | created_at             | ...
 8380c691-0e23-40ed-863c-8573e8c61261 | default | null    | 2021-05-04 16:32:59+00 | ...

Hope this helps anyone looking for a solution to this error ;-)
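
For anyone on Postgres who hits the same symptom, a rough sketch for spotting and trimming such whitespace (schema name kong_int taken from the example above; back up the table and adapt to your own schema before running the UPDATE):

-- find workspace rows whose name carries leading or trailing whitespace
SELECT id, name, length(name) AS name_len
FROM kong_int.workspaces
WHERE name <> trim(name);

-- trim the whitespace in place
UPDATE kong_int.workspaces
SET name = trim(name)
WHERE name <> trim(name);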