Kong / kong

🦍 The Cloud-Native API Gateway and AI Gateway.
https://konghq.com/install/#kong-community
Apache License 2.0

Default db_update_frequency config value seems to break cache when db dies #3068

Closed scooper91 closed 5 years ago

scooper91 commented 6 years ago

Hi!

I asked in the Kong Gitter chat about this yesterday (30/11) and was advised to open an issue here.

Summary

When no override for db_update_frequency is provided in kong.conf, killing the underlying database results in an instant 404 when hitting a cached route. Expected behaviour would be for the cache TTL to be honoured.

Kong maintains a cache of this data so that there is no need for a database round-trip while proxying requests

When overriding db_update_frequency in kong.conf, killing the underlying database results in the cache TTL being honoured.
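For concreteness, a sketch of the override in question (the values mirror the debug output further down; the comments are my reading of what these settings do, not official documentation):

```
# kong.conf (excerpt) -- values taken from the debug logs below
database = postgres
db_update_frequency = 10    # seconds between polls of the DB for invalidation events
db_cache_ttl = 20           # seconds a cached entity is considered valid
```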

Steps To Reproduce

  1. Using docker-compose to spin up a PostgreSQL database, and a Kong-router instance (using official images)
  2. Add route to proxy any URL
  3. Do not override db_update_frequency
  4. Start both containers
  5. Hit the route (to warm the cache)
  6. Stop (docker stop) the database container
  7. Hit the same route again

Results in 404 as soon as database container is stopped.

Expected behaviour: Route is still cached (for 60 minutes).
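The steps above can be sketched as a minimal docker-compose file (service names, image tags, and environment variables here are assumptions for illustration, not the reporter's exact setup):

```yaml
# docker-compose.yml -- illustrative reproduction setup
version: '2'
services:
  kong-database:
    image: postgres:9.5
    environment:
      POSTGRES_USER: kong
      POSTGRES_DB: kong
  kong:
    image: kong:0.11.1
    depends_on:
      - kong-database
    environment:
      KONG_DATABASE: postgres
      KONG_PG_HOST: kong-database
    ports:
      - "8000:8000"   # proxy
      - "8001:8001"   # admin API
```

After warming the cache with a request through port 8000, `docker stop` the `kong-database` container and repeat the request to observe the 404.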

Additional Details & Logs

Using Kong:0.11.1 official Docker container.

The endpoint is not configured using any plugins.

Using default db_update_frequency (not set in kong.conf)

[root@60cabb76bbbf /]# kong start -vv
2017/12/01 12:25:07 [verbose] Kong: 0.11.1
2017/12/01 12:25:07 [debug] ngx_lua: 10008
2017/12/01 12:25:07 [debug] nginx: 1011002
2017/12/01 12:25:07 [debug] Lua: LuaJIT 2.1.0-beta2
2017/12/01 12:25:07 [verbose] reading config file at /etc/kong/kong.conf
...
2017/12/01 12:25:07 [debug] database = "postgres"
2017/12/01 12:25:07 [debug] db_cache_ttl = 20
2017/12/01 12:25:07 [debug] db_update_frequency = 5

Overriding db_update_frequency

[root@16637ce1b636 /]# kong start --vv
2017/12/01 12:28:05 [verbose] Kong: 0.11.1
2017/12/01 12:28:05 [debug] ngx_lua: 10008
2017/12/01 12:28:05 [debug] nginx: 1011002
2017/12/01 12:28:05 [debug] Lua: LuaJIT 2.1.0-beta2
2017/12/01 12:28:05 [verbose] reading config file at /etc/kong/kong.conf
...
2017/12/01 12:28:05 [debug] database = "postgres"
2017/12/01 12:28:05 [debug] db_cache_ttl = 20
2017/12/01 12:28:05 [debug] db_update_frequency = 10

We're also seeing different behaviour if we use docker kill rather than docker stop to kill the db container, but we can create a separate issue if necessary.

Let me know if you need any more details.

Thanks!

(@danjamesmay)

bungle commented 6 years ago

@scooper91 thank you for the report. Let us investigate this a bit.

thibaultcha commented 6 years ago

@scooper91 Hi,

Thanks for reporting this. I have investigated, reproduced, and narrowed down the issue. Unfortunately, we'll need some time to provide a complete fix. The cache is per NGINX worker, and if some workers are never woken up to process requests while the database is up and running, they eventually will be once the database is unreachable, except that by then they won't be able to retrieve the updated list.

For now, know that Kong will be more resilient to such database failure scenarios if you stop and restart Kong once your APIs are added. I will write a more thorough explanation once I have some more time, or maybe a fix.

thibaultcha commented 6 years ago

When I reproduced this on my side, it seemed to boil down to the worker-based nature of NGINX. Each worker maintains its own router in its own Lua VM, and this router is lazily built (upon the first request) and rebuilt upon configuration changes (invalidation events from other workers or nodes in the cluster).

Since the kernel will distribute workload across the different workers, different requests can (and will) be handled by different workers at times, which means hitting different Lua VMs.

Now let's take a detour into how the router is built: as of today, we select all APIs at once (making the router build itself incrementally is still on our TODO list), and each worker will query the database on its own and directly build its router in its Lua VM. Usually, we store such database-fetched values in a shared memory zone to avoid cache stampede effects on the database, but we do not in that one particular case, to save shared memory and because we initially considered "router rebuilds" to be occasional.

But that is precisely the issue: start a fresh Kong node without a router built. Make a few requests that are handled by one worker (which builds its router). Stop the database. Make a few requests that are handled by another worker -> oops, the database is unreachable. Building the router when starting Kong fixes this issue, but it does not fix cache invalidation rebuilds.
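The failure mode above can be modelled with a toy simulation (Python purely for illustration; Kong itself is Lua/OpenResty, and the names `Worker`, `Database`, and `fetch_routes` are invented for this sketch):

```python
# Toy model of NGINX's per-worker routers: each worker lazily builds its
# router from the database on the first request it handles, so a worker
# that has never served a request has no router to fall back on.

class DatabaseDown(Exception):
    pass

class Database:
    def __init__(self):
        self.up = True
        self.routes = {"/cached": "upstream-a"}

    def fetch_routes(self):
        if not self.up:
            raise DatabaseDown("database unreachable")
        return dict(self.routes)

class Worker:
    def __init__(self, db):
        self.db = db
        self.router = None  # built lazily, per worker

    def handle(self, path):
        if self.router is None:
            # First request this worker sees: query the DB, build the router.
            self.router = self.db.fetch_routes()
        return self.router.get(path, "404")

db = Database()
w1, w2 = Worker(db), Worker(db)  # two NGINX workers, separate Lua VMs

print(w1.handle("/cached"))  # worker 1 warms its router while the DB is up

db.up = False  # the database container is stopped

print(w1.handle("/cached"))  # worker 1 still routes from its built router
try:
    w2.handle("/cached")     # worker 2 never built its router -> must hit the DB
except DatabaseDown:
    print("worker 2 fails: its router was never built")
```

This is why requests succeed or fail depending on which worker the kernel hands them to, and why building the router at startup papers over the problem.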

Hope that helps anybody who is investigating this issue or thinking of a fix - PRs welcome!

bungle commented 5 years ago

I think this PR attempts to fix most of the issues around this: #4102 (it does not implement incremental router rebuilding).

thibaultcha commented 5 years ago

This should have been fixed already by 2 changes:

thibaultcha commented 5 years ago

Oh, I overlooked something, which is that router rebuilds bypassed the cache and always used a 0 TTL value since day one... One of the "quick and easy" solutions I suggested back then was to store Routes and Services in the cache to benefit from those 2 points above, but @bungle's solution is more valid, even without incremental router rebuilds.

Sorry for the noise!

bungle commented 5 years ago

We have reworked this quite a bit with 1.2.x release, so I think I will close this for now. There is also https://github.com/Kong/kong/pull/4903 that will make Kong even more resilient.