Closed scooper91 closed 5 years ago
@scooper91 thank you for the report. Let us investigate this a bit.
@scooper91 Hi,
Thanks for reporting this. I have investigated, reproduced, and narrowed down the issue. Unfortunately, we'll need some time to provide a complete fix for this. The cache is per NGINX worker, and if some workers never get woken up to process a request while the database is up and running, they eventually will once the database is unreachable, except that they won't be able to retrieve the update list.
For now, know that Kong will be more resilient to such database failure scenarios if you stop and restart Kong once your APIs are added. I will write a more thorough explanation once I have some more time, or maybe a fix.
When I reproduced this on my side, it seemed to boil down to the worker-based nature of NGINX. Each worker maintains its own router in its own Lua VM, and this router is lazily built (upon the first request) and rebuilt upon configuration changes (invalidation events from other workers or nodes in the cluster).
Since the kernel will distribute workload across the different workers, different requests can (and will) be handled by different workers at times, which means hitting different Lua VMs.
Now let's take a detour into how the router is built: as of today, we select all APIs at once (making the router build itself incrementally is still on our TODO list), and each worker queries the database on its own and builds its router directly in its Lua VM. Usually, we store such database-fetched values in a shared memory zone to avoid cache stampede effects on the database, but we do not in this particular case, to save shared memory and because we initially considered "router rebuilds" to be occasional.
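The stampede-avoiding pattern described above (fetch once into a shared zone, then let every worker read from it) can be sketched like this. This is an illustrative Python model, not Kong's actual code; the names are made up for the example:

```python
import threading

class SharedCache:
    """Toy analogue of an NGINX shared memory zone (lua_shared_dict):
    one fetch populates the cache, all other 'workers' reuse it."""
    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}
        self.db_hits = 0  # how many times the database was actually queried

    def get_or_fetch(self, key, fetch):
        with self._lock:          # only one worker performs the fetch
            if key not in self._data:
                self._data[key] = fetch()
                self.db_hits += 1
            return self._data[key]

cache = SharedCache()

def fetch_apis():
    # stands in for "SELECT all APIs from the database"
    return ["api-1", "api-2"]

# Ten concurrent "workers" all ask for the API list at once:
threads = [threading.Thread(target=cache.get_or_fetch, args=("apis", fetch_apis))
           for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(cache.db_hits)  # 1 -- the database was queried once, not ten times
```

Because the router build bypasses this pattern, each worker hits the database directly instead, which is what makes the failure below possible.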
But that is precisely the issue: start a fresh Kong node without a router built. Make a few requests that are handled by one worker (which builds its router). Stop the database. Make a few requests that are handled by another worker -> oops, the database is unreachable. Building the router when starting Kong fixes this issue, but it does not fix cache invalidation rebuilds.
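The failure sequence above can be modelled in a few lines. This is a toy sketch with invented names, not Kong's implementation (and the reporter observed a 404 rather than a raised error):

```python
class Database:
    def __init__(self):
        self.up = True

    def fetch_all_apis(self):
        if not self.up:
            raise ConnectionError("database unreachable")
        return {"/users"}

class Worker:
    """Each NGINX worker lazily builds its own router in its own Lua VM."""
    def __init__(self, db):
        self.db = db
        self.router = None            # built lazily, on the first request

    def handle(self, path):
        if self.router is None:
            self.router = self.db.fetch_all_apis()  # queries the DB directly
        return 200 if path in self.router else 404

db = Database()
w1, w2 = Worker(db), Worker(db)       # two workers, two separate "Lua VMs"

print(w1.handle("/users"))            # 200: worker 1 builds its router while the DB is up
db.up = False                         # stop the database
print(w1.handle("/users"))            # 200: worker 1's router is already built
try:
    w2.handle("/users")               # worker 2 must build its router now...
except ConnectionError as e:
    print(e)                          # ...but the database is unreachable
```

Eagerly building the router in every worker at startup avoids the cold-worker case, but a rebuild triggered by an invalidation event would still have to reach the database.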
Hope that helps anybody who is investigating this issue or thinking of a fix - PRs welcome!
I think this PR is an attempt to fix most of the issues around this: #4102 (it does not implement incremental router rebuilding).
This should have been fixed already by 2 changes: defaulting `db_cache_ttl` to `0` (https://github.com/Kong/kong/pull/3492), which always was the long-term goal (`db_cache_ttl` was introduced as a safety measure against stale values when we first rolled out the new DB caching implementation). Now, the cache does not expire anymore unless changes are made through the Admin API. It is worth noting that one can already safely make this change to this config value on their side, without upgrading their Kong node to a new version or anything.

Oh, I overlooked something, which is that router rebuilds bypassed the cache and always used a `0` TTL value since day one... One of the "quick and easy" solutions I suggested back then was to store Routes and Services in the cache to benefit from those 2 points above, but @bungle's solution is more valid, even without incremental router rebuilds.
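For operators who want that behaviour today, the change mentioned above is a one-line `kong.conf` edit. A minimal sketch (assuming a Kong version where `db_cache_ttl` is available):

```
# kong.conf
# 0 = cached entities never expire; they are only invalidated
# when changes are made through the Admin API
db_cache_ttl = 0
```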
Sorry for the noise!
We have reworked this quite a bit with the 1.2.x release, so I think I will close this for now. There is also https://github.com/Kong/kong/pull/4903, which will make Kong even more resilient.
Hi!
I asked in the Kong Gitter chat about this yesterday (30/11) and was advised to open an issue here.
Summary
When no override for `db_update_frequency` is provided in `kong.conf`, killing the underlying database results in an instant 404 when hitting a cached route. Expected behaviour would be for the cache TTL to be honoured.

When overriding `db_update_frequency` in `kong.conf`, killing the underlying database results in the cache TTL being honoured.

Steps To Reproduce
1. Use the default `db_update_frequency`
2. Stop (`docker stop`) the database container

Results in a 404 as soon as the database container is stopped.
Expected behaviour: Route is still cached (for 60 minutes).
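For anyone reproducing this, the override mentioned in the summary is a one-line `kong.conf` change. The value below is purely illustrative; the report does not state which value was used:

```
# kong.conf -- example override; the exact value used in the report is not stated
# (frequency, in seconds, at which the node polls the database for updates)
db_update_frequency = 300
```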
Additional Details & Logs
Using the kong:0.11.1 official Docker container.
The endpoint is not configured using any plugins.
Using the default `db_update_frequency` (not set in `kong.conf`)

Overriding `db_update_frequency`
We're also seeing different behaviour if we use `docker kill` rather than `docker stop` to kill the db container, but we can create a separate issue if necessary.

Let me know if you need any more details.
Thanks!
(@danjamesmay)