When Kong starts, it initializes the lua-resty-worker-events library for cross-worker events. This is done in the init_worker phase. That library also posts an event by itself, a "started" event. In this case there are 8 workers, so there are 8 "started" events, and usually they get IDs 1 through 8.
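For illustration, this is roughly how a handler for those "started" events would be registered with the library (the source/event names follow the lua-resty-worker-events README as I recall it; treat them as an assumption to double-check against the version in use):

local worker_events = require "resty.worker.events"

-- callbacks receive (data, event, source, pid) per the library's register() API
local function on_started(data, event, source, pid)
  ngx.log(ngx.NOTICE, "worker-events 'started' event from worker pid ", pid)
end

-- "resty-worker-events" is the source the library uses for its own events
worker_events.register(on_started, "resty-worker-events", "started")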
As we can see in the logs, those are the ones that get lost. Since Kong itself does not use those, this ends up being harmless.
The question is: why does this happen?
The events library is set to poll every 1 second, and retain events for 5 seconds. See https://github.com/Kong/kong/blob/0.13.1/kong/init.lua#L212-L213
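For reference, a rough sketch of what that configure call looks like (option names are taken from the lua-resty-worker-events README; the shm name and exact values are assumptions based on the linked Kong 0.13.1 source):

local worker_events = require "resty.worker.events"

local ok, err = worker_events.configure {
  shm      = "kong_process_events", -- lua_shared_dict backing the event queue
  interval = 1,                     -- poll for new events every second
  timeout  = 5,                     -- event data lives in the shm for 5 seconds
}
if not ok then
  ngx.log(ngx.ERR, "failed to configure worker events: ", err)
end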
Polling for events is done in a timer. The problem occurs because, while the timer gets scheduled in the init_worker phase, it does not actually run there: the init_worker phase cannot yield, hence timers will only be executed once the phase has completed.
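A minimal illustration (not Kong's actual code) of why the timer cannot help here, written as it would run during init_worker:

-- the callback below cannot fire until the phase handler returns, because
-- init_worker runs to completion without yielding back to the Nginx event loop
local ok, err = ngx.timer.at(0, function(premature)
  if premature then
    return
  end
  -- by the time this executes, init_worker has already finished; if that took
  -- longer than the 5-second retention window, the "started" events have
  -- already expired from the shm and polling reports them as dropped
  ngx.log(ngx.NOTICE, "poll for events would happen here")
end)
if not ok then
  ngx.log(ngx.ERR, "failed to create timer: ", err)
end

-- any slow, blocking work that follows here in init_worker delays the timer
-- above by exactly that amount of time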
Timestamps from the logs: 18:45:53 and 18:45:58/59.
So the initialization takes more than 5 seconds, and hence some of the events will have been lost by the time the timer finally executes the poll for events. That is why we see the "dropping event" errors.
So far I've only seen it with Cassandra as a datastore, so my guess is that the initialization of the DAO is the culprit behind the slow init_worker phase.
Possible solutions would have to deal with the slow init_worker phase.
@thibaultcha any other ideas?
@Tieske I think the explanation is on point and makes sense, but I do not see how the init worker could take more than 5s to execute without yielding...
There is no initialization with Cassandra in init_worker; that is used by the PostgreSQL strategy for its TTL implementation.
This is going to be hard to debug without a reproducible case... Luckily it is not harmful.
Oh, it just struck me that init_worker does not yield anyway. It is interesting that the last two workers do not observe this error as well.
If either of you would like me to drop a modified core file into my local build, I can let you know the behavior in our dev environment and tell you if anything "fixes" the problem. Just show me a diff 😄. Then again, if it's truly harmless I don't want you to waste your time on it; I just figured that if it happens for others, getting these statements to stop will save people from asking the same thing again later.
A culprit I have in mind so far would be some blocking I/O somewhere (e.g. using LuaSocket), thus preventing the nginx event loop from updating its time and never firing the timer until the init_worker phase is over. But that is just a theory.
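A hypothetical sketch of the kind of blocking call meant above (the host, port, and timeout are made up for illustration):

-- LuaSocket performs plain blocking I/O, unlike OpenResty cosockets; while a
-- call like this is in flight the worker process does nothing else: the event
-- loop does not update its cached time and pending timers do not run
local socket = require "socket"

local client = socket.tcp()
client:settimeout(10)                       -- worst case, this blocks for 10 seconds
client:connect("cassandra.internal", 9042)  -- hypothetical contact point
-- ...blocking reads/writes would go here...
client:close()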
@jeremyjpj0916 Do you have any plugin executing code in its init_worker phase?
@thibaultcha Sorry to be dense on the matter, but can you give me an example plugin Kong currently has that does so? What am I looking for exactly? I run a global variant of the http-log plugin Kong offers; I just format the data to go into Splunk, so no real difference there. I also use the unmodified statsd plugin Kong ships, globally. And I have noticed lines of code in the statsd plugin that had me scratching my head, like so:
https://github.com/Kong/kong/blob/master/kong/plugins/statsd/handler.lua
local function log(premature, conf, message)
  if premature then
    return
  end
Is this code related to ensuring it does not execute until a certain phase has been reached?
Note that our Kong nodes do have load balancers in front of them with health checks calling into Kong, so maybe one of those global log plugins getting triggered too early causes this? My health-check endpoint is just a dummy proxy with the request-termination plugin enabled on it to return 200s.
@jeremyjpj0916 the code you are referring to is a callback for a timer. The premature flag indicates whether the timer ran prematurely, which only happens when the Nginx worker is exiting (basically it says "I'm not executing this timer, but cancelling it because of a shutdown/reload")
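For illustration, the generic shape of such a timer callback (the extra arguments here simply mirror the statsd snippet above; the values are made up):

local function log_handler(premature, conf, message)
  if premature then
    -- the worker is exiting: the timer is being cancelled, not executed
    return
  end
  -- normal deferred work goes here, e.g. shipping the metric or log line
end

-- extra arguments after the delay and callback are passed through to it
local conf, message = { host = "127.0.0.1", port = 8125 }, "example metric"
local ok, err = ngx.timer.at(0, log_handler, conf, message)
if not ok then
  ngx.log(ngx.ERR, "failed to create timer: ", err)
end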
@jeremyjpj0916
@Tieske and I had a chat and talked about this issue. @Tieske will make changes to the worker-events library that should make those errors disappear. Blocking I/O within init_worker is considered fine by us (the core and/or plugins should be able to do it, and many already do), so we believe the fix has to be in the library instead.
Sorry to be dense on the matter, but can you give me an example plugin Kong currently has that does so?
I am not sure if any of the open source plugins do so already, but if you have a custom plugin that implements the init_worker phase, and that plugin does a request or accesses the database via the DAO, then that is what I am referring to. Do you? If not, then it might be I/O done by the core. But regardless, we very much intend to preserve that possibility.
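To make that concrete, a hypothetical custom plugin handler that implements init_worker (the shape follows Kong 0.13-era plugins extending BasePlugin; the plugin name and the work it does are made up):

local BasePlugin = require "kong.plugins.base_plugin"

local MyPluginHandler = BasePlugin:extend()

function MyPluginHandler:new()
  MyPluginHandler.super.new(self, "my-plugin")
end

function MyPluginHandler:init_worker()
  MyPluginHandler.super.init_worker(self)
  -- any I/O performed here (an HTTP request, a DAO/database lookup, ...) runs
  -- inside the init_worker phase and adds to the delay discussed above
end

return MyPluginHandler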
None of the plugins we built use the DB, nor am I familiar with how to write a plugin that implements the init_worker phase. Generally, if I write a plugin I pick one of the Kong open source plugins with a somewhat similar codebase and modify from there, so it's doubtful; as I learn the application better, understanding what phase plugins run in and how to utilize the different phases will probably come in handy. So I am guessing it may be I/O done by the core. I also have Kubernetes readiness probes with exec health checks calling the kong health CLI command; I generally notice the health-check call fails twice before Kong is officially up with kong health returning (the command is considered successful if its exit code is 0):
readinessProbe:
  exec:
    command:
      - kong
      - health
  failureThreshold: 2
  initialDelaySeconds: 20
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 3
Maybe it's the readiness probe calls doing it via CLI commands? I tried studying https://github.com/Kong/kong/blob/master/kong/cmd/health.lua but I don't see anything there specifically doing what you mention would trigger it, @thibaultcha. It would be easy to test and reproduce locally, though, if Kong runs a Kubernetes cloud testing env and sets up a readiness probe that hits the kong health CLI command before Kong is up and healthy. One thing to note, because I am sure you are thinking there is no way Kong takes 20 seconds to boot up: I had to push that initial delay out because our Dynatrace agent also takes some extra time to fire up initially; otherwise my readiness probes kept cutting my pods in a constant redeploy loop, thinking the pod was in an error state.
Tagging this as "bug" for bug tracker organizational purposes, even though this is pretty harmless and will be fixed in the worker-events library. This will be tracked in kong/lua-resty-worker-events#9 but for the mean time, leaving this open until the dependency bump with the fix lands in Kong next
.
Commented on the PR but I fully believe this to be resolved once merged 👍 .
Fixed with #3443, thanks for reporting @jeremyjpj0916!
Summary
Seeing some errors on Kong startup, briefly, for a second. It does not seem too concerning, but I brought it up in the forum and Tieske wanted full logs, so here you have them:
We run Kong in 2 DCs with a Cassandra cluster. Kong runs fine after those prints have occurred, no issue, but I thought to bring it up in case there may be something Kong needs to optimize or tweak in the phases of the app.
Original forum post: https://discuss.konghq.com/t/any-cause-for-concerns-here/879
@Tieske Let me know if you would like any further info!