Closed: cb372 closed this issue 8 years ago.
Just a few questions to better understand the context: how did you install Kong, and did you change anything in the nginx configuration property in kong.yml?

It's the Kong .deb package, running on Ubuntu Vivid. Nginx config is unchanged; the only thing I touched in kong.yml is the Cassandra hostnames.
I've added the key-auth and mashape-analytics plugins via the API. I'm also using rate limiting. There are a few thousand consumers registered, each with a rate limit and one or two keys.
Let me know if there's any more helpful info I can provide.
It seems likely to me that the analytics plugin cannot buffer the amount of data at this pace and is thus causing the crash. This is similar to openresty/lua-nginx-module#148 and openresty/lua-nginx-module#230, which propose ways of investigating the issue.
For now:
- check how often you are emptying the buffers (conf.delay and conf.batch_size), and eventually consider lowering those values so the buffers are flushed more often (see the sketch below);
- check whether batches are being sent from Kong to Galileo: do you see logs in error.log saying so, and data in your Galileo dashboard?

I am not sure it is the reason, but it is what seems most likely to me.
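To illustrate the first point, here is a minimal sketch in plain Lua (not the plugin's actual code) of how a delay/batch_size buffer behaves; lowering either value makes a flush trigger sooner, so less data accumulates in memory between flushes:

```lua
-- Illustrative sketch only: a buffer that is flushed when it reaches
-- batch_size entries OR when delay seconds have elapsed since the last
-- flush, whichever comes first.
local Buffer = {}
Buffer.__index = Buffer

function Buffer.new(batch_size, delay, flush_fn)
  return setmetatable({
    batch_size = batch_size, -- e.g. 100 (the default mentioned above)
    delay      = delay,      -- e.g. 2 seconds (the default mentioned above)
    flush_fn   = flush_fn,   -- callback that actually sends the batch
    entries    = {},
    last_flush = os.time(),
  }, Buffer)
end

function Buffer:add(entry)
  table.insert(self.entries, entry)
  -- Lowering batch_size or delay makes either condition trigger sooner.
  if #self.entries >= self.batch_size
     or os.time() - self.last_flush >= self.delay then
    self:flush()
  end
end

function Buffer:flush()
  if #self.entries > 0 then
    self.flush_fn(self.entries)
    self.entries = {}
  end
  self.last_flush = os.time()
end
```

With the defaults mentioned below (100 events, 2 seconds), a flush happens at whichever limit is reached first.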
I am not sure this can be qualified as a bug, since memory limitations are, granted, optimizable, but ultimately quite unavoidable. This needs more investigation (especially finding out whether it can be optimized) before being qualified as such. The last question (do you see batches being sent from Kong to Galileo? Do you see logs in error.log saying so, and data in your Galileo dashboard?) is definitely crucial to finding out the cause of the problem here.
If we confirm the problem is the Galileo in-memory batch growing too big, this could be qualified as an enhancement for the plugin, and a solution could be dumping the buffer to disk when it gets bigger than a specific size (similar to how nginx buffers request/response bodies to a file when they are too big).
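For what it's worth, a hypothetical Lua sketch of what such disk spooling could look like (the byte limit and spool path are assumptions, not an existing Kong feature):

```lua
-- Hypothetical sketch of the enhancement described above: keep batches in
-- memory up to a byte limit, and spill anything beyond that to a file.
local MAX_IN_MEMORY_BYTES = 10 * 1024 * 1024   -- assumed 10 MB cap
local SPOOL_DIR = "/tmp/analytics-spool"       -- assumed, must already exist

local queue = { batches = {}, bytes = 0, spooled = 0 }

local function enqueue(batch_json)
  if queue.bytes + #batch_json <= MAX_IN_MEMORY_BYTES then
    table.insert(queue.batches, batch_json)
    queue.bytes = queue.bytes + #batch_json
  else
    -- Spill to disk, similar to how nginx writes large request/response
    -- bodies to a temporary file when they exceed its in-memory buffers.
    queue.spooled = queue.spooled + 1
    local path = string.format("%s/batch-%d.json", SPOOL_DIR, queue.spooled)
    local f = assert(io.open(path, "w"))
    f:write(batch_json)
    f:close()
  end
end
```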
@thibaultCha My theory at the moment is, yes, it's the analytics plugin. I turned it off for a few hours today and didn't see a single error of this kind. I turned it on again just before I went home, so I'll check the logs again tomorrow and see if the errors came back.
are you performing requests with bodies and are those being logged by the analytics plugin?
No, body logging is disabled, and the requests are all GETs anyway
how often are you emptying the buffers (conf.delay and conf.batch_size)?
100 events and 2 seconds (i.e. the default values)
how many APIs are you hitting in your tests?
Just 1 API
do you see batches being sent from Kong to Galileo? Do you see logs in error.log saying so, and data in your Galileo dashboard?
I don't have Kong's debug logs enabled, so it's hard to see exactly when it is sending batches. The Galileo dashboard does show data, but it has huge gaps of an hour or so at a time.
If I'm reading the source of the analytics plugin right, the size of a single batch is limited but the queue of batches waiting to be sent is unbounded. This would explain the out-of-memory issue, if for any reason Kong was temporarily unable to connect to the Analytics server.
@thefosk 's suggestion of spooling to disk sounds good, but it also sounds quite complex and thus unlikely to be implemented for a while. As a first step I would be quite happy with a simple bound on the size of the waiting-to-send queue, dropping a batch whenever that bound is reached (sketched below).
I'm much happier losing analytics data for 100 events in a controlled manner than having Kong run out of memory, blow up without warning, and lose the data for the entire waiting-to-send queue.
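Something along these lines is what I have in mind (a Lua sketch under the assumption of a configurable cap, not a patch against the actual buffer code):

```lua
-- Illustrative sketch of the bound being asked for: cap the number of
-- queued batches and drop the oldest one when the cap is reached, so
-- memory stays bounded even if Galileo is slow or unreachable.
local MAX_QUEUED_BATCHES = 10   -- assumed to be configurable

local function enqueue_batch(queue, batch)
  if #queue >= MAX_QUEUED_BATCHES then
    -- Losing one batch (100 events) in a controlled way beats an
    -- out-of-memory crash that loses the whole queue.
    table.remove(queue, 1)
  end
  table.insert(queue, batch)
end
```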
Even if you don't have debug logs enabled, the error.log file will contain error logs when batches are not acknowledged by Galileo. It would be very useful here to know whether any of those happen, and what status code is returned by Galileo; that information is available in the logs. Currently, enqueued batches are dropped if Galileo responds with a 400 status code, but this might have been changed on Galileo's side and other status codes could now be used, causing an invalid batch not to be dropped. This could be one reason.
Having a limit on the queue of batches is also a possibility, but it might not be needed if the error is coming from un-dropped invalid batches.
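To make the distinction concrete, an illustrative Lua sketch of status-based batch retention (an assumption about the logic being described, not the plugin's current code):

```lua
-- Illustrative sketch: drop batches the collector reports as invalid, and
-- only keep batches for retry when the failure looks transient.
local function on_collector_response(queue, status)
  if status == 200 then
    table.remove(queue, 1)        -- acknowledged: remove from the queue
  elseif status == 400 or status == 207 then
    -- Invalid (or partially rejected) batch: retrying it will never
    -- succeed, so drop it rather than letting it pile up in memory.
    table.remove(queue, 1)
  else
    -- e.g. a 504: transient failure, keep the batch for a later retry,
    -- which is exactly why a cap on the queue is still needed.
  end
end
```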
I've just checked error.log again and found some errors coming from the plugin.
There are quite a few of these per day, but far fewer than the number of "Lua VM crashed, reason: not enough memory" logs.
2015/11/28 00:16:30 [error] 8894#0: [lua] buffer.lua:191: [mashape-analytics] socket server could not save the batch (100 ALFs 301424 bytes). Status: (504) Error: (), context: ngx.timer, client: 54.170.138.214, server: 0.0.0.0:8000
2015/11/28 00:54:33 [error] 8894#0: [lua] buffer.lua:191: [mashape-analytics] socket server could not save the batch (100 ALFs 313776 bytes). Status: (504) Error: (), context: ngx.timer, client: 54.216.32.247, server: 0.0.0.0:8000
2015/11/28 02:18:39 [error] 8894#0: [lua] buffer.lua:191: [mashape-analytics] socket server could not save the batch (100 ALFs 313838 bytes). Status: (504) Error: (), context: ngx.timer, client: 54.78.132.159, server: 0.0.0.0:8000
These also appeared, but only 8 of them over one 5 minute period:
2015/11/30 07:43:54 [error] 14338#0: [lua] buffer.lua:191: [mashape-analytics] socket server could not save the batch (100 ALFs 303971 bytes). Status: (207) Error: (Valid ALFs: 0/100 saved
), context: ngx.timer, client: 54.78.132.159, server: 0.0.0.0:8000
2015/11/30 07:44:10 [error] 14338#0: [lua] buffer.lua:191: [mashape-analytics] socket server could not save the batch (100 ALFs 299382 bytes). Status: (207) Error: (Valid ALFs: 0/100 saved
), context: ngx.timer, client: 54.216.251.254, server: 0.0.0.0:8000
2015/11/30 07:44:41 [error] 14338#0: [lua] buffer.lua:191: [mashape-analytics] socket server could not save the batch (100 ALFs 304055 bytes). Status: (207) Error: (Valid ALFs: 0/100 saved
), context: ngx.timer, client: 54.155.39.94, server: 0.0.0.0:8000
Is it possible that the Analytics server is not returning (many) error responses but is instead taking a long time to respond to the requests? That would be consistent with the behaviour we are seeing, i.e. out-of-memory errors + not many error logs. The 504 responses being returned also suggest that the Analytics server might be struggling to handle the load.
Sorry to repeat myself, but I strongly recommend having a bound on that queue. Relying on the HTTP response code is not enough - if the analytics server responds slowly (as I suspect it is doing), or if it goes down for a few minutes, as it inevitably will now and again, then the queue will start to grow.
It's a simple matter of fault isolation: auxiliary functionality such as the analytics plugin should not be able to cause any adverse effect on the main functionality of Kong, no matter what happens. And an out-of-memory error crashing the whole process is the mother of all adverse effects :)
I would really appreciate a quick fix for this. I was planning to put Kong into production this week, but I won't be able to do that, at least not with the analytics plugin enabled, until this issue is fixed. And since we are paying for Galileo, disabling the analytics would be a real shame.
I said "it might" not be needed, in the sense that the issue could have not existed if Kong was handling 207
. However, yes, it was already my intention to implement a cap on that queue after your first suggestion.
I have done so, along with other improvements to the buffer, in #757, which is open against master as a hotfix. I am hoping to finish testing it soon, so we can release a new version (0.5.4) in the coming days.
Awesome, thanks for the quick response!
I believe we can close this now that #757 is merged and 0.5.4 is coming (#767). Thanks for reporting.
I'm running Kong 0.5.3 on c4.large EC2 boxes, putting a reasonable amount of traffic (about 120 req/s per Kong machine) through it. CPU and memory usage seem pretty low, but I'm getting messages like the following in Kong's error log. Is this a sign that Kong is overloaded? If so, do I need to spin up more machines? Or is there some tuning I can do in kong.yml, e.g. to give more memory to nginx?

There is loads of free memory available on the box:
The nginx processes look like this in top: