Kong / kong

🦍 The Cloud-Native API Gateway and AI Gateway.
https://konghq.com/install/#kong-community
Apache License 2.0

Lua VM crashed, reason: not enough memory #750

Closed: cb372 closed this issue 8 years ago

cb372 commented 8 years ago

I'm running Kong 0.5.3 on c4.large EC2 boxes, putting a reasonable amount of traffic (about 120 req/s per Kong machine) through it. CPU and memory usage seem pretty low, but I'm getting messages like the following in Kong's error log.

Is this a sign that Kong is overloaded? If so, do I need to spin up more machines? Or is there some tuning I can do in kong.yml, e.g. to give more memory to nginx?

nginx: lua atpanic: Lua VM crashed, reason: not enough memory
*** stack smashing detected ***: nginx: worker process terminated
2015/11/27 15:47:41 [alert] 1205#0: worker process 23453 exited on signal 6 (core dumped)
nginx: lua atpanic: Lua VM crashed, reason: not enough memory
2015/11/27 16:15:14 [error] 28058#0: [lua] buffer.lua:53: [mashape-analytics] failed to create delayed batch sending timer: process exiting, context: ngx.timer, client: 54.216.32.247, server: 0.0.0.0:8000
2015/11/27 16:15:14 [error] 28058#0: [lua] init_worker.lua:12: create_timer(): [reports] failed to create timer: process exiting, context: ngx.timer
nginx: lua atpanic: Lua VM crashed, reason: not enough memory

There's plenty of free memory available on the box:

ubuntu@ip-10-248-76-173:~$ free -h
             total       used       free     shared    buffers     cached
Mem:          3.7G       1.1G       2.5G       6.0M       102M       351M
-/+ buffers/cache:       718M       3.0G
Swap:           0B         0B         0B

The nginx processes look like this in top:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
29678 nobody    20   0  245516  61112   5032 S  11.3  1.6   0:22.35 nginx
28958 nobody    20   0  728236 542236   5340 S   3.0 14.1   3:02.60 nginx
subnetmarco commented 8 years ago

Just a few questions to better understand the context:

cb372 commented 8 years ago

It's the Kong .deb package, running on Ubuntu Vivid. Nginx config is unchanged - the only thing I touched in kong.yml is the Cassandra hostnames.

I've added the key-auth and mashape-analytics plugins via the API. I'm also using rate limiting. There are a few thousand consumers registered, each with a rate limit and one or two keys.

Let me know if there's any more helpful info I can provide.

thibaultcha commented 8 years ago

It seems likely to me that the analytics plugin cannot buffer the data at this pace, and that this is what is causing the crash. This is similar to openresty/lua-nginx-module#148 and openresty/lua-nginx-module#230, which propose ways of investigating the issue.

For now:

I am not sure this is the reason, but it seems the most likely one to me.

thibaultcha commented 8 years ago

I am not sure this can be qualified as a bug, since memory limitations can admittedly be optimized around but are ultimately unavoidable. This needs more investigation (especially to find out whether it can be optimized) before being qualified as such. The last question (do you see batches being sent from Kong to Galileo? Do you see logs in error.log saying so, and data in your Galileo dashboard?) is crucial to understanding the cause of the problem here.

subnetmarco commented 8 years ago

If we confirm the problem is the Galileo in-memory batch growing too big, this could be qualified as an enhancement for the plugin, and a solution could be dumping the buffer to disk when it grows past a specific size (similar to how nginx buffers request/response bodies to a file when they are too big).
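A minimal sketch of that spill-to-disk idea in Lua (all names, the threshold, and the spool location are hypothetical, not the plugin's actual code):

local cjson = require "cjson"

local max_queued_batches = 10            -- assumed threshold, would be configurable
local spool_dir = "/tmp/kong-analytics"  -- assumed spool location

local function spill_to_disk(queue)
  while #queue > max_queued_batches do
    local batch = table.remove(queue, 1)  -- spool the oldest batch first
    local path = string.format("%s/batch-%d-%d.json", spool_dir, ngx.time(), math.random(1e6))
    local fd, err = io.open(path, "w")
    if not fd then
      ngx.log(ngx.WARN, "[mashape-analytics] could not spool batch to disk: ", err)
      return
    end
    fd:write(cjson.encode(batch))
    fd:close()
  end
end

Note that plain io.open blocks the nginx worker, so a real implementation would have to weigh that cost against the memory it saves, which is part of why this is more complex than a simple cap on the queue.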

cb372 commented 8 years ago

@thibaultCha My theory at the moment is, yes, it's the analytics plugin. I turned it off for a few hours today and didn't see a single error of this kind. I turned it on again just before I went home, so I'll check the logs again tomorrow and see if the errors came back.

are you performing requests with bodies and are those being logged by the analytics plugin?

No, body logging is disabled, and the requests are all GETs anyway

how often are you emptying the buffers? conf.delay and conf.batch_size

100 events and 2 seconds (i.e. the default values)

how many APIs are you hitting in your tests?

Just 1 API

do you see batches being sent from Kong to Galileo? Do you see logs in error.log saying so, and data in your Galileo dashboard?

I don't have Kong's debug logs enabled, so it's hard to see exactly when it is sending batches. The Galileo dashboard does show data, but it has huge gaps of an hour or so at a time.

If I'm reading the source of the analytics plugin right, the size of a single batch is limited but the queue of batches waiting to be sent is unbounded. This would explain the out-of-memory issue, if for any reason Kong was temporarily unable to connect to the Analytics server.

@thefosk's suggestion of spooling to disk sounds good, but it also sounds quite complex and thus unlikely to be implemented for a while. As a first step I would be quite happy with:

  1. Make the batches queue bounded, with the max size configurable
  2. If the queue is full, just drop the batch on the floor and log a warning

I'm much happier losing analytics data for 100 events in a controlled manner than having Kong run out of memory, blow up without warning, and lose the data for the entire waiting-to-send queue.
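A minimal sketch of the bounded queue proposed above, in Lua (hypothetical names, not the actual buffer.lua implementation; the cap would presumably be exposed as plugin configuration):

local max_queue_size = 10  -- assumed cap on batches waiting to be sent

local queue = {}

local function enqueue_batch(batch)
  if #queue >= max_queue_size then
    -- queue is full: drop the batch on the floor and log a warning
    ngx.log(ngx.WARN, "[mashape-analytics] sending queue is full (", max_queue_size,
            " batches), dropping batch of ", #batch, " ALFs")
    return false
  end
  queue[#queue + 1] = batch
  return true
end

Dropping at enqueue time bounds the worst case to max_queue_size batches held in memory, regardless of what the analytics server does.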

thibaultcha commented 8 years ago

Even if you don't have debug logs enabled, the error.log file will contain error logs when batches were not acknowledged by Galileo. It would be very useful here to know if any of those happen, and what status code is returned by Galileo. That information is available in the logs. Currently, enqueued batches are dropped if Galileo responds with a 400 status code, but this might have been changed on Galileo's side and other status codes could now be used, causing invalid batches not to be dropped. This could be one reason.

Having a limit on the queue of batches is also a possibility, but it might not be needed if the error comes from un-dropped invalid batches.
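For illustration, a hedged sketch of the drop-on-status behaviour being described (assumed structure, not the plugin's code; the thread only confirms dropping on 400, and treating 207 as droppable reflects the later comments):

local function handle_collector_response(queue, status)
  if status == 200 then
    table.remove(queue, 1)  -- batch acknowledged: remove it from the queue
  elseif status == 400 or status == 207 then
    -- invalid (400) or partially rejected (207) batch: drop it rather than retry forever
    table.remove(queue, 1)
    ngx.log(ngx.WARN, "[mashape-analytics] batch rejected with status ", status)
  else
    -- 5xx, timeouts, etc.: the batch stays queued for retry; without a cap on the
    -- queue, this is where memory can keep growing
    ngx.log(ngx.ERR, "[mashape-analytics] collector returned ", status, ", batch kept for retry")
  end
end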

cb372 commented 8 years ago

I've just checked error.log again and found some errors coming from the plugin.

There are quite a few of these per day, but far fewer than the number of "Lua VM crashed, reason: not enough memory" logs.

2015/11/28 00:16:30 [error] 8894#0: [lua] buffer.lua:191: [mashape-analytics] socket server could not save the batch (100 ALFs 301424 bytes). Status: (504) Error: (), context: ngx.timer, client: 54.170.138.214, server: 0.0.0.0:8000
2015/11/28 00:54:33 [error] 8894#0: [lua] buffer.lua:191: [mashape-analytics] socket server could not save the batch (100 ALFs 313776 bytes). Status: (504) Error: (), context: ngx.timer, client: 54.216.32.247, server: 0.0.0.0:8000
2015/11/28 02:18:39 [error] 8894#0: [lua] buffer.lua:191: [mashape-analytics] socket server could not save the batch (100 ALFs 313838 bytes). Status: (504) Error: (), context: ngx.timer, client: 54.78.132.159, server: 0.0.0.0:8000

These also appeared, but only 8 of them over one 5 minute period:

2015/11/30 07:43:54 [error] 14338#0: [lua] buffer.lua:191: [mashape-analytics] socket server could not save the batch (100 ALFs 303971 bytes). Status: (207) Error: (Valid ALFs: 0/100 saved
), context: ngx.timer, client: 54.78.132.159, server: 0.0.0.0:8000
2015/11/30 07:44:10 [error] 14338#0: [lua] buffer.lua:191: [mashape-analytics] socket server could not save the batch (100 ALFs 299382 bytes). Status: (207) Error: (Valid ALFs: 0/100 saved
), context: ngx.timer, client: 54.216.251.254, server: 0.0.0.0:8000
2015/11/30 07:44:41 [error] 14338#0: [lua] buffer.lua:191: [mashape-analytics] socket server could not save the batch (100 ALFs 304055 bytes). Status: (207) Error: (Valid ALFs: 0/100 saved
), context: ngx.timer, client: 54.155.39.94, server: 0.0.0.0:8000

Is it possible that the Analytics server is not returning (many) error responses but is instead taking a long time to respond to the requests? That would be consistent with the behaviour we are seeing, i.e. out-of-memory errors + not many error logs. The 504 responses being returned also suggest that the Analytics server might be struggling to handle the load.
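If slow responses are indeed the trigger, one mitigation would be to cap the cosocket timeout so a slow collector fails fast with an error log instead of tying up the timer. A sketch, with an assumed endpoint and a hand-rolled request rather than the plugin's actual sending code:

local function send_batch(payload)
  local sock = ngx.socket.tcp()
  sock:settimeout(5000)  -- assumed 5s cap on connect/send/receive
  local ok, err = sock:connect("collector.example.com", 80)  -- assumed collector endpoint
  if not ok then
    ngx.log(ngx.ERR, "[mashape-analytics] could not connect to collector: ", err)
    return nil, err
  end
  local bytes, send_err = sock:send(payload)  -- payload would be a full HTTP request here
  if not bytes then
    sock:close()
    return nil, send_err
  end
  local status_line, recv_err = sock:receive("*l")  -- e.g. "HTTP/1.1 504 Gateway Time-out"
  sock:close()
  return status_line, recv_err
end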

cb372 commented 8 years ago

Sorry to repeat myself, but I strongly recommend having a bound on that queue. Relying on the HTTP response code is not enough - if the analytics server responds slowly (as I suspect it is doing), or if it goes down for a few minutes, as it inevitably will now and again, then the queue will start to grow.

It's a simple matter of fault isolation: auxiliary functionality such as the analytics plugin should not be able to cause any adverse effect on the main functionality of Kong, no matter what happens. And an out-of-memory error crashing the whole process is the mother of all adverse effects :)

I would really appreciate a quick fix for this. I was planning to put Kong into production this week, but I won't be able to do that, at least not with the analytics plugin enabled, until this issue is fixed. And since we are paying for Galileo, disabling the analytics would be a real shame.

thibaultcha commented 8 years ago

I said "it might" not be needed, in the sense that the issue could have not existed if Kong was handling 207. However, yes, it was already my intention to implement a cap on that queue after your first suggestion.

I have done so, along with other improvements to the buffer, in #757, which is opened against master as a hotfix. I am hoping to finish testing it soon so we can release a new version (0.5.4) in the next day(s).

cb372 commented 8 years ago

Awesome, thanks for the quick response!

thibaultcha commented 8 years ago

I believe we can close this now that #757 is merged and since 0.5.4 is coming (#767). Thanks for reporting.