Kong / kong

🦍 The Cloud-Native API Gateway and AI Gateway.
https://konghq.com/install/#kong-community
Apache License 2.0

Signal 17 (SIGCHLD) received at irregular intervals #1551

Closed: cmeury closed this 7 years ago

cmeury commented 8 years ago

Summary

I upgraded from 0.8.3 to 0.9.0 and everything seems good: APIs work, cluster state is fine (single node anyway). But in /var/lib/kong/logs/error.log I find many, many entries like this:

2016/08/25 18:16:46 [notice] 27008#0: signal 17 (SIGCHLD) received
2016/08/25 18:16:47 [notice] 27008#0: signal 17 (SIGCHLD) received
...

I checked the system journal, dmesg, and all kinds of logs, but I'm not sure where this is coming from. Ideas or pointers?


thibaultcha commented 8 years ago

Kong 0.9 upgraded the underlying OpenResty version to 1.9.15.1 and, as a consequence, the Nginx core version to 1.9, so that might be a cause. Are you using any custom Nginx modules other than the ones provided by OpenResty? Are you using a custom Nginx configuration with Kong? To debug this, could you recompile OpenResty with ./configure --with-debug?

Thanks

cmeury commented 8 years ago

Can confirm, nginx version: openresty/1.9.15.1. I don't use a custom nginx configuration and I did not install any custom Nginx modules.

I'll look into the --with-debug option; I'll need a large time window to do that.

thenayr commented 8 years ago

Encountering this same behavior with vanilla Kong 0.9.0.

Running via the docker-kong Docker image.

halliej commented 7 years ago

Seeing the same issue with 0.9.1 running on an AWS CentOS 7 AMI, with a PostgreSQL datastore. No custom code.

doughuang168 commented 7 years ago

Seeing the same issue with the 0.9.1 Docker container, running on DigitalOcean CentOS 7 and Ubuntu 16.04, with a Cassandra datastore. No custom code.

BrianHutchison commented 7 years ago

We're seeing this as well, on 0.9.2:

[notice] 121#0: signal 17 (SIGCHLD) received

It happens while our script pulls all consumers, APIs, etc. Each time we see about 300 of these.

Tieske commented 7 years ago

@BrianHutchison

when our script pulls all

So this happens when requesting data from the Admin API? Any further clues to narrow it down?

BrianHutchison commented 7 years ago

@Tieske

We master our data elsewhere and then sync it with Kong every half hour via a script. The script takes about 6 minutes to run on average.

The script pulls /apis, /consumers, and /plugins from Kong and compares them with the values in our own service. So generally it's just three GETs, one per collection, then small updates when there are diffs, such as when someone has added a consumer or changed a setting.
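For reference, the read portion amounts to three Admin API calls like these (a sketch; the port assumes the default Admin API listen address):

$ curl -s http://127.0.0.1:8001/apis
$ curl -s http://127.0.0.1:8001/consumers
$ curl -s http://127.0.0.1:8001/plugins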

Typically we see 309 such events each time the script runs. This happens even when the script is doing nothing more than reads.

However, it's sometimes more than just the script: I see now that at about noon today (UTC) we had over 20,000 such events logged within 2 hours. I see no request-volume correlation with that event. Some CPU spikes, but we had higher CPU spikes earlier in the day.

BrianHutchison commented 7 years ago

Interestingly, today the baseline number of events has roughly doubled: e.g., 618 per hour where we previously saw 309 in a typical hour. Note that this is across a cluster of 8 containers.

I'm seeing a fair number of these events as well:

2016/10/07 16:03:34 [warn] 111#0: [lua] reports.lua:23: log_error(): [reports] failed to acquire ping lock: timeout, context: ngx.timer

thibaultcha commented 7 years ago

This could be caused by one of our bindings or by usage of the LuaJIT FFI. To debug this, as said before, we need you to recompile Nginx in debug mode.

To do so, download the same OpenResty version as the one in use for your Kong installation:

$ nginx -v
nginx version: openresty/1.9.15.1
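For example, the matching source tarball can be fetched from openresty.org (a sketch; substitute the version reported by nginx -v):

$ wget https://openresty.org/download/openresty-1.9.15.1.tar.gz
$ tar xzf openresty-1.9.15.1.tar.gz
$ cd openresty-1.9.15.1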

Once downloaded, recompile it with the necessary Kong options, plus the debug mode enabled:

$ ./configure \
  --prefix=/path/to/prefix \
  --with-openssl=/path/to/openssl \
  --with-pcre-jit \
  --with-ipv6 \
  --with-http_realip_module \
  --with-http_ssl_module \
  --with-http_stub_status_module \
  --with-debug
$ make
$ [sudo] make install

If you run into trouble, you can find more information in the OpenResty installation instructions. The prefix used by the Kong packages is /usr/local/openresty. You will also need PCRE and OpenSSL.
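On CentOS 7, for example, the toolchain and libraries can typically be installed as follows (a sketch; this assumes building against the system PCRE and OpenSSL rather than source trees):

$ sudo yum install -y gcc make pcre-devel openssl-devel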

Verify your installation with:

$ nginx -V

The output should contain the --with-debug option.
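Since nginx -V prints its output to stderr, a quick way to check is:

$ nginx -V 2>&1 | grep -- --with-debug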

Additionally, set log_level = debug in kong.conf or use KONG_LOG_LEVEL=debug as an ENV variable.

Start or restart Kong as you usually would, making sure the nginx binary found in your PATH is the one previously compiled with the debug option.

This could already give us some more information as to what is going on and where.

If this doesn't give us any meaningful insight, it would be wise to generate a core dump and get the crash stacktrace with gdb. See how to do so at https://www.nginx.com/resources/wiki/start/topics/tutorials/debugging/
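As a sketch (paths are placeholders; see the linked tutorial for details), core dumps can usually be enabled and then inspected like this:

# allow core dumps in the shell that starts nginx
$ ulimit -c unlimited

# after a crash, open the core file with the debug binary
$ gdb /usr/local/openresty/nginx/sbin/nginx /path/to/core
(gdb) backtrace full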

Thanks!

Edit: prefix is /usr/local/openresty and not /usr/local/kong.

BrianHutchison commented 7 years ago

We will try to replicate this in a lab environment first. I'm not sure I want to run nginx in debug mode in production with the volume and priority of traffic we have right now. We will discuss it on our end; I appreciate your help, as always.

thibaultcha commented 7 years ago

If you happen to test this, please give us a minimal reproducible example of how to trigger it (including your OS version and the versions of installed dependencies, such as PCRE and OpenSSL).

BrianHutchison commented 7 years ago

The heart of our Dockerfile is basically:

FROM centos:7

ENV KONG_VERSION 0.9.2

RUN yum install -y wget https://github.com/Mashape/kong/releases/download/$KONG_VERSION/kong-$KONG_VERSION.el7.noarch.rpm && \
    yum clean all

We're seeing it in our lab environment, so we're checking there too.

For reproduction steps, I'd try a script that adds a few thousand consumers, about 100 APIs, and ACLs to match, such that it finishes in a few minutes.
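A minimal sketch of such a seeding script against the Admin API (hypothetical names and counts; the /apis fields follow the 0.9-era Admin API, and the ACL setup is omitted for brevity):

#!/bin/sh
# Seed ~100 APIs and a few thousand consumers via the Admin API.
ADMIN=http://127.0.0.1:8001

for i in $(seq 1 100); do
  curl -s -X POST "$ADMIN/apis" \
    -d "name=api-$i" \
    -d "request_path=/api-$i" \
    -d "upstream_url=http://example.com" > /dev/null
done

for i in $(seq 1 3000); do
  curl -s -X POST "$ADMIN/consumers" -d "username=user-$i" > /dev/null
done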

BrianHutchison commented 7 years ago

Just wanted to call out that we are seeing over an order of magnitude more of these events just after midnight UTC (the server timezone): e.g., 350-800 events per hour, then a spike to over 12,000 per hour for two hours, then back down. No idea if this is helpful, but it's another data point.

BrianHutchison commented 7 years ago

Using debug mode and a slightly newer version of OpenResty, here are some sample log lines around these events:

2016/10/13 23:33:31 [notice] 109#0: signal 17 (SIGCHLD) received
2016/10/13 23:33:31 [info] 109#0: waitpid() failed (10: No child processes)
2016/10/13 23:33:31 [debug] 109#0: *695 [lua] content_by_lua(nginx.conf:156):8: Loading Admin API endpoints
2016/10/13 23:33:31 [info] 109#0: *695 client 127.0.0.1 closed keepalive connection
2016/10/13 23:34:02 [notice] 109#0: signal 17 (SIGCHLD) received
2016/10/13 23:34:02 [info] 109#0: waitpid() failed (10: No child processes)
2016/10/13 23:34:02 [debug] 109#0: *698 [lua] content_by_lua(nginx.conf:156):8: Loading Admin API endpoints
2016/10/13 23:34:02 [info] 109#0: *698 client 127.0.0.1 closed keepalive connection
2016/10/13 23:34:32 [notice] 109#0: signal 17 (SIGCHLD) received
2016/10/13 23:34:32 [info] 109#0: waitpid() failed (10: No child processes)
2016/10/13 23:34:33 [debug] 109#0: *700 [lua] content_by_lua(nginx.conf:156):8: Loading Admin API endpoints
2016/10/13 23:34:33 [info] 109#0: *700 client 127.0.0.1 closed keepalive connection

Note: this is just on my dev box, in a Docker container.

thibaultcha commented 7 years ago

We would need the backtrace to debug this; there is too little information here.

rudijs commented 7 years ago

Hi,

Same issue for me using the latest Docker image (Kong 0.9.3):

2016/10/26 01:45:28 [notice] 86#0: signal 17 (SIGCHLD) received
2016/10/26 01:45:28 [notice] 86#0: signal 17 (SIGCHLD) received
2016/10/26 01:45:31 [notice] 86#0: signal 17 (SIGCHLD) received
2016/10/26 01:45:31 [notice] 86#0: signal 17 (SIGCHLD) received
2016/10/26 01:45:34 [notice] 86#0: signal 17 (SIGCHLD) received
2016/10/26 01:45:34 [notice] 86#0: signal 17 (SIGCHLD) received
2016/10/26 01:45:37 [notice] 86#0: signal 17 (SIGCHLD) received
2016/10/26 01:45:37 [notice] 86#0: signal 17 (SIGCHLD) received
2016/10/26 01:45:40 [notice] 86#0: signal 17 (SIGCHLD) received

Nothing custom, just the built-in features.

mihaur commented 7 years ago

Hi,

We are getting the same errors using Kong 0.9.3 on Ubuntu 16.04 with a custom nginx configuration. Our errors correlate with POST /cluster/events calls:

error.log

2016/10/27 09:35:03 [notice] 2354#0: signal 17 (SIGCHLD) received
2016/10/27 09:35:03 [info] 2354#0: waitpid() failed (10: No child processes)
2016/10/27 09:35:03 [info] 2354#0: *75203 client 127.0.0.1 closed keepalive connection
2016/10/27 09:35:33 [notice] 2354#0: signal 17 (SIGCHLD) received
2016/10/27 09:35:33 [info] 2354#0: waitpid() failed (10: No child processes)
2016/10/27 09:35:33 [info] 2354#0: *75216 client 127.0.0.1 closed keepalive connection
2016/10/27 09:36:03 [notice] 2354#0: signal 17 (SIGCHLD) received
2016/10/27 09:36:03 [info] 2354#0: waitpid() failed (10: No child processes)
2016/10/27 09:36:03 [info] 2354#0: *75217 client 127.0.0.1 closed keepalive connection
2016/10/27 09:36:33 [notice] 2354#0: signal 17 (SIGCHLD) received
2016/10/27 09:36:33 [info] 2354#0: waitpid() failed (10: No child processes)
2016/10/27 09:36:33 [info] 2354#0: *75233 client 127.0.0.1 closed keepalive connection
2016/10/27 09:37:03 [notice] 2354#0: signal 17 (SIGCHLD) received
2016/10/27 09:37:03 [info] 2354#0: waitpid() failed (10: No child processes)
2016/10/27 09:37:03 [info] 2354#0: *75234 client 127.0.0.1 closed keepalive connection

access.log

127.0.0.1 - - [27/Oct/2016:09:35:03 +0200] "POST /cluster/events/ HTTP/1.1" 200 5 "-" "lua-resty-http/0.08 (Lua) ngx_lua/10005"
127.0.0.1 - - [27/Oct/2016:09:35:33 +0200] "POST /cluster/events/ HTTP/1.1" 200 5 "-" "lua-resty-http/0.08 (Lua) ngx_lua/10005"
127.0.0.1 - - [27/Oct/2016:09:36:03 +0200] "POST /cluster/events/ HTTP/1.1" 200 5 "-" "lua-resty-http/0.08 (Lua) ngx_lua/10005"
127.0.0.1 - - [27/Oct/2016:09:36:33 +0200] "POST /cluster/events/ HTTP/1.1" 200 5 "-" "lua-resty-http/0.08 (Lua) ngx_lua/10005"
127.0.0.1 - - [27/Oct/2016:09:37:03 +0200] "POST /cluster/events/ HTTP/1.1" 200 5 "-" "lua-resty-http/0.08 (Lua) ngx_lua/10005"

JingWZ commented 7 years ago

Same issue in Kong 0.9.1 (OpenResty 1.9.15.1) on CentOS 6.5. It occurs on almost every API call and leads to another problem: as the nginx worker process appears to be killed and re-created, randomseed() is always called outside of init_worker.

JingWZ commented 7 years ago

Hi,

We got some nginx logs when using debug mode.

This is an excerpt from a GET /apis call:

2016/10/28 10:05:06 [debug] 24375#0: timer delta: 501
2016/10/28 10:05:06 [debug] 24375#0: worker cycle
2016/10/28 10:05:06 [debug] 24375#0: accept mutex lock failed: 0
2016/10/28 10:05:06 [debug] 24375#0: epoll timer: 500
2016/10/28 10:05:06 [debug] 24375#0: timer delta: 501
2016/10/28 10:05:06 [debug] 24375#0: worker cycle
2016/10/28 10:05:06 [debug] 24375#0: accept mutex lock failed: 0
2016/10/28 10:05:06 [debug] 24375#0: epoll timer: 500
2016/10/28 10:05:07 [debug] 24375#0: epoll: fd:6 ev:0001 d:00007F225F686550
2016/10/28 10:05:07 [debug] 24375#0: *5 http keepalive handler
2016/10/28 10:05:07 [debug] 24375#0: *5 malloc: 00007F2269346250:1024
2016/10/28 10:05:07 [debug] 24375#0: *5 recv: fd:6 378 of 1024
2016/10/28 10:05:07 [debug] 24375#0: *5 reusable connection: 0
2016/10/28 10:05:07 [debug] 24375#0: *5 posix_memalign: 00007F2269344B90:4096 @16
2016/10/28 10:05:07 [debug] 24375#0: *5 event timer del: 6: 1477620377249
2016/10/28 10:05:07 [debug] 24375#0: *5 http process request line
2016/10/28 10:05:07 [debug] 24375#0: *5 http request line: "GET /apis HTTP/1.1"
2016/10/28 10:05:07 [debug] 24375#0: *5 http uri: "/apis"
2016/10/28 10:05:07 [debug] 24375#0: *5 http args: ""
2016/10/28 10:05:07 [debug] 24375#0: *5 http exten: ""
2016/10/28 10:05:07 [debug] 24375#0: *5 http process request header line
2016/10/28 10:05:07 [debug] 24375#0: *5 http header: "Host: 192.168.60.22:8001"
2016/10/28 10:05:07 [debug] 24375#0: *5 http header: "Connection: keep-alive"
2016/10/28 10:05:07 [debug] 24375#0: *5 http header: "Cache-Control: no-cache"
2016/10/28 10:05:07 [debug] 24375#0: *5 http header: "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36"
2016/10/28 10:05:07 [debug] 24375#0: *5 http header: "Postman-Token: 4427c953-7cb6-2867-2c95-2bea4fcef6e0"
2016/10/28 10:05:07 [debug] 24375#0: *5 http header: "Accept: */*"
2016/10/28 10:05:07 [debug] 24375#0: *5 posix_memalign: 00007F226930E9E0:4096 @16
2016/10/28 10:05:07 [debug] 24375#0: *5 http header: "Accept-Encoding: gzip, deflate, sdch"
2016/10/28 10:05:07 [debug] 24375#0: *5 http header: "Accept-Language: zh-CN,zh;q=0.8,en;q=0.6"
2016/10/28 10:05:07 [debug] 24375#0: *5 http header done
2016/10/28 10:05:07 [debug] 24375#0: *5 generic phase: 0
2016/10/28 10:05:07 [debug] 24375#0: *5 rewrite phase: 1
2016/10/28 10:05:07 [debug] 24375#0: *5 rewrite phase: 2
2016/10/28 10:05:07 [debug] 24375#0: *5 test location: "/"
2016/10/28 10:05:07 [debug] 24375#0: *5 test location: "robots.txt"
2016/10/28 10:05:07 [debug] 24375#0: *5 test location: "nginx_status"
2016/10/28 10:05:07 [debug] 24375#0: *5 using configuration "/"
2016/10/28 10:05:07 [debug] 24375#0: *5 http cl:-1 max:10485760
2016/10/28 10:05:07 [debug] 24375#0: *5 rewrite phase: 4
2016/10/28 10:05:07 [debug] 24375#0: *5 rewrite phase: 5
2016/10/28 10:05:07 [debug] 24375#0: *5 post rewrite phase: 6
2016/10/28 10:05:07 [debug] 24375#0: *5 generic phase: 7
2016/10/28 10:05:07 [debug] 24375#0: *5 generic phase: 8
2016/10/28 10:05:07 [debug] 24375#0: *5 generic phase: 9
2016/10/28 10:05:07 [debug] 24375#0: *5 access phase: 10
2016/10/28 10:05:07 [debug] 24375#0: *5 access phase: 11
2016/10/28 10:05:07 [debug] 24375#0: *5 access phase: 12
2016/10/28 10:05:07 [debug] 24375#0: *5 lua access handler, uri:"/apis" c:1
2016/10/28 10:05:07 [debug] 24375#0: *5 post access phase: 13
2016/10/28 10:05:07 [debug] 24375#0: *5 lua content handler, uri:"/apis" c:1
2016/10/28 10:05:07 [debug] 24375#0: *5 add cleanup: 00007F226930EBB8
2016/10/28 10:05:07 [debug] 24375#0: *5 lua creating new vm state
2016/10/28 10:05:07 [debug] 24375#0: *5 lua initializing lua registry
2016/10/28 10:05:07 [debug] 24375#0: *5 lua initializing lua globals
2016/10/28 10:05:07 [debug] 24375#0: *5 lua initialize the global Lua VM 0000000041480378
2016/10/28 10:05:07 [debug] 24375#0: *5 malloc: 00007F22692B3240:16
2016/10/28 10:05:07 [notice] 24375#0: signal 17 (SIGCHLD) received
2016/10/28 10:05:07 [info] 24375#0: waitpid() failed (10: No child processes)
2016/10/28 10:05:07 [notice] 24375#0: signal 17 (SIGCHLD) received
2016/10/28 10:05:07 [info] 24375#0: waitpid() failed (10: No child processes)
2016/10/28 10:05:07 [notice] 24375#0: signal 17 (SIGCHLD) received
2016/10/28 10:05:07 [info] 24375#0: waitpid() failed (10: No child processes)

kong_0.9.1_debug_mode_logs_for_signal_17.txt

thibaultcha commented 7 years ago

Update: I was able to reproduce this on CentOS 7.2. I traced the problem, and it is caused by this exact line: https://github.com/Mashape/kong/blob/master/kong/serf.lua#L38

That line ultimately calls os.execute() via the pl.compat Penlight module.

Our errors are correlated to POST /cluster/events calls

@mihaur's hunch was correct: those errors are correlated with /cluster/events calls.

I will now try to create a minimal example that reproduces this in a raw OpenResty instance, to see what exactly is causing it. It is worth noting that os.execute() should ideally not be called in OpenResty's content_by_lua phase anyway.

thibaultcha commented 7 years ago

Okay, so after being able to reproduce on CentOS 7.2 with a simple:

location / {
    content_by_lua_block {
        os.execute()
        ngx.say("ok")
    }
}

I think it is safe to assume that this log is harmless. The nginx worker receives a SIGCHLD because the shell process spawned by os.execute() terminates and signals the worker when it exits. You'll also notice that the logging level of those entries is notice and info, not error or higher.

Please ignore these logs. Ideally, they will go away once we avoid calling os.execute() from Nginx, since it performs blocking I/O.
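For anyone who wants to confirm this mechanism on a running instance, attaching strace to a worker while hitting the Admin API should show the shell child being spawned and reaped (a sketch; requires strace and permission to trace the process):

$ strace -f -e trace=clone,execve,wait4 -p $(pgrep -f "nginx: worker" | head -1)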

kapil26021994 commented 7 months ago

@thibaultcha can you tell me where we have to put the above code?