Kong / kong

🦍 The Cloud-Native API Gateway and AI Gateway.

rate-limiting error Redis timeout #13473

Closed: liukaiyuliukaiyu closed this issue 3 weeks ago

liukaiyuliukaiyu commented 2 months ago

Is there an existing issue for this?

Kong version ($ kong version)

3.4.0

Current Behavior

Kong uses Redis for rate limiting.

Network topology: Kong is deployed in elastic cloud containers, and calls to the Redis cluster stay within the same data center; no cross-datacenter or leased-line hops are involved.

rate-limiting plugin config = {"redis_database":0,"policy":"redis","redis_host":"xx.xx.xx.xx","redis_timeout":50,"limit_by":"server","second":500,"redis_port":xx,"redis_password":"","fault_tolerant":true}

Redis cluster configuration: 12 GB memory, 3 proxies, and 12 Redis instances arranged as one master plus one replica each, i.e. 6 Redis groups.

Kong machine specification: 4C8G. The container environment carries only test traffic, with no other requests, and the wrk tool is used to send requests in parallel. [screenshot]
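For context, a minimal sketch of how a config like the one above might be attached through the Admin API (the service name `my-service`, the Admin API address, and the Redis port are placeholders rather than values from this report; only the core fields are shown):

```sh
# Hypothetical example: enable the rate-limiting plugin with the redis policy on one service.
curl -i -X POST http://localhost:8001/services/my-service/plugins \
  --data "name=rate-limiting" \
  --data "config.second=500" \
  --data "config.policy=redis" \
  --data "config.redis_host=xx.xx.xx.xx" \
  --data "config.redis_port=6379" \
  --data "config.redis_timeout=50" \
  --data "config.redis_database=0" \
  --data "config.fault_tolerant=true"
```

Note that `config.redis_timeout` is in milliseconds, so with `redis_timeout=50` any Redis round trip slower than 50 ms surfaces as the timeout reported here.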

Expected Behavior

Kong can read Redis responses normally instead of timing out.

Steps To Reproduce

  1. Network topology: Kong is deployed in elastic cloud containers, and calls to the Redis cluster stay within the same data center; no cross-datacenter or leased-line hops are involved. Redis cluster configuration: 12 GB memory, 3 proxies, and 12 Redis instances arranged as one master plus one replica each, i.e. 6 Redis groups. Kong machine specification: 4C8G. The container environment carries only test traffic, with no other requests, and the wrk tool is used to send requests in parallel.
  2. rate-limiting plugin config = {"redis_database":0,"policy":"redis","redis_host":"xx.xx.xx.xx","redis_timeout":50,"limit_by":"server","second":500,"redis_port":xx,"redis_password":"","fault_tolerant":true}
  3. ./wrk -t12 -c50 -d1s http://127.0.0.1:8000/xxx
  4. Redis: timeout [screenshot]

Anything else?

no

chobits commented 2 months ago

Could you use a Redis client tool to check if Redis is operational during the time you're benchmarking Kong?
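For example, a quick way to watch Redis from the Kong container while wrk is running could look like this (a sketch; the host and port are placeholders for the actual proxy address):

```sh
# Check basic liveness and round-trip latency against the Redis proxy Kong uses
# (host/port are placeholders).
redis-cli -h xx.xx.xx.xx -p 6379 ping
redis-cli -h xx.xx.xx.xx -p 6379 --latency-history -i 1   # prints latency stats every second
redis-cli -h xx.xx.xx.xx -p 6379 info stats | grep -E 'instantaneous_ops_per_sec|rejected_connections'
```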

Based on the information you provided, we can't determine where the timeout issue is occurring—it could be on the Kong side or the Redis side.

BTW, providing the full text of the Redis error log would help the community debug this. Your screenshot cuts off part of one line and is incomplete.

chobits commented 2 months ago

Please note that to effectively benchmark Kong, you need to identify and reach a bottleneck in your system. Only then can you obtain meaningful results, such as QPS from Kong. We'd also like to know what the bottleneck is during your benchmark—it could be 100% CPU usage by Kong, full network card utilization, or 100% CPU usage by Redis.
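To make that concrete, here is a rough sketch of how the three candidate bottlenecks could be watched during the wrk run (standard Linux tools; the Redis host and port are placeholders):

```sh
# CPU usage of the Kong (nginx) worker processes
pidstat -u 1 -p "$(pgrep -d, -f 'nginx: worker')"

# Network interface utilization on the Kong container
sar -n DEV 1

# CPU usage reported by Redis itself (host/port are placeholders)
redis-cli -h xx.xx.xx.xx -p 6379 info cpu
```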

git-hulk commented 2 months ago

I believe it's impossible to overload the Redis instance with several hundred RPS. Perhaps you should check the number of workers in your Kong instance as @chobits mentioned.
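For reference, a small sketch of checking and raising the worker count (the value 4 matches the 4-core container and is only an assumption; `nginx_worker_processes` is the corresponding kong.conf setting):

```sh
# How many nginx worker processes is Kong actually running?
ps -ef | grep -c '[n]ginx: worker process'

# Raise the worker count via environment (equivalent to nginx_worker_processes in kong.conf),
# then restart Kong for it to take effect.
export KONG_NGINX_WORKER_PROCESSES=4
kong restart
```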

liukaiyuliukaiyu commented 2 months ago

> Please note that to effectively benchmark Kong, you need to identify and reach a bottleneck in your system. Only then can you obtain meaningful results, such as QPS from Kong. We'd also like to know what the bottleneck is during your benchmark—it could be 100% CPU usage by Kong, full network card utilization, or 100% CPU usage by Redis.

The maximum bandwidth of the Kong container's network card is 10 Gbit/s, and the Kong container's CPU idle rate is above 85%. Capturing and analyzing packets with tcpdump shows that the Redis server responds within 5 ms, yet a timeout is returned after waiting more than 50 ms. With rate limiting enabled, each request makes nginx interact with the Redis server: it looks up the value of a key in Redis and INCRBYs that key. In the capture below, the Redis server responded within 1.5 ms. [screenshot]

I printed the traceId of each request and used the traceId as the Redis key for that request. It can be confirmed that the key belonging to the client read timeout appears in the packet capture file, and the response reached the client within 1.5 ms.

My suspicion is that the kernel does not deliver the read event to the nginx application layer in time; the delay between epoll events could be measured at the kernel level. If the nginx application cannot handle the event in time, how can this be optimized, for example by adjusting the nginx configuration or Linux kernel parameters?

chobits commented 2 months ago

> The Kong container's CPU idle rate is above 85%

You mean that when you benchmarked Kong, it experienced Redis timeouts while the Kong worker processes were only at around 15% CPU usage?

If that's the case, I guess something might be stuck inside Kong.

Did you see any other error logs besides the Redis timeout? For example, any logs related to timer-ng, DNS, connection timeouts, etc.?

> My suspicion is that the kernel does not deliver the read event to the nginx application layer in time; the delay between epoll events could be measured at the kernel level.

I think this should be the last thing to consider. In my experience, I haven't encountered any bugs related to kernel epoll/events.

> If the nginx application cannot handle the event in time, how can this be optimized, for example by adjusting the nginx configuration or Linux kernel parameters?

We don't know yet, because we haven't located the root cause of this situation.

chobits commented 2 months ago

As for the error log in your screenshot, like `failed to eval Redis: timeout===`, was this log added by yourself? I don't find any error log like this in the Kong source.

liukaiyuliukaiyu commented 2 months ago

> As for the error log in your screenshot, like `failed to eval Redis: timeout===`, was this log added by yourself? I don't find any error log like this in the Kong source.

More debug logs are listed below: 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua access handler, uri:"/test_kylin_upstream_timeout" c:1 2024/08/14 18:08:39 [debug] 7021#7021: 201070 looking up Lua code cache with key '=access_by_lua(nginx-kong.conf:141)nhli_cd92897f6c733fbbef91fc813412a06e' 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua creating new thread 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua reset ctx 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua run thread, top:0 c:1 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua resume returned 1 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua thread yielded 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua coroutine: resume 2024/08/14 18:08:39 [debug] 7021#7021: 201070 [lua] base_plugin.lua:24: access(): executing plugin "rate-limiting": access 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua tcp socket pool get keepalive peer 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua tcp socket get keepalive peer: using connection 00007F89D76CE538, fd:171 2024/08/14 18:08:39 [debug] 7021#7021: 201070 calling ngx_http_lua_get_keepalive_peer ngx_del_timer 2024/08/14 18:08:39 [debug] 7021#7021: 201070 event timer del: 171: 4346592003 2024/08/14 18:08:39 [debug] 7021#7021: 201070 http cleanup add: 00000000026E5708 2024/08/14 18:08:39 [error] 7021#7021: 201070 [lua] redis.lua:219: _gen_req(): ======nargs======5, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000" 2024/08/14 18:08:39 [error] 7021#7021: 201070 [lua] redis.lua:230: _gen_req(): ======_gen_req=====eval=====i=====1, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000" 2024/08/14 18:08:39 [error] 7021#7021: 201070 [lua] redis.lua:230: _gen_req(): ======_gen_req===== local cache_key, expiration = KEYS[1], ARGV[1] 2024/08/14 18:08:39 [error] 7021#7021: 201070 [lua] redis.lua:230: _gen_req(): ======_gen_req=====1=====i=====3, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000" 2024/08/14 18:08:39 [error] 7021#7021: 201070 [lua] redis.lua:230: _gen_req(): ======_gen_req=====ratelimit:d7af6063-037c-47d0-934a-1f19c412ea94:d7af6063-037c-47d0-934a-1f19c412ea94:1723630119000:second:0aa0512166bc8227b5ae1b6d04a47402=====i=====4, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000" 2024/08/14 18:08:39 [error] 7021#7021: 201070 [lua] redis.lua:230: _gen_req(): ======_gen_req=====1=====i=====5, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000" 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua tcp socket send timeout: 50 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua allocate new chainlink and new buf of size 431, cl:00000000026E5750 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua tcp socket send data 2024/08/14 18:08:39 [debug] 7021#7021: 201070 send: fd:171 431 of 431 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua tcp socket sent all the data 2024/08/14 18:08:39 [error] 7021#7021: 201070 [lua] redis.lua:128: eval(): before_receive_time timestamp: 1723630119011, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000" 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua tcp socket calling receive() method 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua tcp socket read timeout: 50 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua allocate new 
chainlink and new buf of size 4096, cl:00000000026E5960 2024/08/14 18:08:39 [debug] 7021#7021: 201070 malloc: 00000000026E5AA0:4096 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua tcp socket read timeout: 50 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua tcp socket read data: wait:0 2024/08/14 18:08:39 [debug] 7021#7021: 201070 event timer add: 171: 50:4346582060 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua resume returned 1 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua thread yielded 2024/08/14 18:08:39 [debug] 7021#7021: 201070 http run request: "/test_kylin_upstream_timeout?" 2024/08/14 18:08:39 [debug] 7021#7021: 201070 calling write_event_handler 2024/08/14 18:08:39 [debug] 7021#7021: 201070 access phase: 14 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua access handler, uri:"/test_kylin_upstream_timeout" c:1 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua run write event handler: timedout:0, ready:1, writing_raw_req_socket:0 2024/08/14 18:08:39 [debug] 7021#7021: 201070 useless lua write event handler 2024/08/14 18:08:39 [debug] 7021#7021: 201070 write_event_handler executed in 0 ms 2024/08/14 18:08:39 [debug] 7021#7021: 201070 event timer del: 171: 4346582060 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua tcp socket handler for "/test_kylin_upstream_timeout?", wev 0 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua tcp socket read handler 2024/08/14 18:08:39 [error] 7021#7021: 201070 lua tcp socket read timed out, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000" 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua tcp socket handle read error 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua tcp socket waking up the current request 2024/08/14 18:08:39 [debug] 7021#7021: 201070 access phase: 14 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua access handler, uri:"/test_kylin_upstream_timeout" c:1 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua tcp operation done, resuming lua thread 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua tcp socket calling prepare retvals handler 0000000000663E34, u:00007F89D0185818 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua tcp socket receive return value handler 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua run thread, top:0 c:1 2024/08/14 18:08:39 [error] 7021#7021: 201070 [lua] redis.lua:132: eval(): =========receive time===== 50ms, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000" 2024/08/14 18:08:39 [error] 7021#7021: 201070 [lua] redis.lua:133: eval(): =======line=======nil=====err=====timeout=====partial=====, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000" 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua finalize socket 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua http cleanup free: 00000000026E5708 2024/08/14 18:08:39 [debug] 7021#7021: 201070 lua close socket connection 2024/08/14 18:08:39 [debug] 7021#7021: 201070 free: 0000000002357AF0, unused: 48 2024/08/14 18:08:39 [debug] 7021#7021: 201070 reusable connection: 0 2024/08/14 18:08:39 [error] 7021#7021: 201070 [lua] init.lua:171: usage(): failed to eval Redis: timeout=====operationid=====0aa0512166bc8227b5ae1b6d04a47402, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000" 2024/08/14 18:08:39 [error] 7021#7021: *201070 [lua] handler.lua:101: failed to get usage: timeout, client: 127.0.0.1, server: kylin, request: 
"GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"

chobits commented 2 months ago

So what's your custom debug code behind this kind of debug log, `=======line=======nil=====err=====timeout=====partial=====`? We don't find it in Kong's source code.

Also note that we still can't pinpoint the reason for the Redis timeout, even though you've shown the timeout error and the source code.

If I were you, I would add more debug/error logs or use the stapxx tool to trace the stack of the Redis timeout. It looks like an interesting debugging journey.

Also, please confirm whether the tcpdump info you provided is 100% associated with the timeout situation; if it is, then you can debug into the Kong source code (the Redis Lua code) or OpenResty's lua-nginx-module tcpsock implementation.

liukaiyuliukaiyu commented 2 months ago

> So what's your custom debug code behind this kind of debug log, `=======line=======nil=====err=====timeout=====partial=====`? We don't find it in Kong's source code.
>
> Also note that we still can't pinpoint the reason for the Redis timeout, even though you've shown the timeout error and the source code.
>
> If I were you, I would add more debug/error logs or use the stapxx tool to trace the stack of the Redis timeout. It looks like an interesting debugging journey.
>
> Also, please confirm whether the tcpdump info you provided is 100% associated with the timeout situation; if it is, then you can debug into the Kong source code (the Redis Lua code) or OpenResty's lua-nginx-module tcpsock implementation.

The log lines containing `=======` were all added by myself; since they are not part of Kong's source, how should I debug the code further? Comparing the debug log against the normal scenario where no timeout is reported, we can determine that the 50 ms receive timer fires: the timer is deleted first, and then `lua tcp socket read timed out` is reported.

```
2024/08/14 18:08:39 [debug] 7021#7021: *201070 useless lua write event handler
2024/08/14 18:08:39 [debug] 7021#7021: *201070 write_event_handler executed in 0 ms
2024/08/14 18:08:39 [debug] 7021#7021: *201070 event timer del: 171: 4346582060
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua tcp socket handler for "/test_kylin_upstream_timeout?", wev 0
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua tcp socket read handler
2024/08/14 18:08:39 [error] 7021#7021: *201070 lua tcp socket read timed out, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua tcp socket handle read error
2024/08/14 18:08:39 [debug] 7021#7021: *201070 lua tcp socket waking up the current request
2024/08/14 18:08:39 [debug] 7021#7021: *201070 access phase: 14
```

The following is a debug log that normally does not report a timeout:

2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua access handler, uri:"/test_kylin_upstream_timeout" c:1 2024/08/14 18:08:38 [debug] 7021#7021: 201070 looking up Lua code cache with key '=access_by_lua(nginx-kong.conf:141)nhli_cd92897f6c733fbbef91fc813412a06e' 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua creating new thread 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua reset ctx 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua run thread, top:0 c:1 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua resume returned 1 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua thread yielded 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua coroutine: resume 2024/08/14 18:08:38 [debug] 7021#7021: 201070 [lua] base_plugin.lua:24: access(): executing plugin "rate-limiting": access 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua tcp socket pool get keepalive peer 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua tcp socket get keepalive peer: using connection 00007F89D76CDB60, fd:162 2024/08/14 18:08:38 [debug] 7021#7021: 201070 calling ngx_http_lua_get_keepalive_peer ngx_del_timer 2024/08/14 18:08:38 [debug] 7021#7021: 201070 event timer del: 162: 4346591918 2024/08/14 18:08:38 [debug] 7021#7021: 201070 http cleanup add: 00000000026E6718 2024/08/14 18:08:38 [error] 7021#7021: 201070 [lua] redis.lua:219: _gen_req(): ======nargs======5, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000" 2024/08/14 18:08:38 [error] 7021#7021: 201070 [lua] redis.lua:230: _gen_req(): ======_gen_req=====eval=====i=====1, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000" 2024/08/14 18:08:38 [error] 7021#7021: 201070 [lua] redis.lua:230: _gen_req(): ======_gen_req===== local cache_key, expiration = KEYS[1], ARGV[1] 2024/08/14 18:08:38 [error] 7021#7021: 201070 [lua] redis.lua:230: _gen_req(): ======_gen_req=====1=====i=====3, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000" 2024/08/14 18:08:38 [error] 7021#7021: 201070 [lua] redis.lua:230: _gen_req(): ======_gen_req=====ratelimit:d7af6063-037c-47d0-934a-1f19c412ea94:d7af6063-037c-47d0-934a-1f19c412ea94:1723630118000:second:0aa0512166bc8226b5ae1b6d04a43d02=====i=====4, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000" 2024/08/14 18:08:38 [error] 7021#7021: 201070 [lua] redis.lua:230: _gen_req(): ======_gen_req=====1=====i=====5, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000" 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua tcp socket send timeout: 50 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua allocate new chainlink and new buf of size 431, cl:00000000026E6760 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua tcp socket send data 2024/08/14 18:08:38 [debug] 7021#7021: 201070 send: fd:162 431 of 431 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua tcp socket sent all the data 2024/08/14 18:08:38 [error] 7021#7021: 201070 [lua] redis.lua:128: eval(): before_receive_time timestamp: 1723630118926, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000" 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua tcp socket calling receive() method 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua tcp socket read timeout: 50 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua allocate new chainlink and new buf of size 
4096, cl:00000000026E6970 2024/08/14 18:08:38 [debug] 7021#7021: 201070 malloc: 00000000026E6AB0:4096 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua tcp socket read timeout: 50 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua tcp socket read data: wait:0 2024/08/14 18:08:38 [debug] 7021#7021: 201070 event timer add: 162: 50:4346581975 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua resume returned 1 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua thread yielded 2024/08/14 18:08:38 [debug] 7021#7021: 201070 http run request: "/test_kylin_upstream_timeout?" 2024/08/14 18:08:38 [debug] 7021#7021: 201070 calling write_event_handler 2024/08/14 18:08:38 [debug] 7021#7021: 201070 access phase: 14 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua access handler, uri:"/test_kylin_upstream_timeout" c:1 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua run write event handler: timedout:0, ready:1, writing_raw_req_socket:0 2024/08/14 18:08:38 [debug] 7021#7021: 201070 useless lua write event handler 2024/08/14 18:08:38 [debug] 7021#7021: 201070 write_event_handler executed in 0 ms 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua tcp socket handler for "/test_kylin_upstream_timeout?", wev 0 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua tcp socket read handler 2024/08/14 18:08:38 [debug] 7021#7021: 201070 calling ngx_http_lua_socket_read_handler ngx_del_timer 2024/08/14 18:08:38 [debug] 7021#7021: 201070 event timer del: 162: 4346581975 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua tcp socket read data: wait:1 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua tcp socket try to recv data 4096: "/test_kylin_upstream_timeout?" 2024/08/14 18:08:38 [debug] 7021#7021: 201070 recv: eof:0, avail:1 2024/08/14 18:08:38 [debug] 7021#7021: 201070 recv: fd:162 4 of 4096 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua tcp socket recv returned 4: "/test_kylin_upstream_timeout?" 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua tcp socket read line "024/08/14 18:08:38 [debug] 7021#7021: 201070 lua read the final line part: ":0 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua tcp socket receive done: wait:1, eof:0, uri:"/test_kylin_upstream_timeout?" 
2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua tcp socket waking up the current request (read) 2024/08/14 18:08:38 [debug] 7021#7021: 201070 access phase: 14 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua access handler, uri:"/test_kylin_upstream_timeout" c:1 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua tcp operation done, resuming lua thread 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua tcp socket calling prepare retvals handler 0000000000663E34, u:00007F89CB123010 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua tcp socket receive return value handler 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua run thread, top:0 c:1 2024/08/14 18:08:38 [error] 7021#7021: 201070 [lua] redis.lua:132: eval(): =========receive time===== 36ms, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000" 2024/08/14 18:08:38 [error] 7021#7021: 201070 [lua] redis.lua:133: eval(): =======line=======:0=====err=====nil=====partial=====nil, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000" 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua tcp socket set keepalive: saving connection 00007F89D76CDB60 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua tcp socket clear current socket connection 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua tcp socket keepalive timeout: 10000 ms 2024/08/14 18:08:38 [debug] 7021#7021: 201070 event timer add: 162: 10000:4346591961 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua finalize socket 2024/08/14 18:08:38 [debug] 7021#7021: 201070 lua http cleanup free: 00000000026E6718 2024/08/14 18:08:38 [error] 7021#7021: 201070 [lua] init.lua:182: usage(): =========Redis: after set_keepalive=======, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"

liukaiyuliukaiyu commented 2 months ago

> So what's your custom debug code behind this kind of debug log, `=======line=======nil=====err=====timeout=====partial=====`? We don't find it in Kong's source code.
>
> Also note that we still can't pinpoint the reason for the Redis timeout, even though you've shown the timeout error and the source code.
>
> If I were you, I would add more debug/error logs or use the stapxx tool to trace the stack of the Redis timeout. It looks like an interesting debugging journey.
>
> Also, please confirm whether the tcpdump info you provided is 100% associated with the timeout situation; if it is, then you can debug into the Kong source code (the Redis Lua code) or OpenResty's lua-nginx-module tcpsock implementation.

The error is visible in the nginx debug log as `timer delta: 54` (ms); this makes it possible to determine whether the delay occurs while the ngx_process_events function is processing the events.

```
2024/08/20 19:55:42 [debug] 7019#7019: *1338394 http run request: "/test_kylin_upstream_timeout?"
2024/08/20 19:55:42 [debug] 7019#7019: *1338394 calling write_event_handler
2024/08/20 19:55:42 [debug] 7019#7019: *1338394 access phase: 14
2024/08/20 19:55:42 [debug] 7019#7019: *1338394 lua access handler, uri:"/test_kylin_upstream_timeout" c:1
2024/08/20 19:55:42 [debug] 7019#7019: *1338394 lua run write event handler: timedout:0, ready:1, writing_raw_req_socket:0
2024/08/20 19:55:42 [debug] 7019#7019: *1338394 useless lua write event handler
2024/08/20 19:55:42 [debug] 7019#7019: *1338394 write_event_handler executed in 0 ms
2024/08/20 19:55:42 [debug] 7019#7019: timer delta: 54
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 event timer del: 75: 4871405350
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 lua tcp socket handler for "/test_kylin_upstream_timeout?", wev 0
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 lua tcp socket read handler
2024/08/20 19:55:42 [error] 7019#7019: *1338482 lua tcp socket read timed out, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 lua tcp socket handle read error
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 lua tcp socket waking up the current request
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 access phase: 14
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 lua access handler, uri:"/test_kylin_upstream_timeout" c:1
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 lua tcp operation done, resuming lua thread
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 lua tcp socket calling prepare retvals handler 0000000000663E34, u:00007F89CB1C03A8
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 lua tcp socket receive return value handler
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 lua run thread, top:0 c:1
2024/08/20 19:55:42 [error] 7019#7019: *1338482 [lua] redis.lua:132: eval(): =========receive time===== 55ms, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/20 19:55:42 [error] 7019#7019: *1338482 [lua] redis.lua:133: eval(): =======line=======nil=====err=====timeout=====partial=====, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 lua finalize socket
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 lua http cleanup free: 0000000002579E88
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 lua close socket connection
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 free: 0000000002356020, unused: 48
2024/08/20 19:55:42 [debug] 7019#7019: *1338482 reusable connection: 0
2024/08/20 19:55:42 [error] 7019#7019: *1338482 [lua] init.lua:171: usage(): failed to eval Redis: timeout=====operationid=====0aa0512166c4843eb6041b6b04bce802, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
```

liukaiyuliukaiyu commented 2 months ago

nginx source code; this snippet is from `ngx_process_events_and_timers` in nginx's `src/event/ngx_event.c`:

```c
    /* snapshot the cached clock before polling for events */
    delta = ngx_current_msec;

    /* wait for I/O events (epoll) for up to `timer` ms and run their handlers */
    (void) ngx_process_events(cycle, timer, flags);

    /* how much the cached clock (ngx_current_msec) advanced across the call */
    delta = ngx_current_msec - delta;

    ngx_log_debug1(NGX_LOG_DEBUG_EVENT, cycle->log, 0,
                   "timer delta: %M", delta);
```

chobits commented 2 months ago

> The error is visible in the nginx debug log as `timer delta: 54` (ms); this makes it possible to determine whether the delay occurs while the ngx_process_events function is processing the events.

I don’t think 54ms is excessively long. During one cycle of epoll events processing (ngx_process_events), there aren't any special I/O operations happening aside from writing debug logs. Therefore, it seems likely that the 54ms is primarily consumed by CPU-intensive operations and the time taken for debug log writing.

chobits commented 2 months ago

From your debug log:

```
2024/08/20 19:55:42 [error] 7019#7019: *1338482 [lua] redis.lua:132: eval(): =========receive time===== 55ms, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/20 19:55:42 [error] 7019#7019: *1338482 [lua] redis.lua:133: eval(): =======line=======nil=====err=====timeout=====partial=====, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
```

These two error log entries are puzzling: one indicates a timeout, while the other reports a latency of only 55 ms. In my opinion, 55 ms is not a significant latency.


First, it's best to capture a case where the delay exceeds a few seconds (ideally > 1 s, which would indicate a significant timeout). Make sure you have the corresponding tcpdump packets and debug log information; in particular, the debug log should cover the span from when the request is sent on the Redis connection to when the response is received (or the timeout occurs). Then you can analyze the timeout by correlating the tcpdump capture with the debug logs.
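A sketch of capturing both sides at once (port 6379 and the default Kong prefix are assumptions; adjust them to the actual Redis/proxy port and installation):

```sh
# Capture all traffic between Kong and the Redis proxy
# (port 6379 is a placeholder for the actual Redis/proxy port).
sudo tcpdump -i any -nn -s0 -w kong-redis.pcap 'tcp port 6379' &

# Run Kong with debug-level logging so the lua tcp socket events are recorded
# (equivalent to log_level = debug in kong.conf); error.log lives under the Kong prefix.
export KONG_LOG_LEVEL=debug
kong restart
tail -f /usr/local/kong/logs/error.log
```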

liukaiyuliukaiyu commented 2 months ago

> The error is visible in the nginx debug log as `timer delta: 54` (ms); this makes it possible to determine whether the delay occurs while the ngx_process_events function is processing the events.
>
> I don't think 54ms is excessively long. During one cycle of epoll events processing (ngx_process_events), there aren't any special I/O operations happening aside from writing debug logs. Therefore, it seems likely that the 54ms is primarily consumed by CPU-intensive operations and the time taken for debug log writing.

@chobits We checked with our colleagues in the kernel group and determined that the 50 ms delay was in user mode, that is, in Lua + nginx. We need to keep investigating where it gets stuck in user mode and whether the Lua or nginx parameters can be adjusted. At least the direction is now clear.

liukaiyuliukaiyu commented 1 month ago

> Please note that to effectively benchmark Kong, you need to identify and reach a bottleneck in your system. Only then can you obtain meaningful results, such as QPS from Kong. We'd also like to know what the bottleneck is during your benchmark—it could be 100% CPU usage by Kong, full network card utilization, or 100% CPU usage by Redis.

@chobits Looking at the basic metrics of the container, we can see that the Redis timeout is triggered when the number of net.sockets.tp.inuse connections on the machine reaches 6000. How should we tune this? Do you have similar experience?

github-actions[bot] commented 1 month ago

This issue is marked as stale because it has been open for 14 days with no activity.

chobits commented 1 month ago

> Please note that to effectively benchmark Kong, you need to identify and reach a bottleneck in your system. Only then can you obtain meaningful results, such as QPS from Kong. We'd also like to know what the bottleneck is during your benchmark—it could be 100% CPU usage by Kong, full network card utilization, or 100% CPU usage by Redis.
>
> @chobits Looking at the basic metrics of the container, we can see that the Redis timeout is triggered when the number of net.sockets.tp.inuse connections on the machine reaches 6000. How should we tune this? Do you have similar experience?

I think we need to know what these 6000 connections are, e.g. by using `netstat -anp`. But for your issue, I think it needs more in-depth troubleshooting on your side.
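For instance, a quick breakdown of what those ~6000 sockets are could be obtained like this (a sketch with standard tools; port 6379 is a placeholder for the Redis/proxy port):

```sh
# Overall socket summary
ss -s

# Count TCP sockets by state
ss -tan | awk 'NR>1 {count[$1]++} END {for (s in count) print s, count[s]}'

# Which processes own established connections to the Redis port (6379 is a placeholder)
ss -tnp state established '( dport = :6379 or sport = :6379 )'
```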

> determined that the 50 ms delay was in user mode

And for this, you might need to use a debugging tool, such as a SystemTap or gdb script, to hook some C functions and see why the delay occurs in user mode (Kong + Lua). For nginx/Kong/Lua, 50 ms of pure computation is rare and easy to spot (e.g. as a 100% CPU usage metric), so I guess there might be some I/O waiting operation.
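As one possible starting point, a rough SystemTap sketch that reports event-loop iterations longer than 40 ms (the nginx binary path is an assumption, and user-space probing needs the binary's debug symbols):

```sh
sudo stap -e '
global start
# hypothetical probe: time each pass through the nginx worker event loop
probe process("/usr/local/openresty/nginx/sbin/nginx").function("ngx_process_events_and_timers") {
  start[tid()] = gettimeofday_ms()
}
probe process("/usr/local/openresty/nginx/sbin/nginx").function("ngx_process_events_and_timers").return {
  if (tid() in start) {
    elapsed = gettimeofday_ms() - start[tid()]
    if (elapsed > 40)
      printf("worker %d: event-loop iteration took %d ms\n", tid(), elapsed)
    delete start[tid()]
  }
}'
```

Off-CPU profiling (for example via the stapxx scripts mentioned earlier in this thread) would be another way to see whether the worker is blocked waiting on I/O rather than computing.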

github-actions[bot] commented 4 weeks ago

This issue is marked as stale because it has been open for 14 days with no activity.

github-actions[bot] commented 3 weeks ago

Dear contributor,

We are automatically closing this issue because it has not seen any activity for three weeks. We're sorry that your issue could not be resolved. If any new information comes up that could help resolve it, please feel free to reopen it.

Your contribution is greatly appreciated!

Please have a look at our pledge to the community for more information.

Sincerely, Your Kong Gateway team