alptugay closed this issue 1 year ago.
@alptugay Which version of lua-resty-healthcheck were you ultimately using when the error occurred? Do the upstream servers change status frequently at that time, e.g. pods restarting or shutting down, new upstreams being added, or transient network errors?
I suggest using lua-resty-healthcheck 2.2.3, increasing the size of the upstream-healthcheck shared dict (e.g. to 500m), and rechecking.
Also, what is the configuration item redis_healthcheck_shared_dict_size? Have you changed the code of APISIX or lua-resty-healthcheck?
It may be the same issue as https://github.com/apache/apisix/issues/9015.
Hi @leslie-tsang,
Their team is monitoring the status and will update when necessary.
Due to the lack of response from the reporter, this issue has been labeled with "no response". It will be closed in 3 days if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@apisix.apache.org list. Thank you for your contributions.
This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.
We have encountered the issue once again. Our current findings are:
1) 4 out of 6 APISIX instances in the same environment were affected, at different times on the same day. These are identical machines running with the same configs.
APISIX INSTANCE 1
2023/08/08 15:30:09 [error] 480719#480719: *24417911412 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.92:31125': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 15:30:10 [error] 480724#480724: *24417921777 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.92:31125': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 15:30:11 [error] 480739#480739: *24417932673 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.92:31125': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 15:30:12 [error] 480716#480716: *24417943934 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.92:31125': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 15:38:39 [error] 480728#480728: *24422048257 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.194:32708': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 15:38:40 [error] 480735#480735: *24422057094 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.194:32708': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 15:38:41 [error] 480723#480723: *24422065842 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.194:32708': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 15:38:42 [error] 480733#480733: *24422074126 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.194:32708': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 15:38:43 [error] 480736#480736: *24422083755 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.194:32708': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
APISIX INSTANCE 2
2023/08/08 14:23:39 [error] 608073#608073: *24408524288 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.194:31125': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 14:23:41 [error] 608090#608090: *24408534735 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.194:31125': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 14:23:42 [error] 608078#608078: *24408545713 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.194:31125': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 14:23:43 [error] 608087#608087: *24408556591 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.194:31125': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
APISIX INSTANCE 3
2023/08/08 16:29:39 [error] 67061#67061: *27466062548 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.94:31725': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 16:29:40 [error] 67045#67045: *27466073872 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.94:31725': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 16:29:41 [error] 67063#67063: *27466084337 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.94:31725': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 16:29:42 [error] 67041#67041: *27466094834 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.94:31725': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 16:34:33 [error] 67043#67043: *27468414132 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.91:30238': unlocked, context: ngx.timer, client: XXX.XXX.XXX.191, server: 0.0.0.0:80
2023/08/08 16:34:34 [error] 67050#67050: *27468422768 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.91:30238': unlocked, context: ngx.timer, client: XXX.XXX.XXX.94, server: 0.0.0.0:80
2023/08/08 16:34:35 [error] 67051#67051: *27468432136 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.91:30238': unlocked, context: ngx.timer, client: XXX.XXX.XXX.94, server: 0.0.0.0:80
2023/08/08 16:34:36 [error] 67042#67042: *27468440315 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.91:30238': unlocked, context: ngx.timer, client: XXX.XXX.XXX.197, server: 0.0.0.0:80
2023/08/08 17:13:03 [error] 291549#291549: *27482520502 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.92:30057': unlocked, context: ngx.timer, client: XXX.XXX.XXX.196, server: 0.0.0.0:80
2023/08/08 17:13:04 [error] 67057#67057: *27482527129 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.92:30057': unlocked, context: ngx.timer, client: XXX.XXX.XXX.198, server: 0.0.0.0:80
2023/08/08 17:13:05 [error] 67046#67046: *27482533105 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.92:30057': unlocked, context: ngx.timer, client: XXX.XXX.XXX.197, server: 0.0.0.0:80
2023/08/08 17:13:06 [error] 291659#291659: *27482539195 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.92:30057': unlocked, context: ngx.timer, client: XXX.XXX.XXX.198, server: 0.0.0.0:80
2023/08/08 17:22:33 [error] 291921#291921: *27485105465 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.95:32190': unlocked, context: ngx.timer, client: 10.212.200.129, server: 0.0.0.0:443
2023/08/08 17:22:34 [error] 67058#67058: *27485109964 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.95:32190': unlocked, context: ngx.timer, client: XXX.XXX.XXX.197, server: 0.0.0.0:80
2023/08/08 17:22:36 [error] 291582#291582: *27485115892 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.95:32190': unlocked, context: ngx.timer, client: XXX.XXX.XXX.96, server: 0.0.0.0:80
2023/08/08 17:22:37 [error] 67072#67072: *27485120591 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.95:32190': unlocked, context: ngx.timer, client: XXX.XXX.XXX.198, server: 0.0.0.0:80
APISIX INSTANCE 4
2023/08/08 15:54:03 [error] 65587#65587: *27451715285 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.93:32496': unlocked, context: ngx.timer, client: XXX.XXX.XXX.197, server: 0.0.0.0:80
2023/08/08 15:54:04 [error] 65602#65602: *27451726740 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.93:32496': unlocked, context: ngx.timer, client: XXX.XXX.XXX.197, server: 0.0.0.0:80
2023/08/08 15:54:05 [error] 65606#65606: *27451738230 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.93:32496': unlocked, context: ngx.timer, client: XXX.XXX.XXX.197, server: 0.0.0.0:80
2023/08/08 15:54:06 [error] 65596#65596: *27451750351 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.93:32496': unlocked, context: ngx.timer, client: XXX.XXX.XXX.191, server: 0.0.0.0:80
2) As can be seen in the logs, the errors were related to two upstreams: 58e0a3b27292c0d9267ce78a7dfa45ab and f51cbfa1fac84b2110ba93f947c93313.
3) We found that even though the backends of these upstreams serve plain HTTP, we had mistakenly set the scheme of the upstreams to HTTPS.
4) Also, there are duplicate IP-port pairs in these upstream definitions. Below you can see the content of these upstreams (a small duplicate-detection sketch follows the second dump).
For example, the following pair appears more than once in the nodes list:
"port": 32330,
"host": "XXX.XXX.XXX.30"
Upstream: 58e0a3b27292c0d9267ce78a7dfa45ab
{
"value": {
"type": "roundrobin",
"create_time": 1684840040,
"timeout": {
"connect": 300,
"read": 300,
"send": 300
},
"name": "thanos-query-store-api",
"id": "58e0a3b27292c0d9267ce78a7dfa45ab",
"hash_on": "vars",
"desc": "source:consul",
"pass_host": "pass",
"update_time": 1691489116,
"labels": {
"source": "consul"
},
"scheme": "https",
"nodes": [
{
"priority": 0,
"port": 32347,
"weight": 1,
"host": "XXX.XXX.XXX.33"
},
{
"priority": 0,
"port": 32337,
"weight": 1,
"host": "XXX.XXX.XXX.29"
},
{
"priority": 0,
"port": 32337,
"weight": 1,
"host": "XXX.XXX.XXX.31"
},
{
"priority": 0,
"port": 32330,
"weight": 1,
"host": "XXX.XXX.XXX.30"
},
{
"priority": 0,
"port": 32496,
"weight": 1,
"host": "XXX.XXX.XXX.93"
},
{
"priority": 0,
"port": 32347,
"weight": 1,
"host": "XXX.XXX.XXX.31"
},
{
"priority": 0,
"port": 31141,
"weight": 1,
"host": "XXX.XXX.XXX.33"
},
{
"priority": 0,
"port": 32496,
"weight": 1,
"host": "XXX.XXX.XXX.198"
},
{
"priority": 0,
"port": 32190,
"weight": 1,
"host": "XXX.XXX.XXX.195"
},
{
"priority": 0,
"port": 30057,
"weight": 1,
"host": "XXX.XXX.XXX.92"
},
{
"priority": 0,
"port": 30238,
"weight": 1,
"host": "XXX.XXX.XXX.91"
},
{
"priority": 0,
"port": 31141,
"weight": 1,
"host": "XXX.XXX.XXX.30"
},
{
"priority": 0,
"port": 32330,
"weight": 1,
"host": "XXX.XXX.XXX.31"
},
{
"priority": 0,
"port": 32190,
"weight": 1,
"host": "XXX.XXX.XXX.95"
},
{
"priority": 0,
"port": 32330,
"weight": 1,
"host": "XXX.XXX.XXX.30"
},
{
"priority": 0,
"port": 30057,
"weight": 1,
"host": "XXX.XXX.XXX.92"
},
{
"priority": 0,
"port": 32330,
"weight": 1,
"host": "XXX.XXX.XXX.30"
},
{
"priority": 0,
"port": 32330,
"weight": 1,
"host": "XXX.XXX.XXX.31"
},
{
"priority": 0,
"port": 30238,
"weight": 1,
"host": "XXX.XXX.XXX.191"
}
],
"checks": {
"active": {
"unhealthy": {
"timeouts": 3,
"http_statuses": [
429,
404,
500,
501,
502,
503,
504,
505
],
"interval": 1,
"http_failures": 5,
"tcp_failures": 2
},
"type": "tcp",
"http_path": "/",
"healthy": {
"successes": 2,
"http_statuses": [
200,
302
],
"interval": 1
},
"timeout": 1,
"concurrency": 10,
"https_verify_certificate": true
}
}
},
"modifiedIndex": 1695071,
"createdIndex": 1165,
"key": "/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab"
}
Upstream: f51cbfa1fac84b2110ba93f947c93313
{
"value": {
"type": "roundrobin",
"create_time": 1684840040,
"timeout": {
"connect": 300,
"read": 300,
"send": 300
},
"nodes": [
{
"priority": 0,
"port": 30249,
"weight": 1,
"host": "XXX.XXX.XXX.33"
},
{
"priority": 0,
"port": 31725,
"weight": 1,
"host": "XXX.XXX.XXX.94"
},
{
"priority": 0,
"port": 31914,
"weight": 1,
"host": "XXX.XXX.XXX.96"
},
{
"priority": 0,
"port": 31125,
"weight": 1,
"host": "XXX.XXX.XXX.194"
},
{
"priority": 0,
"port": 31569,
"weight": 1,
"host": "XXX.XXX.XXX.30"
},
{
"priority": 0,
"port": 31500,
"weight": 1,
"host": "XXX.XXX.XXX.33"
},
{
"priority": 0,
"port": 30593,
"weight": 1,
"host": "XXX.XXX.XXX.30"
},
{
"priority": 0,
"port": 32708,
"weight": 1,
"host": "XXX.XXX.XXX.194"
},
{
"priority": 0,
"port": 31125,
"weight": 1,
"host": "XXX.XXX.XXX.92"
},
{
"priority": 0,
"port": 30593,
"weight": 1,
"host": "XXX.XXX.XXX.33"
},
{
"priority": 0,
"port": 31725,
"weight": 1,
"host": "XXX.XXX.XXX.93"
},
{
"priority": 0,
"port": 31569,
"weight": 1,
"host": "XXX.XXX.XXX.33"
},
{
"priority": 0,
"port": 31914,
"weight": 1,
"host": "XXX.XXX.XXX.94"
},
{
"priority": 0,
"port": 32708,
"weight": 1,
"host": "XXX.XXX.XXX.95"
},
{
"priority": 0,
"port": 31500,
"weight": 1,
"host": "XXX.XXX.XXX.31"
},
{
"priority": 0,
"port": 30249,
"weight": 1,
"host": "XXX.XXX.XXX.31"
}
],
"pass_host": "pass",
"hash_on": "vars",
"desc": "source:consul",
"id": "f51cbfa1fac84b2110ba93f947c93313",
"update_time": 1689321033,
"labels": {
"source": "consul"
},
"scheme": "https",
"name": "thanos-queryfrontend-api",
"checks": {
"active": {
"unhealthy": {
"timeouts": 3,
"interval": 1,
"http_statuses": [
429,
404,
500,
501,
502,
503,
504,
505
],
"http_failures": 5,
"tcp_failures": 2
},
"type": "tcp",
"http_path": "/",
"healthy": {
"successes": 2,
"http_statuses": [
200,
302
],
"interval": 1
},
"timeout": 1,
"concurrency": 10,
"https_verify_certificate": true
}
}
},
"modifiedIndex": 1239801,
"createdIndex": 1181,
"key": "/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313"
}
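As referenced in finding 4, here is a small illustrative Lua sketch (not part of APISIX) that scans an upstream document of the shape dumped above and reports host:port pairs that appear more than once in its nodes list; the function name and usage are hypothetical:

```lua
-- Hypothetical helper (not APISIX code): scan an upstream document of the
-- shape dumped above and report host:port pairs that appear more than once.
local cjson = require("cjson.safe")

local function find_duplicate_nodes(upstream_json)
    local doc = cjson.decode(upstream_json)
    if not doc or not doc.value or type(doc.value.nodes) ~= "table" then
        return nil, "unexpected upstream document"
    end

    local seen, duplicates = {}, {}
    for _, node in ipairs(doc.value.nodes) do
        local key = node.host .. ":" .. tostring(node.port)
        if seen[key] then
            duplicates[#duplicates + 1] = key
        end
        seen[key] = true
    end
    return duplicates
end

-- Usage sketch: doc_json would hold one of the etcd values shown above.
-- local dups = find_duplicate_nodes(doc_json)
-- for _, key in ipairs(dups or {}) do print("duplicate node: " .. key) end
```

Run over the 58e0a3b27292c0d9267ce78a7dfa45ab dump, it would for example flag XXX.XXX.XXX.30:32330, which appears three times in the nodes list.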
5) There is only one entry in the output of ss -t that relates to the upstream 58e0a3b27292c0d9267ce78a7dfa45ab:
2023/08/08 17:22:37 [error] 67072#67072: *27485120591 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.95:32190': unlocked, context: ngx.timer, client: XXX.XXX.XXX.198, server: 0.0.0.0:80
SYN-SENT 0 1 YYY.YYY.YYY.YYY:42790 XXX.XXX.XXX.29:32337
XXX.XXX.XXX.29:32337 is one of the nodes of upstream 58e0a3b27292c0d9267ce78a7dfa45ab.
6) Here are some logs from just before the "failed to release lock" messages appear. You can ignore the worker process **** exited on signal 9 logs: we have a script that kills a worker as soon as it detects a "failed to release lock" log line.
2023/08/08 17:22:27 [error] 291898#291898: *27485100457 [lua] events.lua:273: post(): worker-events: failed posting event "mostly_healthy" by "lua-resty-healthcheck [upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab]"; no memory while connecting to upstream, client: XXX.XXX.XXX.191, server: _, request: "GET / HTTP/1.1", upstream: "https://XXX.XXX.XXX.91:30238/", host: "thanos-query-store-api.AAAA.BBBB.com"
2023/08/08 17:22:27 [error] 291898#291898: *27485100457 SSL_do_handshake() failed (SSL: error:1408F10B:SSL routines:ssl3_get_record:wrong version number) while SSL handshaking to upstream, client: XXX.XXX.XXX.191, server: _, request: "GET / HTTP/1.1", upstream: "https://XXX.XXX.XXX.95:32190/", host: "thanos-query-store-api.AAAA.BBBB.com"
2023/08/08 17:22:28 [error] 291695#291695: *27485100523 [lua] events.lua:364: poll(): worker-events: dropping event; waiting for event data timed out, id: 10019295, context: ngx.timer
2023/08/08 17:22:28 [error] 291632#291632: *27485100556 [lua] events.lua:364: poll(): worker-events: dropping event; waiting for event data timed out, id: 10019295, context: ngx.timer
2023/08/08 17:22:28 [error] 67058#67058: *27485100578 [lua] events.lua:364: poll(): worker-events: dropping event; waiting for event data timed out, id: 10019295, context: ngx.timer
2023/08/08 17:22:28 [error] 291940#291940: *27485100654 [lua] events.lua:364: poll(): worker-events: dropping event; waiting for event data timed out, id: 10019295, context: ngx.timer
2023/08/08 17:22:28 [error] 67044#67044: *27485100658 [lua] events.lua:364: poll(): worker-events: dropping event; waiting for event data timed out, id: 10019295, context: ngx.timer
2023/08/08 17:22:28 [error] 291520#291520: *27485100669 [lua] events.lua:364: poll(): worker-events: dropping event; waiting for event data timed out, id: 10019295, context: ngx.timer
2023/08/08 17:22:28 [error] 291921#291921: *27485100780 [lua] events.lua:364: poll(): worker-events: dropping event; waiting for event data timed out, id: 10019295, context: ngx.timer
2023/08/08 17:22:28 [error] 291682#291682: *27485100840 [lua] events.lua:364: poll(): worker-events: dropping event; waiting for event data timed out, id: 10019295, context: ngx.timer
2023/08/08 17:22:28 [error] 291582#291582: *27485100855 [lua] events.lua:364: poll(): worker-events: dropping event; waiting for event data timed out, id: 10019295, context: ngx.timer
2023/08/08 17:22:28 [error] 67077#67077: *27485100854 [lua] events.lua:364: poll(): worker-events: dropping event; waiting for event data timed out, id: 10019295, context: ngx.timer
2023/08/08 17:22:28 [error] 291877#291877: *27485100859 [lua] events.lua:364: poll(): worker-events: dropping event; waiting for event data timed out, id: 10019295, context: ngx.timer
2023/08/08 17:22:28 [error] 67072#67072: *27485100860 [lua] events.lua:364: poll(): worker-events: dropping event; waiting for event data timed out, id: 10019295, context: ngx.timer
2023/08/08 17:22:33 [error] 291520#291520: *27485126279 SSL_do_handshake() failed (SSL: error:1408F10B:SSL routines:ssl3_get_record:wrong version number) while SSL handshaking to upstream, client: XXX.XXX.XXX.191, server: _, request: "GET / HTTP/1.1", upstream: "https://XXX.XXX.XXX.94:31725/", host: "thanos-queryfrontend-api.AAAA.BBBB.com"
2023/08/08 17:22:33 [error] 291520#291520: *27485126279 SSL_do_handshake() failed (SSL: error:1408F10B:SSL routines:ssl3_get_record:wrong version number) while SSL handshaking to upstream, client: XXX.XXX.XXX.191, server: _, request: "GET / HTTP/1.1", upstream: "https://XXX.XXX.XXX.93:31725/", host: "thanos-queryfrontend-api.AAAA.BBBB.com"
2023/08/08 17:22:33 [error] 291520#291520: *27485126279 SSL_do_handshake() failed (SSL: error:1408F10B:SSL routines:ssl3_get_record:wrong version number) while SSL handshaking to upstream, client: XXX.XXX.XXX.191, server: _, request: "GET / HTTP/1.1", upstream: "https://XXX.XXX.XXX.194:31125/", host: "thanos-queryfrontend-api.AAAA.BBBB.com"
2023/08/08 17:22:33 [error] 291520#291520: *27485126279 SSL_do_handshake() failed (SSL: error:1408F10B:SSL routines:ssl3_get_record:wrong version number) while SSL handshaking to upstream, client: XXX.XXX.XXX.191, server: _, request: "GET / HTTP/1.1", upstream: "https://XXX.XXX.XXX.194:32708/", host: "thanos-queryfrontend-api.AAAA.BBBB.com"
2023/08/08 17:22:33 [error] 291520#291520: *27485126279 SSL_do_handshake() failed (SSL: error:1408F10B:SSL routines:ssl3_get_record:wrong version number) while SSL handshaking to upstream, client: XXX.XXX.XXX.191, server: _, request: "GET / HTTP/1.1", upstream: "https://XXX.XXX.XXX.92:31125/", host: "thanos-queryfrontend-api.AAAA.BBBB.com"
2023/08/08 17:22:33 [error] 291520#291520: *27485126279 SSL_do_handshake() failed (SSL: error:1408F10B:SSL routines:ssl3_get_record:wrong version number) while SSL handshaking to upstream, client: XXX.XXX.XXX.191, server: _, request: "GET / HTTP/1.1", upstream: "https://XXX.XXX.XXX.96:31914/", host: "thanos-queryfrontend-api.AAAA.BBBB.com"
2023/08/08 17:22:33 [error] 291520#291520: *27485126279 SSL_do_handshake() failed (SSL: error:1408F10B:SSL routines:ssl3_get_record:wrong version number) while SSL handshaking to upstream, client: XXX.XXX.XXX.191, server: _, request: "GET / HTTP/1.1", upstream: "https://XXX.XXX.XXX.94:31914/", host: "thanos-queryfrontend-api.AAAA.BBBB.com"
2023/08/08 17:22:33 [error] 291520#291520: *27485126279 SSL_do_handshake() failed (SSL: error:1408F10B:SSL routines:ssl3_get_record:wrong version number) while SSL handshaking to upstream, client: XXX.XXX.XXX.191, server: _, request: "GET / HTTP/1.1", upstream: "https://XXX.XXX.XXX.95:32708/", host: "thanos-queryfrontend-api.AAAA.BBBB.com"
2023/08/08 17:22:33 [error] 291520#291520: *27485126279 [lua] balancer.lua:362: run(): failed to pick server: failed to find valid upstream server, all upstream servers tried while connecting to upstream, client: XXX.XXX.XXX.191, server: _, request: "GET / HTTP/1.1", upstream: "https://XXX.XXX.XXX.95:32708/", host: "thanos-queryfrontend-api.AAAA.BBBB.com"
2023/08/08 17:22:33 [error] 291921#291921: *27485105465 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.95:32190': unlocked, context: ngx.timer, client: YYY.YYY.YYY.129, server: 0.0.0.0:443
2023/08/08 17:22:34 [alert] 27489#27489: worker process 291921 exited on signal 9
2023/08/08 17:22:34 [error] 67058#67058: *27485109964 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.95:32190': unlocked, context: ngx.timer, client: XXX.XXX.XXX.197, server: 0.0.0.0:80
2023/08/08 17:22:35 [alert] 27489#27489: worker process 67058 exited on signal 9
2023/08/08 17:22:36 [error] 291582#291582: *27485115892 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.95:32190': unlocked, context: ngx.timer, client: XXX.XXX.XXX.96, server: 0.0.0.0:80
2023/08/08 17:22:37 [alert] 27489#27489: worker process 291582 exited on signal 9
2023/08/08 17:22:37 [error] 67072#67072: *27485120591 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.95:32190': unlocked, context: ngx.timer, client: XXX.XXX.XXX.198, server: 0.0.0.0:80
2023/08/08 17:22:38 [alert] 27489#27489: worker process 67072 exited on signal 9
I think the SSL errors are related to our finding number 3. For reference, our worker-events dict size is 1024m.
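The "no memory while posting event" errors in these logs suggest the worker-events shared dict may still be filling up despite that size. A minimal sketch for checking this from a debug or timer context, assuming the dict is named "worker-events" as in a standard APISIX nginx config (adjust to your setup):

```lua
-- Minimal sketch (dict name assumed): log how full the worker-events
-- shared dict is. Requires OpenResty's capacity()/free_space() APIs.
local dict = ngx.shared["worker-events"]
if not dict then
    ngx.log(ngx.ERR, "shared dict 'worker-events' not found")
    return
end

local capacity = dict:capacity()   -- total size of the dict in bytes
local free = dict:free_space()     -- bytes remaining in completely free pages
ngx.log(ngx.WARN, "worker-events: capacity=", capacity,
        " bytes, free_space=", free, " bytes")
```

Note that free_space() only counts completely empty pages, so a low value does not always mean allocations will fail, but a steady drop alongside these errors would support the out-of-memory explanation.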
7) We also have an interesting finding. Below is a normal log entry for the related service (captured after we killed the affected workers); it is expected that all upstreams return 502 because of finding number 3:
{"@timestamp":"2023-08-08T17:56:03.484664+03:00", "_team":"traffic", "_app":"loadbalancer", "_federation":"AAA", "_region":"BBB", "_environment":"prod", "_cluster":"internal", "_hostname":"Apisix01", "_service":"loadbalancer", "tag":"lb_access_non_2xx3xx:", "severity":"6", "facility":"23", "severity_label":"info", "facility_label":"local7", "cf_ipcountry":"", "http_x_client_ip":"", "http_True_Client_IP":"", "upstream_http_X_Proxy_Cache":"", "request_id":"5364e8127a236e359699bffe80851093", "route_id":"", "request_length":"120", "remote_addr":"XXX.XXX.XXX.196", "remote_port":"10506", "request":"GET / HTTP/1.1", "args":"", "uri":"/", "status":"502", "bytes_sent":"754", "http_user_agent":"Go-http-client/1.1", "http_x_forwarded_for":"", "http_host":"thanos-queryfrontend-api.aaaa.bbbb.com", "server_name":"_", "request_time":"0.012", "upstream":"XXX.XXX.XXX.95:32708, XXX.XXX.XXX.31:31500, XXX.XXX.XXX.31:30249, XXX.XXX.XXX.33:30249, XXX.XXX.XXX.30:31569, XXX.XXX.XXX.33:31500, XXX.XXX.XXX.30:30593, XXX.XXX.XXX.33:30593, XXX.XXX.XXX.33:31569, XXX.XXX.XXX.94:31725, XXX.XXX.XXX.96:31914, XXX.XXX.XXX.194:31125, XXX.XXX.XXX.194:32708, XXX.XXX.XXX.92:31125, XXX.XXX.XXX.93:31725, XXX.XXX.XXX.94:31914", "upstream_connect_time":"-, -, -, -, -, -, -, -, -, -, -, -, -, -, -, -", "upstream_status":"502, 502, 502, 502, 502, 502, 502, 502, 502, 502, 502, 502, 502, 502, 502, 502", "upstream_response_time":"0.002, 0.000, 0.000, 0.001, 0.000, 0.000, 0.001, 0.000, 0.000, 0.002, 0.000, 0.001, 0.001, 0.001, 0.001, 0.001", "upstream_cache_status":"", "ssl_protocol":"", "ssl_cipher":"", "scheme":"http", "server_port":"80", "request_method":"GET", "server_protocol":"HTTP/1.1", "http_cf_ray":"", "ty_lb_waf_id":"", "ty_lb_cc":"", "ty_lb_asn":""}
During the problem, however, the logs for this service look like the unusual entry below:
{"@timestamp":"2023-08-08T17:37:03.482507+03:00", "_team":"traffic", "_app":"loadbalancer", "_federation":"AAA", "_region":"BBB", "_environment":"prod", "_cluster":"internal", "_hostname":"Apisix01", "_service":"loadbalancer", "tag":"lb_access_non_2xx3xx:", "severity":"6", "facility":"23", "severity_label":"info", "facility_label":"local7", "cf_ipcountry":"", "http_x_client_ip":"", "http_True_Client_IP":"", "upstream_http_X_Proxy_Cache":"", "request_id":"ef16786dd332690d1fd06228bb08788c", "route_id":"", "request_length":"120", "remote_addr":"XXX.XXX.XXX.94", "remote_port":"61553", "request":"GET / HTTP/1.1", "args":"", "uri":"/", "status":"502", "bytes_sent":"680", "http_user_agent":"Go-http-client/1.1", "http_x_forwarded_for":"", "http_host":"thanos-queryfrontend-api.aaaa.bbbb.com", "server_name":"_", "request_time":"0.014", "upstream":"XXX.XXX.XXX.96:31914, XXX.XXX.XXX.194:31125, XXX.XXX.XXX.194:32708, XXX.XXX.XXX.92:31125, XXX.XXX.XXX.93:31725, XXX.XXX.XXX.94:31914, XXX.XXX.XXX.95:32708, XXX.XXX.XXX.94:31725 : ", "upstream_connect_time":"-, -, -, -, -, -, -, - : ", "upstream_status":"502, 502, 502, 502, 502, 502, 502, 502 : ", "upstream_response_time":"0.002, 0.001, 0.005, 0.001, 0.001, 0.001, 0.002, 0.001 : ", "upstream_cache_status":"", "ssl_protocol":"", "ssl_cipher":"", "scheme":"http", "server_port":"80", "request_method":"GET", "server_protocol":"HTTP/1.1", "http_cf_ray":"", "ty_lb_waf_id":"", "ty_lb_cc":"", "ty_lb_asn":""}
Please look at the upstream, upstream_status, upstream_connect_time, and upstream_response_time fields: they look incomplete, and they all end with a colon character.
Hello @alptugay,
A fix is in progress; see https://github.com/api7/lua-resty-healthcheck/commit/f1d81af1601915c7dcbe160078d4f1d09424ba36 for more information.
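For background on the message itself: "unlocked" is the error string lua-resty-lock's unlock() returns when the lock object is not currently holding the lock (for example, it has already been released). A minimal sketch of that pattern, with a placeholder shared dict name and key rather than APISIX's actual code:

```lua
-- Minimal sketch of the lua-resty-lock pattern behind the log message.
-- The shared dict name ("my_locks") and key are placeholders, not APISIX code.
local resty_lock = require("resty.lock")

local function with_target_lock(key, critical_section)
    -- exptime bounds how long the lock may be held before it auto-expires
    local lock, err = resty_lock:new("my_locks", { timeout = 5, exptime = 10 })
    if not lock then
        return nil, "failed to create lock: " .. (err or "unknown")
    end

    local elapsed, lerr = lock:lock(key)
    if not elapsed then
        return nil, "failed to acquire lock: " .. (lerr or "unknown")
    end

    local ok, res = pcall(critical_section)

    -- unlock() returns nil plus the string "unlocked" when this lock object
    -- is not currently holding the lock (e.g. it was already released);
    -- that is the error the healthcheck logs above are reporting.
    local released, uerr = lock:unlock()
    if not released then
        ngx.log(ngx.ERR, "failed to release lock '", key, "': ", uerr)
    end

    if not ok then
        return nil, res
    end
    return res
end
```

In the logs above, the same target logs this error roughly once per second, which matches the 1-second active check interval configured on these upstreams.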
@alptugay Have you tried this fix? https://github.com/api7/lua-resty-healthcheck/commit/f1d81af1601915c7dcbe160078d4f1d09424ba36
@alptugay You can try APISIX version >= 3.4, where the lua-resty-healthcheck library has been upgraded.
Current Behavior
Our APISIX instance has more than 2500 routes and upstreams, all of which have active health checks enabled. Sometimes (once or twice a week) we see one or more workers using 100% CPU. At the same time we see the following error log:
We have encountered this situation on multiple instances and multiple times:
At the exact same time, we can see that the CPU core running this worker starts to run at 100%.
That CPU core also shows an increase in time spent.
Sockstat usage increases as well:
After killing the worker process, things return to normal (by the way, we can't kill the process gracefully; we have to use "kill -9").
We normally use APISIX 3.2.1, but to solve the issue we cherry-picked v3.0.0 of lua-resty-healthcheck, because its lock mechanism seems to have changed; that caused a massive memory leak, so we reverted.
All of our upstreams have the following timeout and healthcheck values:
There are some other abnormalities as well; I don't know whether they are related, so I'll briefly share them too:
We see lots of upstream timeout errors even though the upstreams are healthy and running. These connections seem to be related to a Kubernetes watch API (not sure).
We have Tengine running in parallel with APISIX, both with health checks enabled, but the connection states are very different. For example, on Tengine we see fewer connections in TIME_WAIT and more in use:
In APISIX, however, we see more connections in TIME_WAIT and fewer in use:
When we disable health checks, it returns to a state similar to Tengine's.
Our config.yml
Expected Behavior
CPU should not be consumed at 100%.
Error Logs
2023/07/04 03:15:53 [error] 161670#161670: *27856946269 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/routes/461091278096960700) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/routes/461091278096960700:target_lock:...:8500': unlocked, context: ngx.timer, client: ...*, server: 0.0.0.0:80
Steps to Reproduce
It seems to be random, but we have 2500+ routes/services and upstreams, all with active health checks enabled.
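For anyone trying to approximate this setup, the sketch below (run with the resty CLI or LuaJIT plus lua-cjson; the node address is a placeholder) builds an upstream body with the same timeout and active-check settings as the dumps above. It could then be sent to the Admin API in a loop to create a large number of upstreams; the exact Admin API call is left out here.

```lua
-- Sketch only: build an upstream definition mirroring the timeout and
-- active-check settings shown in the dumps above. The node address is a
-- placeholder; adapt before use.
local cjson = require("cjson")

local function build_upstream(nodes)
    return {
        type = "roundrobin",
        scheme = "https",  -- scheme mismatch against plain-HTTP backends, as in finding 3 above
        timeout = { connect = 300, read = 300, send = 300 },
        nodes = nodes,
        checks = {
            active = {
                type = "tcp",
                http_path = "/",
                timeout = 1,
                concurrency = 10,
                https_verify_certificate = true,
                healthy = {
                    successes = 2,
                    interval = 1,
                    http_statuses = { 200, 302 },
                },
                unhealthy = {
                    timeouts = 3,
                    interval = 1,
                    http_failures = 5,
                    tcp_failures = 2,
                    http_statuses = { 429, 404, 500, 501, 502, 503, 504, 505 },
                },
            },
        },
    }
end

print(cjson.encode(build_upstream({
    { host = "192.0.2.10", port = 8080, weight = 1, priority = 0 },
})))
```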
Environment
etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info):
LuaRocks version, for installation issues (run luarocks --version):