Kong / kong

🦍 The Cloud-Native API Gateway and AI Gateway.
https://konghq.com/install/#kong-community
Apache License 2.0

"failed to release lock" error during active healthchecks #9221

Closed: codemug closed this issue 2 years ago

codemug commented 2 years ago

Is there an existing issue for this?

Kong version ($ kong version)

2.8.1

Current Behavior

I'm running Kong in declarative mode to expose a bunch of services. I've configured the following active healthcheck on all of them:

active:
  type: http
  httpPath: "/health"
  healthy:
    httpStatuses:
      - 200
      - 302
    interval: 10
    successes: 2
  timeout: 15
  unhealthy:
    httpStatuses:
      - 429
      - 404
      - 500
      - 501
      - 502
      - 503
      - 504
      - 505
    httpFailures: 8
    interval: 10
    tcpFailures: 8
    timeouts: 8

Every once in a while, an upstream target randomly becomes unhealthy, and even after the actual upstream service becomes available again, the target's status never returns to healthy.

In one of these instances, I've noticed the following logs in Kong proxy:

2022/08/09 12:04:18 [error] 1118#0: *136780853 [lua] healthcheck.lua:1235: log(): [healthcheck] (0dc6f45b-8f8d-40d2-a504-473544ee190b:my-service) failed to release lock 'lua-resty-healthcheck:0dc6f45b-8f8d-40d2-a504-473544ee190b:my-service:target_list_lock': unlocked, context: ngx.timer
2022/08/09 12:04:18 [error] 1118#0: *136780854 [lua] healthcheck.lua:1235: log(): [healthcheck] (0dc6f45b-8f8d-40d2-a504-473544ee190b:my-service) failed to release lock 'lua-resty-healthcheck:0dc6f45b-8f8d-40d2-a504-473544ee190b:my-service:target_list_lock': unlocked, context: ngx.timer
2022/08/09 12:04:18 [warn] 1115#0: *136781978 [lua] healthcheckers.lua:98: callback(): [healthchecks] balancer 0dc6f45b-8f8d-40d2-a504-473544ee190b:my-service reported health status changed to UNHEALTHY, context: ngx.timer

What kind of lock is this?

I've even checked Kong's resource usage: it's allowed to use 2 CPUs and 4 GB of memory, but CPU usage remains at around 1% and memory never crosses 1 GB.

Expected Behavior

The upstream target should be set to healthy by the healthchecker once the upstream service is available.

Steps To Reproduce

This occurs very randomly; I have tried to reproduce it several times, but to no avail.

Anything else?

No response

surenraju-careem commented 2 years ago

Kong 2.8.1 uses lua-resty-healthcheck 1.5.1. There is a new minor version, lua-resty-healthcheck 1.5.2, which has a lot of bug fixes related to lock handling.

https://github.com/Kong/lua-resty-healthcheck/compare/1.5.1...1.5.2

https://github.com/Kong/lua-resty-healthcheck/pull/113 https://github.com/Kong/lua-resty-healthcheck/pull/114 https://github.com/Kong/lua-resty-healthcheck/pull/116

lua-resty-healthcheck 1.5.2 no longer even contains the "failed to release" log statement reported in this issue.

Is there any plan to release Kong (e.g. 2.8.2) with lua-resty-healthcheck 1.5.2?

codemug commented 2 years ago

I should add that my upstream has a single target registered with it. I suspect this has something to do with the default value of healthchecks.active.concurrency, which is 10. Maybe having 10 parallel timers on a single target is causing lock timeouts due to long response delays from the upstream.
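For reference, that setting lives under healthchecks.active in the upstream definition. A minimal sketch of lowering it in declarative (kong.yml) form, where the keys are snake_case and the upstream name is just a placeholder:

upstreams:
- name: my-service             # placeholder upstream name
  healthchecks:
    active:
      concurrency: 1           # default is 10 parallel checks per upstream
      type: http
      http_path: "/health"
      healthy:
        interval: 10
        successes: 2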

codemug commented 2 years ago

Adding more info here: I think this is related to how I'm managing the routing configuration in my setup. I'm running Kong in declarative mode, and I have a sidecar daemon that does a POST to the config endpoint every 5 seconds (I based this design on the ingress controller). From the debug logs I can see that, because of this, the configuration cache is flushed and refreshed every time a POST call is made, which may also be resetting the healthchecks. I'm investigating it more.
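For context, this is the standard DB-less config push against the Admin API; the sidecar does something along these lines every 5 seconds (the Admin API address and file path here are just placeholders):

  # re-apply the full declarative configuration (DB-less mode)
  curl -sS -X POST http://localhost:8001/config \
    -F config=@/etc/kong/kong.yml

Each such POST replaces the entire configuration, so the balancers and their healthcheckers are rebuilt, which appears to be what resets the health status described above.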

codemug commented 2 years ago

My above assumption was correct. Each new POST call will reset the healthchecks. Closing this ticket.

mcdullbloom commented 1 year ago

I also have an ingress controller that changes my upstream nodes, so the healthcheck object gets created again, and I also see the "failed to release lock" log. I still don't understand why resetting the healthchecks leads to the failed lock release. Can you explain it in more detail? Thanks.

mcdullbloom commented 1 year ago

I found the answer by debugging lua-resty-lock. There is a bug in the pcall handling, which has since been fixed:

fix(lock) handle all resty.lock failure modes (https://github.com/Kong/lua-resty-healthcheck/pull/112)
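The "unlocked" string in the log comes from resty.lock itself: unlock() returns nil plus "unlocked" when the lock object no longer holds the lock, for example when it is released twice. A minimal standalone sketch of that failure mode (not the actual healthcheck code; the shared dict name is a placeholder):

  -- assumes an OpenResty context with `lua_shared_dict my_locks 1m;` configured
  local resty_lock = require "resty.lock"

  local lock, err = resty_lock:new("my_locks")
  if not lock then
    ngx.log(ngx.ERR, "failed to create lock: ", err)
    return
  end

  local elapsed = lock:lock("target_list_lock")
  -- ... critical section ...
  lock:unlock()                          -- first release succeeds

  local ok, unlock_err = lock:unlock()   -- second release fails
  if not ok then
    -- unlock_err is "unlocked": the string seen in the
    -- "failed to release lock ...: unlocked" lines above
    ngx.log(ngx.ERR, "failed to release lock: ", unlock_err)
  end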
