Closed: anujjalan closed this issue 5 years ago
@anujjalan sorry for the late reply, I must have somehow missed this.
Do you still have the error?
@Tieske , we are experiencing a similar issue. Added logs below for reference.
2018/11/16 11:11:40 [error] 12156#0: 29901430 [lua] balancer.lua:259: [healthchecks] failed setting peer status: no peer found by name 'a-api.us-west-1.amazonaws.com' and address 50.232.199.244:443, context: ngx.timer

2018/11/16 11:11:54 [error] 26257#0: 29890630 lua entry thread aborted: runtime error: /usr/local/share/lua/5.1/resty/dns/balancer.lua:881: attempt to index local 'address' (a nil value)
stack traceback:
coroutine 0:
    /usr/local/share/lua/5.1/resty/dns/balancer.lua: in function 'getPeer'
    /usr/local/share/lua/5.1/kong/core/balancer.lua:776: in function 'execute'
    /usr/local/share/lua/5.1/kong/core/handler.lua:678: in function 'after'
    /usr/local/share/lua/5.1/kong/init.lua:503: in function 'access'
    access_by_lua(nginx-kong.conf:114):2: in function <access_by_lua(nginx-kong.conf:114):1>, client: 18.212.32.122, server: kong, request: "POST /Advertisers/IR000/Campaigns/7777/Media/1234567/Iterable HTTP/1.1", host: "api.impact.com"

2018/11/16 11:12:05 [error] 26257#0: *29891295 lua entry thread aborted: runtime error: /usr/local/share/lua/5.1/resty/dns/balancer.lua:143: more indices requested to be added (834) than provided (557) for host 'a-api.us-west-1.amazonaws.com:443' (50.232.199.244)
stack traceback:
coroutine 0:
    [C]: in function 'error'
    /usr/local/share/lua/5.1/resty/dns/balancer.lua:143: in function 'addIndices'
    /usr/local/share/lua/5.1/resty/dns/balancer.lua:714: in function 'redistributeIndices'
    /usr/local/share/lua/5.1/resty/dns/balancer.lua:467: in function 'queryDns'
    /usr/local/share/lua/5.1/resty/dns/balancer.lua:587: in function 'getPeer'
    /usr/local/share/lua/5.1/resty/dns/balancer.lua:881: in function 'getPeer'
    /usr/local/share/lua/5.1/kong/core/balancer.lua:776: in function 'execute'
    /usr/local/share/lua/5.1/kong/core/handler.lua:678: in function 'after'
    /usr/local/share/lua/5.1/kong/init.lua:503: in function 'access'
    access_by_lua(nginx-kong.conf:114):2: in function <access_by_lua(nginx-kong.conf:114):1>, client: 34.207.159.28, server: kong, request: "POST /Advertisers/IR000/Campaigns/7777/Media/1234567/Iterable HTTP/1.1", host: "api.impact.com"
@wernervrens what Kong version are you using? Are you using Route 53? (All cases so far seem to be on Amazon.)
Can you reproduce the error? I could send you a debug version to help track this down.
Hello @Tieske.
We’re having a similar (or the same) problem with our Kong servers. It’s possible to reproduce on Kong EE 0.33 and 0.34, and also on Kong CE 1.0.2.
We created two AWS S3 buckets to use as upstream targets. If we enable healthchecks on this upstream and put the targets under heavy request load, the problem appears. We first saw it in production, but we can now consistently reproduce it on a local test box.
The error in the logs is basically this:
2019/02/13 13:01:34 [error] 5233#0: *181385 lua entry thread aborted: runtime error: /usr/local/share/lua/5.1/resty/dns/balancer.lua:881: attempt to index local 'address' (a nil value)
stack traceback:
coroutine 0:
    /usr/local/share/lua/5.1/resty/dns/balancer.lua: in function 'getPeer'
    /usr/local/share/lua/5.1/kong/core/balancer.lua:815: in function 'execute'
    /usr/local/share/lua/5.1/kong/core/handler.lua:699: in function 'after'
    /usr/local/share/lua/5.1/kong/init.lua:627: in function 'access'
    access_by_lua(nginx.conf:130):2: in function <access_by_lua(nginx.conf:130):1>, client: 192.168.121.1, server: kong, request: "GET /v1/f174909b-40f1-4d71-glub-4fa86702b4ab.json HTTP/1.1", host: "192.168.121.239:8000"
… and then, obviously, the request fails with a 500 error.
Steps to reproduce: create the following configuration (an upstream with both targets, a service, and a route); equivalent Admin API calls are sketched after the config dumps below.
Upstream:
http://kong:8001/upstreams/buckets
{"created_at":1550082423,"hash_on":"none","id":"119f5a00-352d-449d-998f-db9654fd8a5f","name":"buckets","hash_fallback_header":null,"hash_on_cookie":null,"healthchecks":{"active":{"unhealthy":{"http_statuses":[429,404,500,501,502,503,504,505],"tcp_failures":0,"timeouts":0,"http_failures":0,"interval":5},"type":"http","http_path":"\/empty.json","timeout":1,"healthy":{"successes":2,"interval":5,"http_statuses":[200,301,302]},"https_sni":null,"https_verify_certificate":true,"concurrency":10},"passive":{"unhealthy":{"http_failures":2,"http_statuses":[400,401,403,404,429,500,501,502,503],"tcp_failures":2,"timeouts":2},"healthy":{"http_statuses":[200,301,302],"successes":2},"type":"http"}},"hash_on_cookie_path":"\/","hash_fallback":"none","hash_on_header":null,"slots":10000}
Upstream targets:
http://kong:8001/upstreams/buckets/targets
{"next":null,"data":[{"created_at":1550082438.316,"upstream":{"id":"119f5a00-352d-449d-998f-db9654fd8a5f"},"id":"3bfb5efe-5072-4df6-95d1-fa4a6745bff5","target":"circuit-breakers-test2.s3-website.us-east-2.amazonaws.com:80","weight":1},{"created_at":1550082432.494,"upstream":{"id":"119f5a00-352d-449d-998f-db9654fd8a5f"},"id":"139b5b02-8cf8-47b3-803e-b81371f83639","target":"circuit-breakers-test1.s3-website-sa-east-1.amazonaws.com:80","weight":1000}]}
Service:
http://kong:8001/services/buckets
{"host":"buckets","created_at":1550065411,"connect_timeout":60000,"id":"53b159aa-3410-4226-9f25-0ce2784001bb","protocol":"http","name":"buckets","read_timeout":60000,"port":80,"path":null,"updated_at":1550065411,"retries":5,"write_timeout":60000}
Route:
http://kong:8001/services/buckets/routes
{"next":null,"data":[{"created_at":1550065419,"methods":null,"id":"3a22cf2b-604c-4fd1-9cba-3db2649e47cd","service":{"id":"53b159aa-3410-4226-9f25-0ce2784001bb"},"name":null,"hosts":null,"updated_at":1550065419,"preserve_host":false,"regex_priority":0,"paths":["\/"],"sources":null,"destinations":null,"snis":null,"protocols":["http","https"],"strip_path":true}]}
Then add a simple file to both S3 buckets and start making requests (at least 200/s) to this file through Kong, e.g.:
curl -sX GET "http://kong:8000/v1/f174909b-40f1-4d71-ae5f-4fa86702b4ab.json"
After a few seconds you can already see these errors happening. We also tested with the kong.conf option dns_valid_ttl=120; it is a little trickier, but we could still reproduce the problem. We either leave it running under stress for a very long time, or do something different: flood it for a few minutes with about 100k requests (100 clients in parallel), wait a few minutes with no traffic, then start the 100 clients again.
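This load pattern can be approximated with any generic HTTP benchmarking tool; the ApacheBench commands below are only an illustrative sketch, not necessarily the exact tool we used:

# Sketch only: ~100k requests from 100 concurrent clients, a quiet period, then repeat.
ab -n 100000 -c 100 "http://kong:8000/v1/f174909b-40f1-4d71-ae5f-4fa86702b4ab.json"
sleep 300   # a few minutes with no traffic
ab -n 100000 -c 100 "http://kong:8000/v1/f174909b-40f1-4d71-ae5f-4fa86702b4ab.json"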
For this test we used a debug version sent by the Kong team that patches the Lua DNS code (lua-resty-dns-client-2.2.0-1.rockspec). Logs showing the problem are attached, with dns_valid_ttl both unset and set to 120: lua-resty-dns-client-2.2.0-1.all.rock.gz, log_debug.tar.gz, log_debug-dns_valid_ttl_120.tar.gz
Please let us know if you need more information on this issue, or if you would like us to test a different debug version, as we can easily reproduce it.
I think I tracked down the issue. #64 contains the test showing the faulty behaviour.
ping @UkiahSmith
The fix has been merged. The downstream update of Kong is tracked here: https://github.com/Kong/kong/pull/3965
So I'll close this now.
Hello @Tieske! When I apply this fix (I saw version 3.0.1-1 is already out) combined with Kong/kong#3965, I see errors on every request, as reported in my comment here. Could you please check?
Thank you!
We are running Kong version 0.14.0 across 8 nodes with around 30 services, and we are intermittently seeing the errors below:
2018/07/19 22:04:09 [error] 9850#0: *431460988 lua entry thread aborted: runtime error: /usr/local/share/lua/5.1/resty/dns/balancer.lua:881: attempt to index local 'address' (a nil value)
stack traceback:
coroutine 0:
    /usr/local/share/lua/5.1/resty/dns/balancer.lua: in function 'getPeer'
    /usr/local/share/lua/5.1/kong/runloop/balancer.lua:792: in function 'execute'
    /usr/local/share/lua/5.1/kong/runloop/handler.lua:637: in function 'after'
    /usr/local/share/lua/5.1/kong/init.lua:485: in function 'access'
    access_by_lua(nginx-kong.conf:87):2: in function <access_by_lua(nginx-kong.conf:87):1>, client: 172.31.11.45, server: kong, request: "GET /api/v1/device/d99c385b-dbf1-462f-ab9a-92c8d1dc1d1d HTTP/1.1", host: "segmentation.XXXX.in"
As mentioned, it happens intermittently, and whenever it occurs the request results in a 500. It affects some of our services, not all. Do we know the reason for this error? If yes, can you help us resolve it?
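If it helps, we can capture diagnostics whenever the 500s occur; a sketch of what we would gather is below (UPSTREAM_NAME and TARGET_HOSTNAME are placeholders for the affected upstream and its target, and this assumes the upstream health endpoint is available in our Kong 0.14 Admin API):

# Balancer/healthcheck view of the upstream and its configured targets:
curl -s http://localhost:8001/upstreams/UPSTREAM_NAME/health
curl -s http://localhost:8001/upstreams/UPSTREAM_NAME/targets

# Current DNS answer for a target, to compare record count and TTL with what Kong sees:
dig +noall +answer TARGET_HOSTNAME A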