Open zuiyangqingzhou opened 11 months ago
The health check API requires a number of failed attempts before a node is marked as unhealthy and removed. If both of your nodes are unavailable, then your request is guaranteed to fail either way (with or without the health check mechanism).
I know what you mean, but even when a domain has other healthy nodes available, there is no guarantee that traffic is always forwarded to them, because the IP address picked from the domain's DNS records is random.
@shreemaan-abhishek , I would like to debug this.
I was able to reproduce the issue, but this is not a bug. APISIX only becomes aware of a failed node once it is unable to connect to the service, so skipping bad nodes even on the first request would be a feature rather than a bug fix.
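To illustrate why the first request can still reach a bad node, here is a minimal sketch of threshold-based failure counting. It is not APISIX's actual health checker; the `NodeHealth` class and its fields are hypothetical, and the threshold of 2 is borrowed from the `tcp_failures` value in the route posted in this issue.

```python
from dataclasses import dataclass


@dataclass
class NodeHealth:
    """Tracks consecutive failures for one upstream address (hypothetical model)."""
    failure_threshold: int
    consecutive_failures: int = 0
    healthy: bool = True

    def report(self, success: bool) -> bool:
        """Record one probe or request result and return the current health state."""
        if success:
            # A success resets the failure streak. Recovery of a node that is
            # already unhealthy is deliberately left out of this sketch.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.healthy = False
        return self.healthy


# With a threshold of 2, the first failed connection still leaves the node
# marked healthy, so it can be picked again.
node = NodeHealth(failure_threshold=2)
states = [node.report(success=False) for _ in range(3)]
print(states)
```

The node only flips to unhealthy once the failure streak reaches the threshold, which is the window in which requests can still be sent to it.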
hey, could you tell me the steps that you performed to set up the domains locally on your system? cc: @sheharyaar @zuiyangqingzhou
@zuiyangqingzhou, do these domains still contain one faulty node?
Yes, you can refer to this.
You can use dnsmasq to build your local domain name resolution @nitishfy
The usage of domain names in the upstream node makes it impossible to distinguish the healthy nodes.
Prometheus scrapes the metric apisix_upstream_status using IP addresses instead of domain names, leaving us unaware of the corresponding node.
Can't we just use domain name instead of IP in the healthcheck API? @shreemaan-abhishek @sheharyaar
If two domains resolve to the same IP, APISIX may even use the domain name of an unhealthy node. https://github.com/apache/apisix/blob/master/apisix/balancer.lua#L110
    for _, node in ipairs(nodes) do
        if node.domain then
            local addr = node.host .. ":" .. node.port
            addr_to_domain[addr] = node.domain
        end
    end
https://github.com/apache/apisix/blob/master/apisix/balancer.lua#L261
local domain = server_picker.addr_to_domain[server]
res.domain = domain
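The collision described above can be sketched in a few lines. This is a Python re-creation of the quoted loop, not the APISIX code itself; the `build_addr_to_domain` helper and the node dictionaries are hypothetical, and the shared address is an assumption made only to trigger the overwrite.

```python
def build_addr_to_domain(nodes):
    """Mirror the quoted loop: map "host:port" to the node's domain."""
    addr_to_domain = {}
    for node in nodes:
        if node.get("domain"):
            addr = f'{node["host"]}:{node["port"]}'
            # A later node with the same address silently replaces the earlier one.
            addr_to_domain[addr] = node["domain"]
    return addr_to_domain


# Hypothetical resolved nodes: both domains happen to resolve to the same address.
nodes = [
    {"host": "192.168.247.2", "port": 80, "domain": "www.mytest.com"},
    {"host": "192.168.247.2", "port": 80, "domain": "www.mytemp.com"},
]
mapping = build_addr_to_domain(nodes)
print(mapping)
```

Because the map is keyed by address only, a lookup for that address can return a domain other than the one the request was balanced to.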
Current Behavior
The faulty node still receives traffic.
https://github.com/apache/apisix/blob/master/apisix/utils/upstream.lua#L70
According to the code here, when the upstream node is a load balancer or a domain name, DNS resolution is performed, but only one IP is returned, chosen at random.
The randomly returned IP can therefore be the faulty node.
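A rough simulation of this behavior, assuming the resolver picks uniformly among the A records (the `resolve_one` function is a simplified stand-in, not the code in upstream.lua). The addresses are the ones listed for www.mytest.com in the reproduction steps, where 192.168.247.4 is unreachable.

```python
import random
from collections import Counter

# A records for www.mytest.com from the reproduction steps; 192.168.247.4 is down.
RECORDS = ["192.168.247.4", "192.168.247.2", "192.168.247.3"]
BAD_IP = "192.168.247.4"


def resolve_one(records, rng):
    """Return a single randomly chosen IP (simplified stand-in for the resolver)."""
    return rng.choice(records)


rng = random.Random(0)  # seeded only so the run is repeatable
picks = Counter(resolve_one(RECORDS, rng) for _ in range(3000))
bad_share = picks[BAD_IP] / 3000
print(picks, round(bad_share, 2))
```

Under that assumption, roughly one resolution in three lands on the unreachable address, regardless of how many healthy records the domain has.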
Expected Behavior
Faulty nodes should be removed and should not receive traffic.
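One way to picture the expected behavior is to filter the resolved addresses by health status before picking one. This is only a sketch of the idea, not a proposed patch: `pick_healthy` is a hypothetical helper, and the `unhealthy` set is assumed to be supplied by the health checker.

```python
import random


def pick_healthy(records, unhealthy, rng):
    """Pick a random IP among those not currently marked unhealthy.

    Returns None when every resolved IP is unhealthy.
    """
    candidates = [ip for ip in records if ip not in unhealthy]
    if not candidates:
        return None
    return rng.choice(candidates)


# A records for www.mytemp.com from the reproduction steps.
records = ["192.168.246.3", "192.168.246.4", "192.168.246.2"]
unhealthy = {"192.168.246.3"}  # assumed to come from the health checker
rng = random.Random()
chosen = [pick_healthy(records, unhealthy, rng) for _ in range(100)]
print(set(chosen))
```

This requires keeping all resolved addresses rather than a single one, which is the main difference from the behavior described in this issue.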
Error Logs
2023/12/09 22:36:56 [error] 15767#89433274: *42241 [lua] balancer.lua:363: run(): failed to pick server: failed to find valid upstream server, all upstream servers tried while connecting to upstream, client: 127.0.0.1, server: _, request: "GET /dns/test HTTP/1.1", upstream: "http://192.168.247.4:80/dns/test", host: "127.0.0.1:9080"
Steps to Reproduce
www.mytest.com. 0 IN A 192.168.247.4
www.mytest.com. 0 IN A 192.168.247.2
www.mytest.com. 0 IN A 192.168.247.3
$ dig @127.0.0.1 www.mytemp.com
www.mytemp.com. 0 IN A 192.168.246.3
www.mytemp.com. 0 IN A 192.168.246.4
www.mytemp.com. 0 IN A 192.168.246.2
$ curl http://192.168.247.4/
curl: (7) Failed to connect to 192.168.247.4 port 80 after 4888 ms: Couldn't connect to server
$ curl http://192.168.246.3/
curl: (7) Failed to connect to 192.168.246.3 port 80 after 4888 ms: Couldn't connect to server
{
  "id": "490771170321239793",
  "create_time": 1702052012,
  "update_time": 1702132481,
  "uri": "/dns/test",
  "name": "dns_test",
  "methods": ["GET", "POST", "PUT", "DELETE", "PATCH", "HEAD", "OPTIONS", "CONNECT", "TRACE"],
  "upstream": {
    "nodes": {
      "www.mytemp.com:80": 1,
      "www.mytest.com:80": 1
    },
    "timeout": {
      "connect": 6,
      "send": 6,
      "read": 6
    },
    "type": "roundrobin",
    "checks": {
      "active": {
        "concurrency": 10,
        "healthy": {
          "http_statuses": [200, 302],
          "interval": 1,
          "successes": 2
        },
        "http_path": "/aa",
        "port": 80,
        "timeout": 1,
        "type": "http",
        "unhealthy": {
          "http_failures": 5,
          "http_statuses": [429, 404, 500, 501, 502, 503, 504, 505],
          "interval": 1,
          "tcp_failures": 2,
          "timeouts": 3
        }
      }
    },
    "scheme": "http",
    "pass_host": "pass",
    "keepalive_pool": {
      "idle_timeout": 60,
      "requests": 1000,
      "size": 320
    }
  },
  "status": 1
}
curl http://127.0.0.1:9080/dns/test -i
HTTP/1.1 502 Bad Gateway
Date: Sat, 09 Dec 2023 14:36:21 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 154
Connection: keep-alive
Server: APISIX/3.7.0
X-APISIX-Upstream-Status: 504 :
502 Bad Gateway