apache / apisix

The Cloud-Native API Gateway
https://apisix.apache.org/blog/
Apache License 2.0

feat: dns resolution for upstream nodes should not return IPs that are unavailable/faulty #10624

Open zuiyangqingzhou opened 9 months ago

zuiyangqingzhou commented 9 months ago

Current Behavior

A faulty node still receives forwarded traffic.

https://github.com/apache/apisix/blob/master/apisix/utils/upstream.lua#L70

According to the code here, when an upstream node is an LB or a domain name, DNS resolution is performed, but only one of the resolved IPs is returned, chosen at random.

As a result, the randomly returned IP can happen to be the faulty node.
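The behavior described above can be modeled in a few lines. This is a simplified Python sketch, not APISIX's Lua code: `resolve_one` is a hypothetical stand-in for a resolver that returns a single A record chosen uniformly at random, and the addresses are the ones from the reproduction steps in this issue.

```python
import random

# A records for www.mytest.com, taken from the reproduction steps in this issue.
RECORDS = ["192.168.247.4", "192.168.247.2", "192.168.247.3"]
FAULTY = {"192.168.247.4"}  # the node that refuses connections


def resolve_one(records, rng):
    """Hypothetical resolver: return a single address chosen at random."""
    return rng.choice(records)


def faulty_pick_rate(records, faulty, trials, seed=0):
    """Fraction of simulated resolutions that land on a faulty address."""
    rng = random.Random(seed)
    hits = sum(resolve_one(records, rng) in faulty for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    # With one faulty address out of three, the rate should be close to 1/3.
    print(f"{faulty_pick_rate(RECORDS, FAULTY, 30_000):.3f}")
```

Under this assumption, roughly one resolution in three for this domain returns the faulty address, regardless of how many healthy addresses exist.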

Expected Behavior

Faulty nodes should be removed from the candidate set and should not receive traffic.
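One way to picture the requested behavior is to filter the resolved addresses against a set of known-unhealthy ones before picking. The sketch below is in Python and purely illustrative: `pick_healthy` and the `unhealthy` set are hypothetical names, and the fallback to the full record set when everything is marked unhealthy is an assumed design choice, not something APISIX is known to do.

```python
import random


def pick_healthy(records, unhealthy, rng=random):
    """Return a random address that is not marked unhealthy.

    Assumption: if every address is marked unhealthy, fall back to the
    full record set so the caller still has something to try.
    """
    candidates = [ip for ip in records if ip not in unhealthy]
    return rng.choice(candidates or list(records))


if __name__ == "__main__":
    records = ["192.168.247.4", "192.168.247.2", "192.168.247.3"]
    print(pick_healthy(records, unhealthy={"192.168.247.4"}))
```

With the faulty address excluded, only the two healthy addresses from the reproduction setup can be returned.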

Error Logs

2023/12/09 22:36:56 [error] 15767#89433274: *42241 [lua] balancer.lua:363: run(): failed to pick server: failed to find valid upstream server, all upstream servers tried while connecting to upstream, client: 127.0.0.1, server: _, request: "GET /dns/test HTTP/1.1", upstream: "http://192.168.247.4:80/dns/test", host: "127.0.0.1:9080"

Steps to Reproduce

1. Prepare two domains, each resolving to three A records:

    $ dig @127.0.0.1 www.mytest.com

    www.mytest.com. 0 IN A 192.168.247.4
    www.mytest.com. 0 IN A 192.168.247.2
    www.mytest.com. 0 IN A 192.168.247.3

    $ dig @127.0.0.1 www.mytemp.com

    www.mytemp.com. 0 IN A 192.168.246.3
    www.mytemp.com. 0 IN A 192.168.246.4
    www.mytemp.com. 0 IN A 192.168.246.2

2. Each domain has one faulty node:

    $ curl http://192.168.247.4/
    curl: (7) Failed to connect to 192.168.247.4 port 80 after 4888 ms: Couldn't connect to server

    $ curl http://192.168.246.3/
    curl: (7) Failed to connect to 192.168.246.3 port 80 after 4888 ms: Couldn't connect to server

3. The complete route configuration is as follows:

    {
      "id": "490771170321239793",
      "create_time": 1702052012,
      "update_time": 1702132481,
      "uri": "/dns/test",
      "name": "dns_test",
      "methods": ["GET", "POST", "PUT", "DELETE", "PATCH", "HEAD", "OPTIONS", "CONNECT", "TRACE"],
      "upstream": {
        "nodes": {
          "www.mytemp.com:80": 1,
          "www.mytest.com:80": 1
        },
        "timeout": {"connect": 6, "send": 6, "read": 6},
        "type": "roundrobin",
        "checks": {
          "active": {
            "concurrency": 10,
            "healthy": {
              "http_statuses": [200, 302],
              "interval": 1,
              "successes": 2
            },
            "http_path": "/aa",
            "port": 80,
            "timeout": 1,
            "type": "http",
            "unhealthy": {
              "http_failures": 5,
              "http_statuses": [429, 404, 500, 501, 502, 503, 504, 505],
              "interval": 1,
              "tcp_failures": 2,
              "timeouts": 3
            }
          }
        },
        "scheme": "http",
        "pass_host": "pass",
        "keepalive_pool": {"idle_timeout": 60, "requests": 1000, "size": 320}
      },
      "status": 1
    }

4. Initiate a request:

    $ curl http://127.0.0.1:9080/dns/test -i

5. With some probability, the request fails as follows:

    HTTP/1.1 502 Bad Gateway
    Date: Sat, 09 Dec 2023 14:36:21 GMT
    Content-Type: text/html; charset=utf-8
    Content-Length: 154
    Connection: keep-alive
    Server: APISIX/3.7.0
    X-APISIX-Upstream-Status: 504 :

    502 Bad Gateway
    openresty
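The "certain probability" in step 5 can be estimated with a little arithmetic. The Python sketch below assumes that each domain is resolved independently to one address chosen uniformly at random, which is how the issue describes the behavior; it says nothing about how often resolution is repeated or how retries inside the balancer play out. The function names are illustrative only.

```python
from fractions import Fraction


def p_faulty(total_ips, faulty_ips):
    """Chance that one uniform random pick lands on a faulty address."""
    return Fraction(faulty_ips, total_ips)


def p_any_faulty(per_domain):
    """Chance that at least one domain resolves to a faulty address."""
    p_all_ok = Fraction(1)
    for p in per_domain:
        p_all_ok *= 1 - p
    return 1 - p_all_ok


def p_all_faulty(per_domain):
    """Chance that every domain resolves to a faulty address."""
    result = Fraction(1)
    for p in per_domain:
        result *= p
    return result


if __name__ == "__main__":
    # Two domains, each with three A records and one faulty node.
    per_domain = [p_faulty(3, 1), p_faulty(3, 1)]
    print(p_any_faulty(per_domain))  # 5/9
    print(p_all_faulty(per_domain))  # 1/9
```

Under these assumptions, at least one of the two upstream nodes points at a faulty address in 5 out of 9 resolutions, and both do in 1 out of 9.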


### Environment

- APISIX version (run `apisix version`): APISIX/3.7.0
- Operating system (run `uname -a`):  Darwin
- OpenResty / Nginx version (run `openresty -V` or `nginx -V`):  nginx version: openresty/1.21.4.2
- etcd version, if relevant (run `curl http://127.0.0.1:9090/v1/server_info`):
- APISIX Dashboard version, if relevant:
- Plugin runner version, for issues related to plugin runners:
- LuaRocks version, for installation issues (run `luarocks --version`):

shreemaan-abhishek commented 9 months ago

The health check requires a number of failed attempts before a node is marked as unhealthy and removed. If both of your nodes are unavailable, then your request is guaranteed to fail either way (with or without the health check mechanism).
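The point about needing several tries can be illustrated with a small threshold counter. This is a minimal Python model, not APISIX's health checker: `FailureCounter` is a hypothetical name, it assumes consecutive failures are counted, and it does not model recovery back to healthy. The threshold of 2 mirrors `tcp_failures` in the configuration from this issue.

```python
class FailureCounter:
    """Mark a node unhealthy after `threshold` consecutive failed probes."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.failures = 0
        self.healthy = True

    def report(self, ok):
        """Record one probe result and return the current health state."""
        if ok:
            # A success resets the streak; recovery is not modeled here.
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.healthy = False
        return self.healthy


if __name__ == "__main__":
    node = FailureCounter(threshold=2)
    print(node.report(False))  # True: one failure is below the threshold
    print(node.report(False))  # False: second consecutive failure
```

Until the threshold is reached, the node still counts as healthy, so requests sent in that window can reach it.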

zuiyangqingzhou commented 9 months ago

> The health check requires a number of failed attempts before a node is marked as unhealthy and removed. If both of your nodes are unavailable, then your request is guaranteed to fail either way (with or without the health check mechanism).

I understand, but even when a domain has other healthy nodes available, there is no guarantee that traffic will always be forwarded to them, because the IP address returned when the domain is resolved is chosen at random.

sheharyaar commented 8 months ago

@shreemaan-abhishek, I would like to debug this.

sheharyaar commented 8 months ago

I was able to reproduce the issue, but this is not a bug. APISIX can only become aware of a failed node once it is unable to connect to the service, so skipping bad nodes even on the first attempt would be a feature rather than a bug fix.

nitishfy commented 4 months ago

hey, could you tell me the steps that you performed to set up the domains locally on your system? cc: @sheharyaar @zuiyangqingzhou

shreemaan-abhishek commented 4 months ago

@zuiyangqingzhou, do these domains still contain one faulty node? image

zuiyangqingzhou commented 4 months ago

> @zuiyangqingzhou, do these domains still contain one faulty node? image

Yes, you can refer to this.

image

zuiyangqingzhou commented 4 months ago

> hey, could you tell me the steps that you performed to set up the domains locally on your system? cc: @sheharyaar @zuiyangqingzhou

@nitishfy You can use dnsmasq to set up local domain name resolution.

gliffcheung commented 2 months ago

Using domain names for upstream nodes makes it impossible to tell which nodes are healthy. Prometheus reports the apisix_upstream_status metric by IP address rather than by domain name, so we cannot tell which node a given entry corresponds to. Couldn't the healthcheck API use the domain name instead of the IP? @shreemaan-abhishek @sheharyaar

gliffcheung commented 2 months ago

If two domains resolve to the same IP, APISIX may even use the domain name of an unhealthy node. https://github.com/apache/apisix/blob/master/apisix/balancer.lua#L110

        for _, node in ipairs(nodes) do
            if node.domain then
                local addr = node.host .. ":" .. node.port
                addr_to_domain[addr] = node.domain
            end
        end

https://github.com/apache/apisix/blob/master/apisix/balancer.lua#L261

    local domain = server_picker.addr_to_domain[server]

    res.domain = domain
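The collision can be seen by mirroring the Lua loop quoted above in a few lines of Python. This is a sketch of the same map-building logic, not APISIX code; the addresses and domain names are made-up placeholders.

```python
def build_addr_to_domain(nodes):
    """Mirror of the quoted loop: map "host:port" to the node's domain."""
    addr_to_domain = {}
    for node in nodes:
        if node.get("domain"):
            addr = f'{node["host"]}:{node["port"]}'
            # A later node with the same address overwrites the earlier entry.
            addr_to_domain[addr] = node["domain"]
    return addr_to_domain


if __name__ == "__main__":
    # Hypothetical nodes: two domains that resolved to the same IP and port.
    nodes = [
        {"host": "10.0.0.1", "port": 80, "domain": "a.example.com"},
        {"host": "10.0.0.1", "port": 80, "domain": "b.example.com"},
    ]
    print(build_addr_to_domain(nodes))  # {'10.0.0.1:80': 'b.example.com'}
```

Because the map is keyed by address only, a lookup for `10.0.0.1:80` always yields the last domain written, whichever node the balancer actually meant to pick.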