apache / apisix

The Cloud-Native API Gateway
https://apisix.apache.org/blog/
Apache License 2.0
13.96k stars 2.45k forks source link

bug: dns resolution did not resume immediately after the dns server resume #10093

Closed wklken closed 1 day ago

wklken commented 10 months ago

Current Behavior

In our production, when the dns server reload frequently. The requests will be all fail, while the dns resolution not ok failed to parse domain: test.com, error: failed to query the DNS server: dns client error: 101 empty record received;

But when the dns server is stable, the dns resolution of apisix did not resume immediately, the requests got error [lua] init.lua:486: handle_upstream(): failed to set upstream: no valid upstream node for hours, until we do apisix reload, the dns turns ok.

And from the log, one apisix instance with 4 processes, 3 processes got the dns failed, 1 processes got no error all the requests are ok.


apisix/core/dns/client.lua

function _M.resolve(self, domain, selector)
    local client = self.client

    local answers, err = client.resolve(domain)
    if not answers then
        return nil, "failed to query the DNS server: " .. err
    end

how the client.resolve detect the dns server is ok?

Expected Behavior

dns resolution resume immediately

Error Logs

when the dns server is reloading

[lua] upstream.lua:79: parse_domain_for_nodes(): dns resolver domain: test.com error: failed to query the DNS server: dns client error: 101 empty record received
[lua] resolver.lua:80: parse_domain(): failed to parse domain: test.com, error: failed to query the DNS server: dns client error: 101 empty record received

when the dns server is stable (no dns error logs, why?)

[lua] init.lua:486: handle_upstream(): failed to set upstream: no valid upstream node

so, it means when the dns server is stable, the parse_domain success, no error logs, but the handle_upstream still get no valid upstream node

and after apisix reload, no error logs

  1. dns client parse success, but return empty new_nodes? (cache?)
  2. dns client parse success and return new_nodes, but set_upstream fail?

Steps to Reproduce

For now, I cannot reproduce it, but it has occurred three times in our production. Any clues or advice for making dns server reload frequently?

Environment

wklken commented 10 months ago

two upstream:

  1. route.upstream
  2. route.service_id -> service.upstream

the first upstream did not resume immediately

https://github.com/apache/apisix/blob/77704832ec91117f5ca7171811ae5f0d3f1494fe/apisix/init.lua#L422-L445

and it seems the parse_domain_in_route did not know the dns server is resumed

https://github.com/apache/apisix/blob/77704832ec91117f5ca7171811ae5f0d3f1494fe/apisix/init.lua#L210-L239

the comment (from https://github.com/apache/apisix/commit/ce4d8fb1e55f94c03c3a641f8f0d1a62c189ceaf)

    -- don't modify the modifiedIndex to avoid plugin cache miss because of DNS resolve result
    -- has changed

while the parse_domain_in_up modify the modifiedIndex

https://github.com/apache/apisix/blob/77704832ec91117f5ca7171811ae5f0d3f1494fe/apisix/utils/upstream.lua#L90-L112

jiangfucheng commented 10 months ago

Hi, thank you for your report. Could you give us a more detailed description of your configuration?

  1. What is the local DNS server you are running locally? And we need the related configuration.
  2. Could you paste your related configuration of APISIX?
wklken commented 10 months ago

I have reproduce @jiangfucheng

just mock dns fail in lua (not 100%, shoud retry for few times)

apisix config:

  1. start apisix with 4 processes
  2. change then nginx error log level to info

register route:

  1. add a route /get with one upstream, one node like httpbin.org; the parsed data in log is "nodes":[{"host":"54.175.87.239","priority":1,"domain":"httpbin.org","port":80,"weight":100,"upstream_host":"httpbin.org"}]
  2. change apisix/utils/upstream.lua to mock dns random error, note the 1693542089 is a now() + 60
local function parse_domain_for_nodes(nodes) 
    local new_nodes = core.table.new(#nodes, 0)
    for _, node in ipairs(nodes) do
        local host = node.host
        if not ipmatcher.parse_ipv4(host) and
                not ipmatcher.parse_ipv6(host) then

            if host == "httpbin.org" and ngx_now() <= 1693542089 then
                if random(1,10) % 2 == 0 then
                        core.log.error("101 empty records")
                        return new_nodes
                end
                -- return nil, "failed to query the DNS server: timeout11"
            end
            local ip, err = core.resolver.parse_domain(host)
            if ip then
                local new_node = core.table.clone(node)
                new_node.host = ip
                new_node.domain = host
                core.table.insert(new_nodes, new_node)
            end
  1. apisix reload

do requests

  1. wrk -c2 -t2 -d80s http://{}/get
  2. and after the wrk finished, do curl http://{}/get for about 10 times, got 200 and 503

the log

when 503, there got no dns logs before no valid upstream node

2023/09/01 04:24:20 [info] 140701#140701: *15533385 [lua] radixtree_uri_with_parameter.lua:53: match(): route match mode: radixtree_uri_with_parameter
2023/09/01 04:24:20 [info] 140701#140701: *15533385 [lua] init.lua:572: http_access_phase(): matched route: {"key":"/bk-gateway-apisix/routes/test.prod.2423","modifiedIndex":532500,"clean_handlers":"table: 0x7f7d645c64c8","has_domain":true,"value":{"create_time":1693309224,"methods":["GET"],"service_id":"test.prod.stage-12","update_time":1693539224,"timeout":{"send":150,"read":150,"connect":150},"labels":{"gateway.bk.tencent.com/gateway":"test","gateway.bk.tencent.com/stage":"prod"},"uris":["/api/test/prod/dns","/api/test/prod/dns/"],"status":1,"priority":0,"name":"test-prod-dns","id":"test.prod.2423","plugins":{"bk-proxy-rewrite":{"match_subpath":false,"subpath_param_name":":ext","method":"GET","use_real_request_uri_unsafe":false,"uri":"/get"},"bk-resource-context":{"bk_resource_auth":{"resource_perm_required":false,"skip_user_verification":false,"verified_app_required":false,"verified_user_required":false},"bk_resource_id":2423,"bk_resource_name":"dns","bk_resource_auth_obj":{"resource_perm_required":false,"skip_user_verification":false,"verified_app_required":false,"verified_user_required":false}}},"upstream":{"nodes":{},"hash_on":"vars","scheme":"http","parent":{"key":"/bk-gateway-apisix/routes/test.prod.2423","modifiedIndex":532500,"clean_handlers":{},"has_domain":true,"value":"table: 0x7f7d9706c708","update_count":0,"orig_modifiedIndex":532500,"createdIndex":532332},"type":"roundrobin","pass_host":"node","timeout":{"send":150,"read":150,"connect":150}}},"update_count":0,"orig_modifiedIndex":532500,"createdIndex":532332
2023/09/01 04:24:20 [info] 140701#140701: *15533385 [lua] plugin.lua:630: merge_service_route():   route conf: {"key":"/bk-gateway-apisix/routes/test.prod.2423","modifiedIndex":532500,"clean_handlers":"table: 0x7f7d645c64c8","has_domain":true,"value":{"create_time":1693309224,"methods":["GET"],"service_id":"test.prod.stage-12","update_time":1693539224,"timeout":{"send":150,"read":150,"connect":150},"labels":{"gateway.bk.tencent.com/gateway":"test","gateway.bk.tencent.com/stage":"prod"},"uris":["/api/test/prod/dns","/api/test/prod/dns/"],"status":1,"priority":0,"name":"test-prod-dns","id":"test.prod.2423","plugins":{"bk-proxy-rewrite":{"match_subpath":false,"subpath_param_name":":ext","method":"GET","use_real_request_uri_unsafe":false,"uri":"/get"},"bk-resource-context":{"bk_resource_auth":{"resource_perm_required":false,"skip_user_verification":false,"verified_app_required":false,"verified_user_required":false},"bk_resource_id":2423,"bk_resource_name":"dns","bk_resource_auth_obj":{"resource_perm_required":false,"skip_user_verification":false,"verified_app_required":false,"verified_user_required":false}}},"upstream":{"nodes":{},"hash_on":"vars","scheme":"http","parent":{"key":"/bk-gateway-apisix/routes/test.prod.2423","modifiedIndex":532500,"clean_handlers":{},"has_domain":true,"value":"table: 0x7f7d9706c708","update_count":0,"orig_modifiedIndex":532500,"createdIndex":532332},"type":"roundrobin","pass_host":"node","timeout":{"send":150,"read":150,"connect":150}}},"update_count":0,"orig_modifiedIndex":532500,"createdIndex":532332}
2023/09/01 04:24:21 [error] 140701#140701: *15534047 [lua] init.lua:486: handle_upstream(): failed to set upstream: no valid upstream node

the diff of route conf between 503 and 200

image
wklken commented 10 months ago

route.value.upstream.nodes is empty, so the new_nodes is empty(no dns resolve), the compare_upstream_node result true and return

image

so the route.value.upsteram.nodes is being set to an empty value somewhere in the code.

jiangfucheng commented 10 months ago

@wklken Hi, could you paste the APISIX configurations? It's hard to reproduce this bug if no detailed configurations.

wklken commented 10 months ago

@jiangfucheng I still trying to reproduce it in apisix docker-compose, not success yet. I will update the issue when success. (it only can be reproduced on our image build from apache/apisix:3.2.1-centos)

wklken commented 9 months ago

@jiangfucheng Finally, It been reproduced on the apisix docker-compose. It take a lot of time , add I add some scripts to help.

here is the steps, can you please help to investigate this? (this is the problem which stuck our version for production)


Reproduce steps:

1. use the docker-compose

  1. git clone https://github.com/apache/apisix-docker.git
  2. vim apisix_conf/master/config.yaml, change the router and nginx_config. worker_processes

    
    apisix:
    node_listen: 9080              # APISIX listening port
    enable_ipv6: false
    
    router:
    http: radixtree_uri_with_parameter

nginx_config: worker_processes: 4

3. `vim docker-compose-master.yaml` change the `image` to 3.2.1-centos
```yaml
services:
  apisix:
    image: "apache/apisix:3.2.1-centos"
  1. docker-compose -p docker-apisix -f docker-compose-master.yaml up start the apisix and etcd

2. enter the container, do some change to mock the dns failure

all in the container

  1. docker exec -it dockerapisix_apisix_1 /bin/bash
  2. check the conf/config.yaml is right, 4 processes, and the apisix version is 3.2.1
  3. vi apisix/utils/upstream.lua, add code to mock the dns failure, 50% will 101 empty records
-- to line 22

local random = math.random

-- add before core.resolver.parse_domain(host)

local function parse_domain_for_nodes(nodes)
    local new_nodes = core.table.new(#nodes, 0)
    for _, node in ipairs(nodes) do
        local host = node.host
        if not ipmatcher.parse_ipv4(host) and
                not ipmatcher.parse_ipv6(host) then
            if host == "httpbin.org" and ngx_now() <= 1695089345 then
                if random(1,10) % 2 == 0 then
                        core.log.error("101 empty records")
                        return new_nodes
                end
            end
            local ip, err = core.resolver.parse_domain(host)
  1. add a script create.sh to register the route and service, check the API_KEY is right
    
    #!/bin/bash

API_KEY="edd1c9f034335f136f87ad84b625c8f1" ROUTE_ID="dns_route" SERVVICE_ID="dns_service"

curl http://127.0.0.1:9180/apisix/admin/services/${SERVVICE_ID} -H "X-API-KEY: ${API_KEY}" -X PUT -d ' { "upstream": { "nodes": [ { "host": "httpbin.org", "port": 80, "weight": 100, "priority": 1 } ], "type": "roundrobin", "scheme": "http", "pass_host": "node" } }'

curl http://127.0.0.1:9180/apisix/admin/routes/${ROUTE_ID} -H "X-API-KEY: ${API_KEY}" -X PUT -d ' { "uri": "/api/test/prod/dns22", "methods": [ "GET" ], "plugins": { "proxy-rewrite": { "method": "GET", "uri": "/get" } }, "upstream": { "nodes": [ { "host": "httpbin.org", "port": 80, "weight": 100, "priority": 1 } ], "type": "roundrobin", "scheme": "http", "pass_host": "node" }, "service_id": "dns_service", "status": 1 }'

5. add script `update_lua.sh` to change the timestamp fast
```bash
#!/bin/bash

now=$(date "+%s")
echo "now is: ${now}"

A=$(date -d "+30 seconds" "+%s")
echo "will change the condition to <= ${A}"
sed -i -r "s/<= ([0-9]+) then/<= ${A} then/g"  apisix/utils/upstream.lua
echo $?
echo "change done"

apisix reload
  1. do the register
    bash -x create.sh
  2. check the url is ok not in the container
    curl -vv http://0.0.0.0:9080/api/test/prod/dns22

3. not in the container, add check script

not in the container

  1. install wrk
  2. add script start_and_check.sh for benchmark then check 503
    
    #!/bin/bash

date

url="http://0.0.0.0:9080/api/test/prod/dns22"

echo "start bench wrk" wrk -c2 -t2 -d35s ${url}

date

echo "sleep 5 s" sleep 5

echo "check the status code 10 times"

for ((i=1; i<=10; i++)) do status_code=$(curl --write-out %{http_code} --silent --output /dev/null $url) echo "status=$status_code" if [ "${status_code}" -eq "503" ] then echo "503 show" exit fi done


## 4. reproduce

> open two window to run step 1/2 at the same time

1. in container, run 

$ ./update_lua.sh now is: 1695089267 will change the condition to <= 1695089297 0 change done /usr/local/openresty//luajit/bin/luajit ./apisix/cli/apisix.lua reload

WARNING: using fixed Admin API token has security risk. Please modify "admin_key" in conf/config.yaml .

2023/09/19 02:07:47 [notice] 157#157: signal process started

2. out the container, run

$ bash start_and_check.sh

Tue Sep 19 10:08:36 CST 2023 start bench wrk Running 35s test @ http://0.0.0.0:9080/api/test/prod/dns22 2 threads and 2 connections Thread Stats Avg Stdev Max +/- Stdev Latency 15.22ms 80.61ms 823.04ms 95.99% Req/Sec 6.91k 807.96 7.77k 95.97% 461582 requests in 35.00s, 202.93MB read Non-2xx or 3xx responses: 461575 Requests/sec: 13187.64 Transfer/sec: 5.80MB Tue Sep 19 10:09:11 CST 2023 sleep 5 s check the status code 10 times status=503 503 show

3. if `503 show`, means it been reproduced! you can `curl -vv http://0.0.0.0:9080/api/test/prod/dns22` for few times to check

curl -vv http://0.0.0.0:9080/api/test/prod/dns22

wklken commented 9 months ago

More clues for detecting:

  1. should be route.upstream, and the route should related to a service with upstream will cause the problem
    • route without upstream - ok
    • route with upstream, without service_id - ok
  2. 4 workers, only part of them will return 503
  3. if apisix reload, will all 200
  4. the 503 only 1 log 2023/09/19 02:34:52 [error] 854#854: *112694 [lua] init.lua:486: handle_upstream(): failed to set upstream: no valid upstream node, client: 172.31.0.1, server: _, request: "GET /api/test/prod/dns22 HTTP/1.1", host: "0.0.0.0:9080"; it did not call the parse_domain, maybe some cache?
  5. the error log of 503 after dns is ok
    
    2023/09/19 06:26:24 [info] 50022#50022: *5076712 [lua] radixtree_uri_with_parameter.lua:53: match(): route match mode: radixtree_uri_with_parameter, client: 10.244.0.202, server: _, request: "GET /api/test/prod/dns22 HTTP/1.1", host: "bkapi.dev.test.com"
    2023/09/19 06:26:24 [info] 50022#50022: *5076712 [lua] init.lua:572: http_access_phase(): matched route: {"key":"/bk-gateway-apisix/routes/dns_route","modifiedIndex":543015,"createdIndex":543015,"clean_handlers":{},"update_count":0,"orig_modifiedIndex":543015,"has_domain":true,"value":{"id":"dns_route","create_time":1695037859,"uri":"/api/test/prod/dns22","priority":0,"status":1,"update_time":1695037859,"service_id":"dns_service","methods":["GET"],"upstream":{"hash_on":"vars","scheme":"http","pass_host":"node","type":"roundrobin","nodes":{},"parent":{"key":"/bk-gateway-apisix/routes/dns_route","modifiedIndex":543015,"createdIndex":543015,"clean_handlers":"table: 0x7f666206ded0","update_count":0,"orig_modifiedIndex":543015,"has_domain":true,"value":"table: 0x7f66622f1d28"}},"plugins":{"proxy-rewrite":{"method":"GET","uri":"/get","use_real_request_uri_unsafe":false}}}}, client: 10.244.0.202, server: _, request: "GET /api/test/prod/dns22 HTTP/1.1", host: "bkapi.dev.test.com"
    2023/09/19 06:26:24 [info] 50022#50022: *5076712 [lua] plugin.lua:629: merge_service_route(): service conf: {"key":"/bk-gateway-apisix/services/dns_service","modifiedIndex":543014,"createdIndex":543014,"clean_handlers":{},"has_domain":true,"value":{"update_time":1695037858,"id":"dns_service","upstream":{"hash_on":"vars","scheme":"http","pass_host":"node","type":"roundrobin","nodes":[{"port":80,"weight":100,"priority":1,"host":"httpbin.org"}],"parent":{"key":"/bk-gateway-apisix/services/dns_service","modifiedIndex":543014,"createdIndex":543014,"clean_handlers":"table: 0x7f66620322b8","has_domain":true,"value":"table: 0x7f66620684b0"}},"create_time":1695037858}}, client: 10.244.0.202, server: _, request: "GET /api/test/prod/dns22 HTTP/1.1", host: "bkapi.dev.test.com"
    2023/09/19 06:26:24 [info] 50022#50022: *5076712 [lua] plugin.lua:630: merge_service_route():   route conf: {"key":"/bk-gateway-apisix/routes/dns_route","modifiedIndex":543015,"createdIndex":543015,"clean_handlers":{},"update_count":0,"orig_modifiedIndex":543015,"has_domain":true,"value":{"id":"dns_route","create_time":1695037859,"uri":"/api/test/prod/dns22","priority":0,"status":1,"update_time":1695037859,"service_id":"dns_service","methods":["GET"],"upstream":{"hash_on":"vars","scheme":"http","pass_host":"node","type":"roundrobin","nodes":{},"parent":{"key":"/bk-gateway-apisix/routes/dns_route","modifiedIndex":543015,"createdIndex":543015,"clean_handlers":"table: 0x7f666206ded0","update_count":0,"orig_modifiedIndex":543015,"has_domain":true,"value":"table: 0x7f66622f1d28"}},"plugins":{"proxy-rewrite":{"method":"GET","uri":"/get","use_real_request_uri_unsafe":false}}}}, client: 10.244.0.202, server: _, request: "GET /api/test/prod/dns22 HTTP/1.1", host: "bkapi.dev.test.com"
    2023/09/19 06:26:24 [error] 50022#50022: *5076712 [lua] init.lua:486: handle_upstream(): failed to set upstream: no valid upstream node, client: 10.244.0.202, server: _, request: "GET /api/test/prod/dns22 HTTP/1.1", host: "bkapi.dev.test.com"
wklken commented 9 months ago

I add some log, the image is more clear, it will never recover from 503 until apisix reload

image

at apisix/init.lua

image

and, it would be assign to empty nodes here

image

https://github.com/apache/apisix/blob/9b2031a61c8af9c010a923eea89c0dd029e0d971/apisix/utils/upstream.lua#L64-L87


@jiangfucheng @Revolyssup Could someone please verify this?

jiangfucheng commented 9 months ago

@wklken Thanks for your detailed description. Based on your description, I think there are two issues that need to be answered.

  1. Why APISIX will print error log with failed to set upstream: no valid upstream node after parse domain failed. I think you already found the reason for this: parse_domain_for_nodes() function should return nil, error if dns parse failed.

https://github.com/apache/apisix/blob/9b2031a61c8af9c010a923eea89c0dd029e0d971/apisix/utils/upstream.lua#L64-L86

  1. Why the http_access_phase() function print the matched route with a nil node field in your second request? https://github.com/apache/apisix/blob/9b2031a61c8af9c010a923eea89c0dd029e0d971/apisix/init.lua#L494-L504

I think it was caused by api_ctx.matched_route = route this line, it changed the route config in memory. (Note that I think it did not just change the copy table but changed the origin route table).

For the above two issues, I think we just fix the return value in parse_domain_for_nodes function enough.

For other issues, I will continue to try to debug the code to give my opinion.

Another, I have a question, in the above description, you commented handle_upstream will never call parse_domain_for_nodes function in the second request, How do you get the conclusion of this, I find the parse_domain_for_nodes function is not record related logs to indicate the function has been executed.

(I have been quite busy with work recently, and I will help you troubleshoot this issue as soon as possible in my spare time)

Revolyssup commented 9 months ago

@jiangfucheng , in the log you can see that @wklken has added I think some error logs from his side 101 empty record inside parse_domain_for_nodes function.

@wklken Can you show the snippet of parse_domain_for_nodes function where this log is?

Revolyssup commented 9 months ago

@jiangfucheng api_ctx.matched_route is per request so even if it is set wrong, the subsequent request will set it again. In the next request, you can see that api_ctx.matched_route has empty nodes in upstream after it is populated by match() function. The whole issue boils down to, why the matched_route log has no nodes? This is calculated for each request from the etcd. I haven't been able to find the exact flow that can explain where exactly did this nodes={} from previous upstream resulted in api_ctx.matched_route's nodes when api_ctx is freshly calculated per request. And where exactly is the caching happening at that point while matching route.

The only caching culprit I see currently is this https://github.com/apache/apisix/blob/57a7dcdac63a0bda55ddd41fc56a10c6144bfd51/apisix/http/router/radixtree_uri_with_parameter.lua#L38

But modification in _M.user_routes is only done directly by etcd.

Revolyssup commented 9 months ago

@wklken Just to confirm, it only happens when the route has service id, right?

Revolyssup commented 9 months ago

@wklken I think parse_domain_for_nodes is called in subsequent request but there are no nodes to iterate over, so you don't get the log.

wklken commented 9 months ago

@wklken Just to confirm, it only happens when the route has service id, right?

Yes, route has upstream and has service_id (the service have upstream too)

jiangfucheng commented 9 months ago

@jiangfucheng api_ctx.matched_route is per request so even if it is set wrong, the subsequent request will set it again. In the next request, you can see that api_ctx.matched_route has empty nodes in upstream after it is populated by match() function. The whole issue boils down to, why the matched_route log has no nodes? This is calculated for each request from the etcd. I haven't been able to find the exact flow that can explain where exactly did this nodes={} from previous upstream resulted in api_ctx.matched_route's nodes when api_ctx is freshly calculated per request. And where exactly is the caching happening at that point while matching route.

The only caching culprit I see currently is this

https://github.com/apache/apisix/blob/57a7dcdac63a0bda55ddd41fc56a10c6144bfd51/apisix/http/router/radixtree_uri_with_parameter.lua#L38

But modification in _M.user_routes is only done directly by etcd.

You are right, it's my mistake, api_ctx will be created when a new request comes, it does not reuse it.

jiangfucheng commented 9 months ago

@wklken I think parse_domain_for_nodes is called in subsequent request but there are no nodes to iterate over, so you don't get the log.

Agree with this opinion.

Revolyssup commented 9 months ago

@wklken In the first request, did you get a log similar to route items: or insert uri route:???

wklken commented 9 months ago

@wklken In the first request, did you get a log similar to route items: or insert uri route:???

My log level is info and there no route items: or insert uri route: where the first empty nodes present.

Revolyssup commented 9 months ago

@wklken Before empty nodes came, did you find this? What was the last time this log came?

wklken commented 9 months ago
2023/09/19 06:46:16 [info] 51339#51339: *5157502 [lua] route.lua:73: create_radixtree_uri_router(): insert uri route: {"id":"dns_route","service_id":"dns_service","create_time":1695037859,"uri":"/api/test/prod/dns22","status":1,"methods":["GET"],"update_time":1695037859,"upstream":{"type":"roundrobin","pass_host":"node","scheme":"http","nodes":[{"priority":1,"port":80,"host":"httpbin.org","weight":100}],"parent":{"has_domain":true,"createdIndex":543015,"value":{"id":"dns_route","service_id":"dns_service","create_time":1695037859,"uri":"/api/test/prod/dns22","status":1,"methods":"table: 0x7f66620fc778","update_time":1695037859,"upstream":"table: 0x7f6661b40ed0","priority":0,"plugins":"table: 0x7f666236e638"},"modifiedIndex":543015,"orig_modifiedIndex":543015,"key":"/bk-gateway-apisix/routes/dns_route","update_count":0,"clean_handlers":{}},"hash_on":"vars"},"priority":0,"plugins":{"proxy-rewrite":{"method":"GET","use_real_request_uri_unsafe":false,"uri":"/get"}}}\

..... a lot of other insert uri route

2023/09/19 06:46:17  empty nodes  

@Revolyssup 1 second before the empty nodes

Revolyssup commented 9 months ago

cc: @kingluo

wklken commented 9 months ago

https://github.com/apache/apisix/issues/10093#issuecomment-1725524437

For the above two issues, I think we just fix the return value in parse_domain_for_nodes function enough.

@jiangfucheng we need your help to confirm if that fix would be right, thanks.(for our production, we need to reduce the risk)

According to your answer above, can we just add a check before return new_nodes inf local function parse_domain_for_nodes(nodes)?

local function parse_domain_for_nodes(nodes)
    .....

    if #new_nodes == 0 then
      return nil, "no valid nodes after dns resolver domain: " .. host
    end

    return new_nodes
Revolyssup commented 8 months ago

Closed by https://github.com/TencentBlueKing/blueking-apigateway-apisix/pull/52

wklken commented 8 months ago

@Revolyssup the issue should not be closed before the official repo fix the bug? and we still don't know the reason.

our own repo need to fix the production so we patched it. It will resolve the empty nodes problem!

Revolyssup commented 8 months ago

@wklken Yep I understand. The patch still doesn't fix the original issue, its kind of a hack

sheharyaar commented 5 months ago

Hi @wklken , can you please recheck if this is fixed or not. I think this may have been related to #10722 .

wklken commented 5 months ago

@sheharyaar ok, I will recheck it later

wklken commented 5 months ago

@sheharyaar Unfortunately, the bug is not fixed by #10722

Follow the reproduce steps here: https://github.com/apache/apisix/issues/10093#issuecomment-1724748805

  1. reproduce, 503 show
  2. patch the changes, apisix reload
  3. reproduce, 503 show
sheharyaar commented 5 months ago

Ok, I will have a look

Revolyssup commented 3 days ago

@wklken I have retried the reproduction steps several times but still cannot reproduce the issue. I do not get 503 after the normal operation is resumed.

Revolyssup commented 3 days ago

Today I tried to reproduce this issue from all the ways possible along with the exact steps given in the issue . But this problem doesn't come and from the code logic also it doesnt look like the error will come. parse_domain_for_nodes function will never get empty nodes list which seem to be causing the problem. route.value.upstream.nodes is never directly updated so parse_domain_for_nodes will always get the httpbin.org in nodes, and as soon as the dns server resumes, the node value is retrieved and the route.dns_value is updated.

Revolyssup commented 3 days ago

@wklken I have followed your exact reproduction steps multiple times but I never got the error.

wklken commented 2 days ago

@Revolyssup did you follow those steps here? https://github.com/apache/apisix/issues/10093#issuecomment-1724748805

and the apisix version should be 3.2.1;


we just patch it https://github.com/TencentBlueKing/blueking-apigateway-apisix/blob/master/src/build/patches/002_upstream_parse_domain_for_nodes.patch and it's on our production for about 10 months, and it works.

Revolyssup commented 2 days ago

@Revolyssup did you follow those steps here? https://github.com/apache/apisix/issues/10093#issuecomment-1724748805

and the apisix version should be 3.2.1;


we just patch it https://github.com/TencentBlueKing/blueking-apigateway-apisix/blob/master/src/build/patches/002_upstream_parse_domain_for_nodes.patch and it's on our production for about 10 months, and it works.

Yes, I had followed the steps.

Revolyssup commented 2 days ago

@wklken Just want to confirm the behaviour after this patch. So before the patch when empty nodes were returned, you were intermittently getting 503 error with the requests. But now when you errror out early, you still get 500 error for "no valid ip found". But the subsequent requests when the dns comes back up, no error is found. right?

Basically you have covered the case that nodes will never be set as empty.

wklken commented 2 days ago

https://github.com/apache/apisix/issues/10093#issue-1869257753

The requests will be all fail, while the dns resolution not ok failed to parse domain: test.com, error: failed to query the DNS server: dns client error: 101 empty record received;

But when the dns server is stable, the dns resolution of apisix did not resume immediately, the requests got error [lua] init.lua:486: handle_upstream(): failed to set upstream: no valid upstream node for hours, until we do apisix reload, the dns turns ok.

It’s always showing a 503 error even if the DNS is okay. I think there may be some cache that makes APISIX unaware of the DNS resume.

Revolyssup commented 2 days ago

#10093 (comment)

The requests will be all fail, while the dns resolution not ok failed to parse domain: test.com, error: failed to query the DNS server: dns client error: 101 empty record received;

But when the dns server is stable, the dns resolution of apisix did not resume immediately, the requests got error [lua] init.lua:486: handle_upstream(): failed to set upstream: no valid upstream node for hours, until we do apisix reload, the dns turns ok.

It’s always showing a 503 error even if the DNS is okay. I think there may be some cache that makes APISIX unaware of the DNS resume.

@wklken I meant to ask the behavior in your patch. After your patch, you do not get the 503 after dns server resumes. Right?

wklken commented 2 days ago

@Revolyssup Yes, with the patch, no 503 shows

Revolyssup commented 2 days ago

I haven't been able to reproduce but by the error and how your patch fixes the error and the logs above that you showed, here is the conclusion so far:

Somehow the nodes in the original saved configuration are set to empty. Since this goes away after a reload means the etcd still has correct value. Since I cant reproduce, I am going through the code to find the root cause currently.

Revolyssup commented 2 days ago

cache issue is unlikely because api_ctx is created newly for each new request. So for next request, new api_ctx will be created and the matched route should contain the correct routes configuration with the nodes field.

Revolyssup commented 2 days ago

Screenshot from 2024-07-03 16-59-17 Tried multiple times following the exact steps with docker compose given above, still getting no 503s.

Same is the case when tried locally.

Revolyssup commented 1 day ago

Since after trying multiple times, this cannot be reproduced. I am closing the issue. If anyone else is able to reproduce this bug again then please reopen with the exact steps you folllowed.