With two upstreams, the first upstream did not resume immediately, and it seems `parse_domain_in_route` does not know the DNS server has resumed. The comment on it (from https://github.com/apache/apisix/commit/ce4d8fb1e55f94c03c3a641f8f0d1a62c189ceaf) reads:

`-- don't modify the modifiedIndex to avoid plugin cache miss because of DNS resolve result has changed`

while `parse_domain_in_up` does modify the `modifiedIndex`.
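To make that asymmetry concrete, here is a standalone sketch in plain Lua. Only the two function names and the `modifiedIndex` behaviour come from the thread; the bodies are illustrative stubs, not the APISIX source:

```lua
-- stand-in for core.resolver.parse_domain; always "succeeds"
local function resolve_stub(host)
    return "54.175.87.239"
end

local function parse_domain_in_route(route)
    local ip = resolve_stub(route.value.upstream.nodes[1].host)
    route.dns_value = { nodes = { { host = ip } } }
    -- modifiedIndex deliberately NOT touched: anything cached under the
    -- old index cannot tell that the resolved dns_value has changed
    return route
end

local function parse_domain_in_up(up)
    local ip = resolve_stub(up.value.nodes[1].host)
    up.dns_value = { nodes = { { host = ip } } }
    up.modifiedIndex = up.modifiedIndex + 1  -- caches keyed on it are invalidated
    return up
end

local route = { modifiedIndex = 1,
                value = { upstream = { nodes = { { host = "httpbin.org" } } } } }
local up    = { modifiedIndex = 1,
                value = { nodes = { { host = "httpbin.org" } } } }
print(parse_domain_in_route(route).modifiedIndex)  -- 1: unchanged
print(parse_domain_in_up(up).modifiedIndex)        -- 2: bumped
```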
Hi, thank you for your report. Could you give us a more detailed description of your configuration?
@jiangfucheng I have reproduced it. Just mock a DNS failure in Lua (not 100% reproducible; you should retry a few times). Set the log level to `info` and create a route to `/get` with one upstream and one node such as httpbin.org; the parsed data in the log is `"nodes":[{"host":"54.175.87.239","priority":1,"domain":"httpbin.org","port":80,"weight":100,"upstream_host":"httpbin.org"}]`.
Edit `apisix/utils/upstream.lua` to mock a random DNS error; note that `1693542089` is `now() + 60`:
```lua
local function parse_domain_for_nodes(nodes)
    local new_nodes = core.table.new(#nodes, 0)
    for _, node in ipairs(nodes) do
        local host = node.host
        if not ipmatcher.parse_ipv4(host) and
           not ipmatcher.parse_ipv6(host) then
            -- mocked failure: about half of the resolutions return an
            -- empty node list before the deadline
            if host == "httpbin.org" and ngx_now() <= 1693542089 then
                if random(1, 10) % 2 == 0 then
                    core.log.error("101 empty records")
                    return new_nodes
                end
                -- return nil, "failed to query the DNS server: timeout11"
            end
            local ip, err = core.resolver.parse_domain(host)
            if ip then
                local new_node = core.table.clone(node)
                new_node.host = ip
                new_node.domain = host
                core.table.insert(new_nodes, new_node)
            end
            if err then
                core.log.error("dns resolver domain: ", host, " error: ", err)
            end
        else
            -- IP literal: keep the node as-is
            core.table.insert(new_nodes, node)
        end
    end
    return new_nodes
end
```
Then run `apisix reload`, benchmark with `wrk -c2 -t2 -d80s http://{}/get`, and afterwards run `curl http://{}/get` about 10 times: you get a mix of 200 and 503. When a 503 occurs, there are no DNS logs before `no valid upstream node`:
2023/09/01 04:24:20 [info] 140701#140701: *15533385 [lua] radixtree_uri_with_parameter.lua:53: match(): route match mode: radixtree_uri_with_parameter
2023/09/01 04:24:20 [info] 140701#140701: *15533385 [lua] init.lua:572: http_access_phase(): matched route: {"key":"/bk-gateway-apisix/routes/test.prod.2423","modifiedIndex":532500,"clean_handlers":"table: 0x7f7d645c64c8","has_domain":true,"value":{"create_time":1693309224,"methods":["GET"],"service_id":"test.prod.stage-12","update_time":1693539224,"timeout":{"send":150,"read":150,"connect":150},"labels":{"gateway.bk.tencent.com/gateway":"test","gateway.bk.tencent.com/stage":"prod"},"uris":["/api/test/prod/dns","/api/test/prod/dns/"],"status":1,"priority":0,"name":"test-prod-dns","id":"test.prod.2423","plugins":{"bk-proxy-rewrite":{"match_subpath":false,"subpath_param_name":":ext","method":"GET","use_real_request_uri_unsafe":false,"uri":"/get"},"bk-resource-context":{"bk_resource_auth":{"resource_perm_required":false,"skip_user_verification":false,"verified_app_required":false,"verified_user_required":false},"bk_resource_id":2423,"bk_resource_name":"dns","bk_resource_auth_obj":{"resource_perm_required":false,"skip_user_verification":false,"verified_app_required":false,"verified_user_required":false}}},"upstream":{"nodes":{},"hash_on":"vars","scheme":"http","parent":{"key":"/bk-gateway-apisix/routes/test.prod.2423","modifiedIndex":532500,"clean_handlers":{},"has_domain":true,"value":"table: 0x7f7d9706c708","update_count":0,"orig_modifiedIndex":532500,"createdIndex":532332},"type":"roundrobin","pass_host":"node","timeout":{"send":150,"read":150,"connect":150}}},"update_count":0,"orig_modifiedIndex":532500,"createdIndex":532332
2023/09/01 04:24:20 [info] 140701#140701: *15533385 [lua] plugin.lua:630: merge_service_route(): route conf: {"key":"/bk-gateway-apisix/routes/test.prod.2423","modifiedIndex":532500,"clean_handlers":"table: 0x7f7d645c64c8","has_domain":true,"value":{"create_time":1693309224,"methods":["GET"],"service_id":"test.prod.stage-12","update_time":1693539224,"timeout":{"send":150,"read":150,"connect":150},"labels":{"gateway.bk.tencent.com/gateway":"test","gateway.bk.tencent.com/stage":"prod"},"uris":["/api/test/prod/dns","/api/test/prod/dns/"],"status":1,"priority":0,"name":"test-prod-dns","id":"test.prod.2423","plugins":{"bk-proxy-rewrite":{"match_subpath":false,"subpath_param_name":":ext","method":"GET","use_real_request_uri_unsafe":false,"uri":"/get"},"bk-resource-context":{"bk_resource_auth":{"resource_perm_required":false,"skip_user_verification":false,"verified_app_required":false,"verified_user_required":false},"bk_resource_id":2423,"bk_resource_name":"dns","bk_resource_auth_obj":{"resource_perm_required":false,"skip_user_verification":false,"verified_app_required":false,"verified_user_required":false}}},"upstream":{"nodes":{},"hash_on":"vars","scheme":"http","parent":{"key":"/bk-gateway-apisix/routes/test.prod.2423","modifiedIndex":532500,"clean_handlers":{},"has_domain":true,"value":"table: 0x7f7d9706c708","update_count":0,"orig_modifiedIndex":532500,"createdIndex":532332},"type":"roundrobin","pass_host":"node","timeout":{"send":150,"read":150,"connect":150}}},"update_count":0,"orig_modifiedIndex":532500,"createdIndex":532332}
2023/09/01 04:24:21 [error] 140701#140701: *15534047 [lua] init.lua:486: handle_upstream(): failed to set upstream: no valid upstream node
The diff of the route conf between the 503 and 200 cases: in the 503 case `route.value.upstream.nodes` is empty, so `new_nodes` is empty (no DNS resolution happens), `compare_upstream_node` returns true, and the function returns early. So `route.value.upstream.nodes` is being set to an empty value somewhere in the code.
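A minimal standalone sketch of that loop of death, assuming a simplified `compare_upstream_node` (not APISIX's exact implementation):

```lua
-- once the cached nodes are empty, there is nothing to resolve, the "new"
-- result equals the old one, and the early return keeps the broken state
local function parse_domain_for_nodes(nodes)
    local new_nodes = {}
    for _, node in ipairs(nodes) do      -- empty input: loop never runs,
        new_nodes[#new_nodes + 1] = node -- so no resolve and no error log
    end
    return new_nodes
end

local function compare_upstream_node(old_nodes, new_nodes)
    return #old_nodes == #new_nodes      -- {} vs {} compares as "unchanged"
end

local cached = {}  -- route.value.upstream.nodes gone empty
local new_nodes = parse_domain_for_nodes(cached)
if compare_upstream_node(cached, new_nodes) then
    print("early return: the empty nodes are never repaired")  -- always taken
end
```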
@wklken Hi, could you paste the APISIX configurations? It's hard to reproduce this bug without the detailed configuration.
@jiangfucheng I am still trying to reproduce it with the apisix docker-compose, without success yet. I will update the issue when I succeed. (So far it can only be reproduced on our image built from apache/apisix:3.2.1-centos.)
@jiangfucheng Finally, it has been reproduced on the apisix docker-compose. It took a lot of time, and I added some scripts to help. Here are the steps; can you please help to investigate this? (This is the problem that is blocking our production release.)
1. `git clone https://github.com/apache/apisix-docker.git`
2. `vim apisix_conf/master/config.yaml`, change the `router` and `nginx_config.worker_processes`:

```yaml
apisix:
  node_listen: 9080    # APISIX listening port
  enable_ipv6: false
  router:
    http: radixtree_uri_with_parameter

nginx_config:
  worker_processes: 4
```
3. `vim docker-compose-master.yaml`, change the `image` to 3.2.1-centos:

```yaml
services:
  apisix:
    image: "apache/apisix:3.2.1-centos"
```

4. `docker-compose -p docker-apisix -f docker-compose-master.yaml up` to start apisix and etcd.

Steps 5-7 below are all done in the container (`docker exec -it dockerapisix_apisix_1 /bin/bash`). First check that `conf/config.yaml` is right (4 worker processes) and that the apisix version is 3.2.1.

5. `vi apisix/utils/upstream.lua` and add code to mock the DNS failure, so that 50% of resolutions return `101 empty records`:
```lua
-- add near the top of apisix/utils/upstream.lua (around line 22):
local random = math.random

-- the mock block is added just before the core.resolver.parse_domain(host)
-- call inside parse_domain_for_nodes:
local function parse_domain_for_nodes(nodes)
    local new_nodes = core.table.new(#nodes, 0)
    for _, node in ipairs(nodes) do
        local host = node.host
        if not ipmatcher.parse_ipv4(host) and
           not ipmatcher.parse_ipv6(host) then
            if host == "httpbin.org" and ngx_now() <= 1695089345 then
                if random(1, 10) % 2 == 0 then
                    core.log.error("101 empty records")
                    return new_nodes
                end
            end
            local ip, err = core.resolver.parse_domain(host)
            -- ... rest of the original function unchanged ...
```
6. Add script `create.sh` to register the route and service; check that the API_KEY is right:

```bash
#!/bin/bash
API_KEY="edd1c9f034335f136f87ad84b625c8f1"
ROUTE_ID="dns_route"
SERVICE_ID="dns_service"

curl http://127.0.0.1:9180/apisix/admin/services/${SERVICE_ID} -H "X-API-KEY: ${API_KEY}" -X PUT -d '
{
    "upstream": {
        "nodes": [
            { "host": "httpbin.org", "port": 80, "weight": 100, "priority": 1 }
        ],
        "type": "roundrobin",
        "scheme": "http",
        "pass_host": "node"
    }
}'

curl http://127.0.0.1:9180/apisix/admin/routes/${ROUTE_ID} -H "X-API-KEY: ${API_KEY}" -X PUT -d '
{
    "uri": "/api/test/prod/dns22",
    "methods": [ "GET" ],
    "plugins": {
        "proxy-rewrite": { "method": "GET", "uri": "/get" }
    },
    "upstream": {
        "nodes": [
            { "host": "httpbin.org", "port": 80, "weight": 100, "priority": 1 }
        ],
        "type": "roundrobin",
        "scheme": "http",
        "pass_host": "node"
    },
    "service_id": "dns_service",
    "status": 1
}'
```
7. Add script `update_lua.sh` to advance the mock timestamp quickly:

```bash
#!/bin/bash
now=$(date "+%s")
echo "now is: ${now}"
A=$(date -d "+30 seconds" "+%s")
echo "will change the condition to <= ${A}"
sed -i -r "s/<= ([0-9]+) then/<= ${A} then/g" apisix/utils/upstream.lua
echo $?
echo "change done"
apisix reload
```

Then run `bash -x create.sh` and verify the route with `curl -vv http://0.0.0.0:9080/api/test/prod/dns22`.
8. Not in the container, add script `start_and_check.sh` to run the benchmark and then check for 503:

```bash
#!/bin/bash
date
url="http://0.0.0.0:9080/api/test/prod/dns22"
echo "start bench wrk"
wrk -c2 -t2 -d35s ${url}
date
echo "sleep 5 s"
sleep 5
echo "check the status code 10 times"
for ((i=1; i<=10; i++))
do
    status_code=$(curl --write-out %{http_code} --silent --output /dev/null $url)
    echo "status=$status_code"
    if [ "${status_code}" -eq "503" ]
    then
        echo "503 show"
        exit
    fi
done
```
## Reproduce

> Open two windows to run steps 1 and 2 at the same time.
1. In the container, run:

```
$ ./update_lua.sh
now is: 1695089267
will change the condition to <= 1695089297
0
change done
/usr/local/openresty//luajit/bin/luajit ./apisix/cli/apisix.lua reload
WARNING: using fixed Admin API token has security risk. Please modify "admin_key" in conf/config.yaml .
2023/09/19 02:07:47 [notice] 157#157: signal process started
```
2. Outside the container, run:

```
$ bash start_and_check.sh
Tue Sep 19 10:08:36 CST 2023
start bench wrk
Running 35s test @ http://0.0.0.0:9080/api/test/prod/dns22
  2 threads and 2 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    15.22ms   80.61ms 823.04ms   95.99%
    Req/Sec     6.91k   807.96     7.77k    95.97%
  461582 requests in 35.00s, 202.93MB read
  Non-2xx or 3xx responses: 461575
Requests/sec:  13187.64
Transfer/sec:      5.80MB
Tue Sep 19 10:09:11 CST 2023
sleep 5 s
check the status code 10 times
status=503
503 show
```
3. If `503 show` appears, it has been reproduced! You can run `curl -vv http://0.0.0.0:9080/api/test/prod/dns22` a few times to check:

```
$ curl -vv http://0.0.0.0:9080/api/test/prod/dns22
> GET /api/test/prod/dns22 HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 0.0.0.0:9080
> Accept: */*
>
< HTTP/1.1 503 Service Temporarily Unavailable
< Date: Tue, 19 Sep 2023 02:21:58 GMT
< Content-Type: text/html; charset=utf-8
< Content-Length: 269
< Connection: keep-alive
< Server: APISIX/3.2.1
<
<html>
<head><title>503 Service Temporarily Unavailable</title></head>
<body>
<center><h1>503 Service Temporarily Unavailable</h1></center>
<hr><center>openresty</center>
<p><em>Powered by APISIX.</em></p></body>
</html>
```
4. If all requests return `200`, do steps 1 and 2 again (each round takes about 50 seconds, and the 503 does not show up 100% of the time, but after about 10 tries there is a high probability it will).
More clues for debugging:

- Only a route that has its own `route.upstream` and is also related to a `service` with an `upstream` triggers the problem.
- After an `apisix reload`, all requests return 200; before the reload, all requests return 503.
- When the 503 occurs there is only 1 log line: `2023/09/19 02:34:52 [error] 854#854: *112694 [lua] init.lua:486: handle_upstream(): failed to set upstream: no valid upstream node, client: 172.31.0.1, server: _, request: "GET /api/test/prod/dns22 HTTP/1.1", host: "0.0.0.0:9080"`; it did not call `parse_domain`, maybe some cache? (See the instrumentation sketch below.)
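One cheap way to settle the "maybe some cache?" question would be an unconditional log at the top of `parse_domain_for_nodes` in `apisix/utils/upstream.lua` (debug instrumentation of ours, not APISIX code):

```lua
local function parse_domain_for_nodes(nodes)
    -- fires on every invocation: its absence in the log would prove the
    -- function is being skipped, while "#nodes = 0" would prove it runs
    -- but has nothing to iterate over
    core.log.warn("parse_domain_for_nodes called, #nodes = ", #nodes)
    -- ... original body unchanged ...
end
```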
2023/09/19 06:26:24 [info] 50022#50022: *5076712 [lua] radixtree_uri_with_parameter.lua:53: match(): route match mode: radixtree_uri_with_parameter, client: 10.244.0.202, server: _, request: "GET /api/test/prod/dns22 HTTP/1.1", host: "bkapi.dev.test.com"
2023/09/19 06:26:24 [info] 50022#50022: *5076712 [lua] init.lua:572: http_access_phase(): matched route: {"key":"/bk-gateway-apisix/routes/dns_route","modifiedIndex":543015,"createdIndex":543015,"clean_handlers":{},"update_count":0,"orig_modifiedIndex":543015,"has_domain":true,"value":{"id":"dns_route","create_time":1695037859,"uri":"/api/test/prod/dns22","priority":0,"status":1,"update_time":1695037859,"service_id":"dns_service","methods":["GET"],"upstream":{"hash_on":"vars","scheme":"http","pass_host":"node","type":"roundrobin","nodes":{},"parent":{"key":"/bk-gateway-apisix/routes/dns_route","modifiedIndex":543015,"createdIndex":543015,"clean_handlers":"table: 0x7f666206ded0","update_count":0,"orig_modifiedIndex":543015,"has_domain":true,"value":"table: 0x7f66622f1d28"}},"plugins":{"proxy-rewrite":{"method":"GET","uri":"/get","use_real_request_uri_unsafe":false}}}}, client: 10.244.0.202, server: _, request: "GET /api/test/prod/dns22 HTTP/1.1", host: "bkapi.dev.test.com"
2023/09/19 06:26:24 [info] 50022#50022: *5076712 [lua] plugin.lua:629: merge_service_route(): service conf: {"key":"/bk-gateway-apisix/services/dns_service","modifiedIndex":543014,"createdIndex":543014,"clean_handlers":{},"has_domain":true,"value":{"update_time":1695037858,"id":"dns_service","upstream":{"hash_on":"vars","scheme":"http","pass_host":"node","type":"roundrobin","nodes":[{"port":80,"weight":100,"priority":1,"host":"httpbin.org"}],"parent":{"key":"/bk-gateway-apisix/services/dns_service","modifiedIndex":543014,"createdIndex":543014,"clean_handlers":"table: 0x7f66620322b8","has_domain":true,"value":"table: 0x7f66620684b0"}},"create_time":1695037858}}, client: 10.244.0.202, server: _, request: "GET /api/test/prod/dns22 HTTP/1.1", host: "bkapi.dev.test.com"
2023/09/19 06:26:24 [info] 50022#50022: *5076712 [lua] plugin.lua:630: merge_service_route(): route conf: {"key":"/bk-gateway-apisix/routes/dns_route","modifiedIndex":543015,"createdIndex":543015,"clean_handlers":{},"update_count":0,"orig_modifiedIndex":543015,"has_domain":true,"value":{"id":"dns_route","create_time":1695037859,"uri":"/api/test/prod/dns22","priority":0,"status":1,"update_time":1695037859,"service_id":"dns_service","methods":["GET"],"upstream":{"hash_on":"vars","scheme":"http","pass_host":"node","type":"roundrobin","nodes":{},"parent":{"key":"/bk-gateway-apisix/routes/dns_route","modifiedIndex":543015,"createdIndex":543015,"clean_handlers":"table: 0x7f666206ded0","update_count":0,"orig_modifiedIndex":543015,"has_domain":true,"value":"table: 0x7f66622f1d28"}},"plugins":{"proxy-rewrite":{"method":"GET","uri":"/get","use_real_request_uri_unsafe":false}}}}, client: 10.244.0.202, server: _, request: "GET /api/test/prod/dns22 HTTP/1.1", host: "bkapi.dev.test.com"
2023/09/19 06:26:24 [error] 50022#50022: *5076712 [lua] init.lua:486: handle_upstream(): failed to set upstream: no valid upstream node, client: 10.244.0.202, server: _, request: "GET /api/test/prod/dns22 HTTP/1.1", host: "bkapi.dev.test.com"
I added some logs and the picture is clearer: it never recovers from 503 until an `apisix reload`. In `apisix/init.lua`, the route is assigned the empty nodes here.
@jiangfucheng @Revolyssup Could someone please verify this?
@wklken Thanks for your detailed description. Based on it, I think there are two questions that need to be answered:

1. Why do we get `failed to set upstream: no valid upstream node` after the domain parse failed? I think you already found the reason: the `parse_domain_for_nodes()` function should return `nil, error` when the DNS parse fails.
2. Why does the `http_access_phase()` function print the matched route with an empty nodes field in your second request? https://github.com/apache/apisix/blob/9b2031a61c8af9c010a923eea89c0dd029e0d971/apisix/init.lua#L494-L504 I think it was caused by the line `api_ctx.matched_route = route`: it changed the route config in memory. (Note that I think it did not just change a copied table but changed the original route table; see the sketch below.)
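A standalone illustration of that suspicion in plain Lua (the tables are mock stand-ins for the shared route config):

```lua
-- storing a reference to the shared route table in api_ctx and then
-- mutating a nested field changes the one copy every later request sees
local shared_route = { value = { upstream = { nodes = { { host = "httpbin.org" } } } } }

local api_ctx = {}
api_ctx.matched_route = shared_route             -- a reference, not a copy
api_ctx.matched_route.value.upstream.nodes = {}  -- mutation hits shared_route

print(#shared_route.value.upstream.nodes)        -- 0: shared config corrupted
```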
For the above two issues, I think just fixing the return value of the `parse_domain_for_nodes` function is enough. For anything else, I will keep debugging the code and give my opinion.
One more question: in the description above you said `handle_upstream` never calls the `parse_domain_for_nodes` function on the second request. How did you reach that conclusion? As far as I can see, the `parse_domain_for_nodes` function does not log anything that would indicate it has been executed. (I have been quite busy with work recently; I will help troubleshoot this issue in my spare time as soon as possible.)
@jiangfucheng In the log you can see that @wklken has added, I think, some error logs on his side (`101 empty records`) inside the `parse_domain_for_nodes` function.

@wklken Can you show the snippet of the `parse_domain_for_nodes` function where this log is?
@jiangfucheng `api_ctx.matched_route` is per request, so even if it is set wrongly, the subsequent request will set it again. In the next request you can see that `api_ctx.matched_route` has empty nodes in its upstream right after it is populated by the `match()` function.

The whole issue boils down to: why does the matched_route log show no nodes? This is calculated for each request from etcd. I haven't found the exact flow that explains how the `nodes={}` from a previous upstream ends up in `api_ctx.matched_route`'s nodes when `api_ctx` is freshly created per request, nor where exactly the caching happens while matching the route.
The only caching culprit I see currently is this: https://github.com/apache/apisix/blob/57a7dcdac63a0bda55ddd41fc56a10c6144bfd51/apisix/http/router/radixtree_uri_with_parameter.lua#L38

But `_M.user_routes` is only modified directly by etcd.
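For readers following along, the linked code is roughly shaped like this (a paraphrased sketch, not the exact source): the router is rebuilt only when etcd bumps `conf_version`, so any route tables captured inside it survive until the next etcd change or an `apisix reload`:

```lua
local cached_version, uri_router

function _M.match(api_ctx)
    local user_routes = _M.user_routes  -- kept up to date by the etcd watcher
    if not cached_version or cached_version ~= user_routes.conf_version then
        -- rebuild the radixtree only when the etcd config version changes
        uri_router = base_router.create_radixtree_uri_router(user_routes.values)
        cached_version = user_routes.conf_version
    end
    return base_router.match_uri(uri_router, api_ctx)
end
```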
@wklken Just to confirm, it only happens when the route has service id, right?
@wklken I think `parse_domain_for_nodes` is called in the subsequent request, but there are no nodes to iterate over, so you don't get the log.
> @wklken Just to confirm, it only happens when the route has service id, right?

Yes, the route has an upstream and has a service_id (and the service has an upstream too).
> @jiangfucheng `api_ctx.matched_route` is per request so even if it is set wrong, the subsequent request will set it again. [...] The only caching culprit I see currently is this. But modification in `_M.user_routes` is only done directly by etcd.

You are right, it was my mistake: a new `api_ctx` is created when each request comes in; it is not reused.
> @wklken I think parse_domain_for_nodes is called in the subsequent request but there are no nodes to iterate over, so you don't get the log.

Agree with this opinion.
@wklken In the first request, did you get a log similar to `route items:` or `insert uri route:`?
> @wklken In the first request, did you get a log similar to `route items:` or `insert uri route:`?

My log level is `info`, and there is no `route items:` or `insert uri route:` log at the point where the first empty `nodes` appears.
@wklken Before the empty nodes came, did you find this log? When was the last time this log appeared?
2023/09/19 06:46:16 [info] 51339#51339: *5157502 [lua] route.lua:73: create_radixtree_uri_router(): insert uri route: {"id":"dns_route","service_id":"dns_service","create_time":1695037859,"uri":"/api/test/prod/dns22","status":1,"methods":["GET"],"update_time":1695037859,"upstream":{"type":"roundrobin","pass_host":"node","scheme":"http","nodes":[{"priority":1,"port":80,"host":"httpbin.org","weight":100}],"parent":{"has_domain":true,"createdIndex":543015,"value":{"id":"dns_route","service_id":"dns_service","create_time":1695037859,"uri":"/api/test/prod/dns22","status":1,"methods":"table: 0x7f66620fc778","update_time":1695037859,"upstream":"table: 0x7f6661b40ed0","priority":0,"plugins":"table: 0x7f666236e638"},"modifiedIndex":543015,"orig_modifiedIndex":543015,"key":"/bk-gateway-apisix/routes/dns_route","update_count":0,"clean_handlers":{}},"hash_on":"vars"},"priority":0,"plugins":{"proxy-rewrite":{"method":"GET","use_real_request_uri_unsafe":false,"uri":"/get"}}}\
..... followed by a lot of other `insert uri route` logs, then:

2023/09/19 06:46:17: empty nodes

@Revolyssup So the last `insert uri route` was 1 second before the empty nodes.
cc: @kingluo
https://github.com/apache/apisix/issues/10093#issuecomment-1725524437
> For the above two issues, I think just fixing the return value of the `parse_domain_for_nodes` function is enough.

@jiangfucheng We need your help to confirm that this fix would be right, thanks. (For our production, we need to reduce the risk.) According to your answer above, can we just add a check before `return new_nodes` in `local function parse_domain_for_nodes(nodes)`?
```lua
local function parse_domain_for_nodes(nodes)
    -- ..... (loop over the nodes as before)
    if #new_nodes == 0 then
        -- a generic message: the per-node `host` local is out of scope here
        return nil, "no valid node after resolving the domains"
    end
    return new_nodes
end
```
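The intended effect at the call site would be (a sketch, assuming the caller is `parse_domain_in_route` as discussed above):

```lua
local new_nodes, err = upstream_util.parse_domain_for_nodes(nodes)
if not new_nodes then
    -- before the patch: an empty {} fell through to compare_upstream_node()
    -- and could be cached as a "valid" result;
    -- after the patch: this request fails fast and the previously resolved
    -- dns_value stays untouched, so the next request retries DNS
    return nil, err
end
```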
@Revolyssup Should the issue not stay open until the official repo fixes the bug? We still don't know the root cause. Our own repo needed a production fix, so we patched it; the patch does resolve the empty `nodes` problem!
@wklken Yep, I understand. The patch still doesn't fix the original issue; it's kind of a hack.
Hi @wklken, can you please recheck whether this is fixed or not? I think this may have been related to #10722.
@sheharyaar ok, I will recheck it later
@sheharyaar Unfortunately, the bug is not fixed by #10722
Follow the reproduce steps here: https://github.com/apache/apisix/issues/10093#issuecomment-1724748805
Ok, I will have a look
@wklken I have retried the reproduction steps several times but still cannot reproduce the issue. I do not get a 503 after normal operation resumes.
Today I tried to reproduce this issue in every way possible, including the exact steps given in the issue, but the problem does not appear, and from the code logic it also does not look like the error can occur. The `parse_domain_for_nodes` function should never get the empty nodes list that seems to be causing the problem: `route.value.upstream.nodes` is never directly updated, so `parse_domain_for_nodes` will always get `httpbin.org` in `nodes`, and as soon as the DNS server resumes, the node value is retrieved and `route.dns_value` is updated.
@wklken I have followed your exact reproduction steps multiple times but I never got the error.
@Revolyssup Did you follow the steps here: https://github.com/apache/apisix/issues/10093#issuecomment-1724748805 ? And the apisix version should be 3.2.1. We just patched it (https://github.com/TencentBlueKing/blueking-apigateway-apisix/blob/master/src/build/patches/002_upstream_parse_domain_for_nodes.patch), and it has been running in our production for about 10 months and works.

> @Revolyssup did you follow those steps here? [...]

Yes, I had followed the steps.
@wklken Just to confirm the behaviour after this patch: before the patch, when empty nodes were returned, you were intermittently getting 503 errors. Now, when you error out early, you still get a 500 error for "no valid ip found", but the subsequent requests, once the DNS comes back up, show no error. Right? Basically you have covered the case so that nodes can never be set to empty.
https://github.com/apache/apisix/issues/10093#issue-1869257753

> The requests will all fail while the DNS resolution is not OK: `failed to parse domain: test.com, error: failed to query the DNS server: dns client error: 101 empty record received`. But when the DNS server is stable, the DNS resolution of apisix did not resume immediately; the requests got the error `[lua] init.lua:486: handle_upstream(): failed to set upstream: no valid upstream node` for hours, until we did an `apisix reload`, after which DNS turned OK.

It keeps showing a 503 error even when the DNS is OK. I think there may be some cache that makes APISIX unaware of the DNS resume.
@wklken I meant to ask about the behavior with your patch: after your patch, you do not get the 503 after the DNS server resumes, right?
@Revolyssup Yes, with the patch, no 503 shows
I haven't been able to reproduce it, but from the error, the way your patch fixes it, and the logs you showed above, here is the conclusion so far: somehow the nodes in the original saved configuration get set to empty. Since the problem goes away after a reload, etcd must still hold the correct value. As I can't reproduce it, I am currently going through the code to find the root cause. A cache issue is unlikely because `api_ctx` is created fresh for each request, so the next request's matched route should contain the correct route configuration with the `nodes` field.
I tried multiple times following the exact docker-compose steps given above and still get no 503s. The same holds when trying locally. Since it cannot be reproduced after multiple attempts, I am closing the issue. If anyone is able to reproduce this bug again, please reopen it with the exact steps you followed.
## Current Behavior

In our production, the DNS server reloads frequently. While DNS resolution is failing, all requests fail with `failed to parse domain: test.com, error: failed to query the DNS server: dns client error: 101 empty record received`. But when the DNS server is stable again, APISIX's DNS resolution does not resume immediately: the requests keep getting the error `[lua] init.lua:486: handle_upstream(): failed to set upstream: no valid upstream node` for hours, until we do an `apisix reload`, after which DNS turns OK. And from the logs, on one apisix instance with 4 worker processes, 3 processes got the DNS failure while 1 process got no error and all of its requests were OK.
In `apisix/core/dns/client.lua`, how does `client.resolve` detect that the DNS server is OK again?
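For illustration only, here is a hypothetical resolver wrapper showing why cached answers, if failures were cached with a TTL, would delay recovery; whether `client.resolve` actually behaves like this is exactly the question being asked:

```lua
local cache = {}  -- host -> { result = ..., expire_at = ... }

local function resolve(host, ttl, do_query)
    local hit = cache[host]
    if hit and os.time() < hit.expire_at then
        return hit.result  -- may still be a stale or negative answer
    end
    local result = do_query(host)  -- the real DNS lookup
    cache[host] = { result = result, expire_at = os.time() + ttl }
    return result
end
```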
## Expected Behavior

DNS resolution resumes immediately.
## Error Logs

- When the DNS server is reloading: the `failed to parse domain ... 101 empty record received` errors shown above.
- When the DNS server is stable: no `dns` error logs (why?).

So when the DNS server is stable, `parse_domain` succeeds and there are no error logs, yet `handle_upstream` still gets `no valid upstream node`; and after an `apisix reload`, there are no error logs either. Why does `set_upstream` fail?

## Steps to Reproduce
For now, I cannot reproduce it, but it has occurred three times in our production. Any clues or advice on making the DNS server reload frequently?
## Environment

- APISIX version (run `apisix version`): 3.2.1
- Operating system (run `uname -a`):
- OpenResty / Nginx version (run `openresty -V` or `nginx -V`): nginx version: openresty/1.21.4.1
- etcd version, if relevant (run `curl http://127.0.0.1:9090/v1/server_info`):
- LuaRocks version, if relevant (run `luarocks --version`):