apache / apisix

The Cloud-Native API Gateway
https://apisix.apache.org/blog/
Apache License 2.0

help request: apisix don't sync data from etcd #11390

Open jujiale opened 3 months ago

jujiale commented 3 months ago

Description

Hello, I ran into the following situation in our production (prd) APISIX cluster and on one dev APISIX node. Our prd setup has 4 environments, each with 3 APISIX instances deployed via RPM. One cluster (we call it A here) shows odd behavior, which I describe below:

1. We modify the cluster A config in apisix-dashboard and submit it. In etcd, I can see that the change is stored correctly, but when I call /v1/route/route_id, the instance in cluster A still returns the old version of the whole config. No matter how many times we modify the config, the value in etcd is correct and its update_time is correct, but the config in the instance stays old and its update_time never changes. For example, the config in etcd:

```
/test/apisix/routes/515483732765836994
{"id":"515483732765836994","create_time":1716781847,"update_time":1720085553,"uris":["/menu.service.query/m","/menu.service.query/pm/*"],"name":"aaa","priority":10,"methods":["GET","POST","PUT","DELETE","PATCH","HEAD","OPTIONS","CONNECT","TRACE"],"host":"xxx.com","upstream_id":"515483516306196172","status":1}
```

When I call /v1/route/route_id, the instance returns the following config:

```
{
    "key": "/test/apisix/routes/515483732765836994",
    "createdIndex": 946,
    "has_domain": false,
    "clean_handlers": {},
    "modifiedIndex": 946,
    "update_count": 0,
    "orig_modifiedIndex": 946,
    "value": {
        "priority": 10,
        "host": "xxx.com",
        "name": "aaa",
        "methods": [
            "GET",
            "POST",
            "PUT",
            "DELETE",
            "PATCH",
            "HEAD",
            "OPTIONS",
            "CONNECT",
            "TRACE"
        ],
        "id": "515483732765836994",
        "uris": [
            "/menu.service.query/m",
            "/menu.service.query/w",
            "/menu.service.query/pm/*"
        ],
        "update_time": 1716781847,
        "create_time": 1716781847,
        "status": 1,
        "upstream_id": "515483516306196172"
    }
}
```

We can see that the uris and the update_time differ between etcd and the instance. In the other clusters, everything works well.
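To make the drift explicit, here is a minimal sketch that compares the two payloads quoted above field by field. The helper name `diff_fields` is hypothetical; the data is copied from the etcd value and from the `value` object returned by the instance.

```python
import json

# Value stored in etcd for the route (copied from the report above).
etcd_value = json.loads(
    '{"id":"515483732765836994","create_time":1716781847,'
    '"update_time":1720085553,'
    '"uris":["/menu.service.query/m","/menu.service.query/pm/*"],'
    '"name":"aaa","priority":10,'
    '"methods":["GET","POST","PUT","DELETE","PATCH","HEAD","OPTIONS","CONNECT","TRACE"],'
    '"host":"xxx.com","upstream_id":"515483516306196172","status":1}'
)

# "value" object returned by the instance (copied from the report above).
instance_value = {
    "priority": 10,
    "host": "xxx.com",
    "name": "aaa",
    "methods": ["GET", "POST", "PUT", "DELETE", "PATCH",
                "HEAD", "OPTIONS", "CONNECT", "TRACE"],
    "id": "515483732765836994",
    "uris": ["/menu.service.query/m", "/menu.service.query/w",
             "/menu.service.query/pm/*"],
    "update_time": 1716781847,
    "create_time": 1716781847,
    "status": 1,
    "upstream_id": "515483516306196172",
}


def diff_fields(expected, actual):
    """Return {field: (expected, actual)} for every field whose value differs."""
    diffs = {}
    for key in sorted(set(expected) | set(actual)):
        if expected.get(key) != actual.get(key):
            diffs[key] = (expected.get(key), actual.get(key))
    return diffs


stale = diff_fields(etcd_value, instance_value)
for field, (in_etcd, in_instance) in stale.items():
    print(f"{field}: etcd={in_etcd!r} instance={in_instance!r}")
```

Only `update_time` and `uris` are reported, which matches the observation that the instance still holds the version from the route's creation time.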

2. The APISIX error log shows the following. Note that these errors are logged continuously, so the issue seems to occur all the time.

```
172.xx.61.52, server: _, request: "POST /menu.service.query/w HTTP/1.1", host: "xxx.com"
2024/07/04 16:00:55 [error] 16235#16235: *143253446 [lua] config_util.lua:86: failed to find clean_handler with idx 1, client: 172.xx.61.47, server: _, request: "POST /menu.service.query/w HTTP/1.1", host: "xxx.com"
2024/07/04 16:00:55 [error] 16234#16234: *143283913 [lua] config_etcd.lua:584: failed to fetch data from etcd: /test/apisix/apisix/core/config_util.lua:104: attempt to index local 'item' (a boolean value)
stack traceback:
  /test/apisix/apisix/core/config_util.lua:104: in function 'fire_all_clean_handlers'
  /test/apisix/apisix/core/config_etcd.lua:315: in function 'sync_data'
  /test/apisix/apisix/core/config_etcd.lua:541: in function </test/apisix/apisix/core/config_etcd.lua:532>
  [C]: in function 'xpcall'
  /test/apisix/apisix/core/config_etcd.lua:532: in function </test/apisix/apisix/core/config_etcd.lua:513>,  etcd key: /test/apisix/upstreams, context: ngx.timer
2024/07/04 16:00:55 [error] 16235#16235: *143280176 [lua] config_util.lua:86: failed to find clean_handler with idx 1, client: 172.xx.61.47, server: _, request: "POST /menu.service.validate/w HTTP/1.1", host: "xxx.com"
2024/07/04 16:00:55 [error] 16240#16240: *143284010 [lua] config_etcd.lua:584: failed to fetch data from etcd: /test/apisix/apisix/core/config_util.lua:104: attempt to index local 'item' (a boolean value)
stack traceback:
  /test/apisix/apisix/core/config_util.lua:104: in function 'fire_all_clean_handlers'
  /test/apisix/apisix/core/config_etcd.lua:315: in function 'sync_data'
  /test/apisix/apisix/core/config_etcd.lua:541: in function </test/apisix/apisix/core/config_etcd.lua:532>
  [C]: in function 'xpcall'
  /test/apisix/apisix/core/config_etcd.lua:532: in function </test/apisix/apisix/core/config_etcd.lua:513>,  etcd key: /test/apisix/janus/routes, context: ngx.timer
```
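Since the tracebacks end with the etcd key whose sync failed, it can help to tally which keys are affected. The following is a small sketch that assumes the log format shown above (`etcd key: <key>, context: ngx.timer`); the sample text is an excerpt of the lines quoted in this report.

```python
import re
from collections import Counter

# Excerpt of the error log quoted above (first and last line of each traceback).
LOG = """\
2024/07/04 16:00:55 [error] 16234#16234: *143283913 [lua] config_etcd.lua:584: failed to fetch data from etcd: /test/apisix/apisix/core/config_util.lua:104: attempt to index local 'item' (a boolean value)
  /test/apisix/apisix/core/config_etcd.lua:532: in function </test/apisix/apisix/core/config_etcd.lua:513>,  etcd key: /test/apisix/upstreams, context: ngx.timer
2024/07/04 16:00:55 [error] 16240#16240: *143284010 [lua] config_etcd.lua:584: failed to fetch data from etcd: /test/apisix/apisix/core/config_util.lua:104: attempt to index local 'item' (a boolean value)
  /test/apisix/apisix/core/config_etcd.lua:532: in function </test/apisix/apisix/core/config_etcd.lua:513>,  etcd key: /test/apisix/janus/routes, context: ngx.timer
"""

# The key runs up to the next comma or whitespace character.
ETCD_KEY = re.compile(r"etcd key: ([^,\s]+)")

failing_keys = Counter(ETCD_KEY.findall(LOG))
sync_failures = LOG.count("failed to fetch data from etcd")

for key, count in failing_keys.most_common():
    print(f"{count} failed sync(s) for {key}")
```

On a real error log, the same pattern would show whether the failure is limited to a few resource types or affects every watched key.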

3. Capturing the traffic on port 2379 on the APISIX instance shows:

```
66
{"error":{"grpc_code":1,"http_code":408,"message":"context canceled","http_status":"Request Timeout"}}
0
```

I also found many requests that time out after more than 30s, as shown in the attached image.
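One plausible reading of the capture, not confirmed in this thread, is that the `66` and `0` lines are HTTP/1.1 chunked transfer encoding markers: `66` is hexadecimal for 102, which is exactly the length of the JSON body, and `0` is the terminating chunk. The sketch below decodes such a body under that assumption; `decode_chunked` is a hypothetical helper that ignores trailers and chunk extensions.

```python
import json


def decode_chunked(body: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked body (no trailers, no chunk extensions)."""
    out = bytearray()
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        size = int(body[pos:line_end].decode("ascii"), 16)  # chunk size is hex
        pos = line_end + 2
        if size == 0:
            break  # a zero-sized chunk ends the body
        out += body[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return bytes(out)


# JSON body copied from the capture above.
payload = (b'{"error":{"grpc_code":1,"http_code":408,'
           b'"message":"context canceled","http_status":"Request Timeout"}}')
# Assumed wire form: hex size line, data, then the terminating zero chunk.
wire = b"66\r\n" + payload + b"\r\n0\r\n\r\n"

decoded = decode_chunked(wire)
error = json.loads(decoded)["error"]
print(len(decoded), error["http_code"], error["message"])
```

In gRPC status codes, 1 means CANCELLED, so under this reading the response reports a canceled watch or request context rather than a malformed payload.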

4. I can confirm that etcd is healthy; even after I restart etcd, the problem persists. The network between APISIX and etcd is fine, and some /v3/watch requests return correctly, but APISIX does not seem to apply the config.

Because we run 2.15.0 in the prd environment, we cannot upgrade it casually.

I would like to know whether this is an APISIX bug. If it is, we plan to merge some changes to fix it. I would also like to know why the config is not synced to the APISIX instances.

Environment

jujiale commented 3 months ago

I found that #8493 has the same error log, but it does not seem to mention the data sync issue, so I don't know whether it is the same issue.

jujiale commented 3 months ago

I tried changing the call config_util.fire_all_clean_handlers(val) in config_etcd.lua to config_util.fire_all_clean_handlers(false). This reproduces the same error I mentioned above, and the data in etcd and in APISIX is then no longer the same.
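The experiment above suggests a failure mode that can be illustrated with a toy model. This is not APISIX code: `ToyConfigCache`, its methods, and the `guard` flag are hypothetical, and it assumes that a cached slot may hold `False` and that fresh data replaces the cache only after cleanup has succeeded.

```python
class ToyConfigCache:
    """Hypothetical stand-in for a watcher's local cache (not APISIX code)."""

    def __init__(self, values):
        # Assumption: a slot may hold False instead of a config item.
        self.values = values

    def fire_all_clean_handlers(self, item):
        # Indexing a boolean raises, like "attempt to index local 'item'".
        for handler in item["clean_handlers"]:
            handler()

    def sync(self, fresh_values, guard=False):
        for item in self.values:
            if guard and not item:
                continue  # skip boolean placeholders
            self.fire_all_clean_handlers(item)
        # Only reached if every cleanup call succeeded.
        self.values = fresh_values


old = [{"id": "r1", "update_time": 1716781847, "clean_handlers": []}, False]
new = [{"id": "r1", "update_time": 1720085553, "clean_handlers": []}]

cache = ToyConfigCache(list(old))
try:
    cache.sync(new)
except TypeError as exc:
    print("sync failed:", exc)
stale_time = cache.values[0]["update_time"]  # cache still holds the old data

cache.sync(new, guard=True)
fresh_time = cache.values[0]["update_time"]  # guarded sync applies the update
print(stale_time, fresh_time)
```

In this model, every unguarded sync attempt fails the same way, so the cache keeps serving the old config while etcd holds the new one, which is consistent with the symptoms reported here. Whether APISIX 2.15.0 behaves this way would need to be checked against its source.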

yydance commented 2 weeks ago

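I seem to have run into a similar problem today. A new route was added through the dashboard and stored in etcd correctly, but APISIX could never find the route. It only recovered after the original APISIX pod was deleted. So far I have not seen any related messages in the logs.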