apache / apisix

The Cloud-Native API Gateway
https://apisix.apache.org/blog/
Apache License 2.0

docs: set an upstream as a http fallback server #9703

Closed ryan4yin closed 1 year ago

ryan4yin commented 1 year ago

Description

Our usage scenario is that we want to use APISIX to handle the transition between the new system and the old one.

Because the new system may have performance or stability problems after running for a long time, to ensure the availability of the whole system we implemented a deployment method that makes APISIX pass requests to the new system by default and uses the old system as a fallback server.

Considering that others might have the same need, I created this issue to record it and to discuss the possibility of adding it to APISIX's FAQ.

related to:

@tzssangglass helped me to implement this feature, thanks again!

How to implement this

The whole workaround is described below.

First, create an upstream and set the old system's priority to -1, so the old system will be marked as a backup server. It will only receive requests when the primary servers (the new system) are unavailable.

curl -i -X PUT http://127.0.0.1:9180/apisix/admin/upstreams  -H "X-API-KEY: ${API_KEY}" -d '
{
    "id": "xxx-with-fallback",
    "desc": "xxx's upstream with the old system as backup nodes",
    "scheme": "http",
    "type": "roundrobin",
    "keepalive_pool": {
        "size": 200,
        "idle_timeout": 75,
        "requests": 1000
    },
    "nodes": [
        {
            "host": "<the-domain-of-new-system-1>",
            "port": 8080,
            "weight": 0
        },
        {
            "host": "<the-domain-of-new-system-2>",
            "port": 8080,
            "weight": 0
        },
        {
            "host": "<the-domain-of-the-old-system-1>",
            "port": 8080,
            "weight": 0,
            "priority": -1
        },
        {
            "host": "<the-domain-of-the-old-system-2>",
            "port": 8080,
            "weight": 0,
            "priority": -1
        }
    ],
    "retries": 1,
    "retry_timeout": 3,
    "timeout": {
        "connect": 1,
        "send": 1,
        "read": 1
    },
    # ... other configs ...
}'
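
For completeness, the upstream still has to be referenced by a route before it serves traffic. A minimal sketch; the route id "xxx-route" and uri "/api/*" are placeholders, not part of the original setup:

```shell
# Bind a route to the fallback-enabled upstream created above.
# The route id and uri below are hypothetical placeholders.
curl -i -X PUT http://127.0.0.1:9180/apisix/admin/routes/xxx-route \
  -H "X-API-KEY: ${API_KEY}" -d '
{
    "uri": "/api/*",
    "upstream_id": "xxx-with-fallback"
}'
```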

Then we need to define in which scenarios we consider the new system unavailable, so that requests are passed to the old system. To achieve this goal, we add the following configuration to APISIX's config.yaml:

apisix:
  node_listen: 8080             # APISIX listening port
  enable_heartbeat: true
  enable_admin: true
  enable_admin_cors: true
  enable_debug: false

  # ... other configurations

nginx_config:
  http_server_configuration_snippet: |
    # Add custom Nginx http server configuration to nginx.conf.
    # The configuration should be well indented!

    # Specifies in which cases a request should be passed to the next server (the fallback server with priority=-1):
    proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504 non_idempotent;
    # In case you want to limit the number of tries for passing a request to the next server (the fallback, i.e. the old system), enable this line:
    # proxy_next_upstream_tries 1;

  # ... other configurations

With these two configurations, we can implement the fallback feature described above using APISIX. Generally, if there is a problem with the new system, the retry mechanism we configured here is triggered, so requests can still be processed properly and no users are affected.

Drawbacks

This implementation is really helpful for me, but there are also some drawbacks:

  1. APISIX's prometheus plugin does not seem to have any metrics about retries, so I cannot monitor how frequently retries happen through metrics. I have to analyse retries through the access logs, because they record all the response statuses in order.
  2. The unit of the upstream's timeout is seconds, so the minimal timeout for the primary server is 1 second: all requests will get stuck for 1 second before falling back to the fallback server. This is too long for high traffic, and it really hurts if the primary system goes down completely! It would be far better if we could specify something like 100ms, just like the timeout parameters in proxy-mirror:
    plugin_attr:
      proxy-mirror:
        timeout:
          connect: 2000ms
          read: 2000ms
          send: 2000ms
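
Regarding drawback 1, the access-log analysis can be partially scripted. A minimal sketch, assuming the access log records Nginx's $upstream_status variable, which lists one status per tried server (e.g. "502 : 200" when a retry happened); the log format shown here is a hypothetical simplification:

```shell
# Count retried requests: a retry shows up as multiple statuses
# separated by " : " in the upstream status field.
printf '%s\n' \
  'upstream_status: "200"' \
  'upstream_status: "502 : 200"' \
  'upstream_status: "504 : 200"' |
awk '/upstream_status: "[0-9]+ : / { retried++ } END { print retried+0 " retried" }'
# prints: 2 retried
```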
shreemaan-abhishek commented 1 year ago

the unit of upstream's timeout is seconds, so the minimal timeout for the primary server is 1 second, all requests will get stuck for 1 second before fallback to the fallback server, which is too long for high traffic, it really hurts if the primary system goes down completely!

We can make the unit of upstream timeout in milliseconds to fix this. It should be a simple fix.

ryan4yin commented 1 year ago

@shreemaan-abhishek I'm really looking forward to this ❤️

shreemaan-abhishek commented 1 year ago

I just checked the code and realised that the minimal timeout should be greater than zero.

https://github.com/shreemaan-abhishek/apisix/blob/b14914f8e6849992ad41534a773d215ff07d19be/apisix/schema_def.lua#L113-L121

i.e. if you provide the timeout as {"connect": 0.1, "send": 0.1, "read": 0.1}, the effective timeout duration would be 100ms.
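
If fractional values are indeed accepted by the schema, the upstream from the earlier example could be updated accordingly; a sketch using PATCH on the Admin API (the upstream id matches the earlier example):

```shell
# Lower the primary-server timeouts to 100ms so the fallback kicks in faster.
curl -i -X PATCH http://127.0.0.1:9180/apisix/admin/upstreams/xxx-with-fallback \
  -H "X-API-KEY: ${API_KEY}" -d '
{
    "timeout": {
        "connect": 0.1,
        "send": 0.1,
        "read": 0.1
    }
}'
```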

ryan4yin commented 1 year ago

@shreemaan-abhishek Ok, maybe I got it wrong. I'll take the time to confirm that.

shreemaan-abhishek commented 1 year ago

@ryan4yin do you have any further updates/questions? If not please close this issue. Thanks.