3scale / APIcast

3scale API Gateway
Apache License 2.0
305 stars 171 forks source link

Ensure we don't schedule reloading configuration when reloading is disabled #1468

Closed coderbydesign closed 3 months ago

coderbydesign commented 4 months ago

Details of the issue as well as local reproducers are documented in https://github.com/3scale/APIcast/issues/1467.

This change ensures that when boot.init_worker checks to see whether or not the configuration needs to be reloaded, it also checks to confirm that the boot.ttl() value set from APICAST_CONFIGURATION_CACHE is a positive number, in order to avoid an immediate and infinite recursive loop of refreshing the configuration.

This should still preserve the initial fix [1] where this regression appears to have been introduced, as we should never schedule reloading with the configuration settings of:

APICAST_CONFIGURATION_LOADER: boot
APICAST_CONFIGURATION_CACHE: -1

We should also never have a situation where we'd need to account for interval = 0 since we fail out in boot.init in that case [2].

Tested locally as documented in the issue, however based on the comment in the initial fix PR, I've not added any tests in code. Happy to do so if there's now a good way to accomplish that!

[1] https://github.com/3scale/APIcast/pull/1399 [2] https://github.com/3scale/APIcast/blob/7e7eaf6f1d584c78d999c0d09f5b65203161c402/gateway/src/apicast/configuration_loader.lua#L146-L149

tkan145 commented 3 months ago

LGTM, @eguzki need your feedback as well

eguzki commented 3 months ago

Thanks @coderbydesign for the fix.

Your fix is indeed right. I overlooked completely the use case for APICAST_CONFIGURATION_CACHE=-1.

I want to propose an alternative fix, more readable IMHO:

diff --git a/gateway/src/apicast/configuration_loader.lua b/gateway/src/apicast/configuration_loader.lua
index f1eff77e..ec128fe9 100644
--- a/gateway/src/apicast/configuration_loader.lua
+++ b/gateway/src/apicast/configuration_loader.lua
@@ -204,22 +204,15 @@ function boot.init_worker(configuration)
     schedule(interval, handler, ...)
   end

-  -- Check whether the reserved boot configuration is fresh or stale.
-  -- If it is stale, refresh configuration
-  -- When a worker process is (re-)spawned,
-  -- it will start working with fresh (according the ttl semantics) configuration
-  local boot_reserved_hosts = configuration:find_by_host(boot_reserved_domain, false)
-  if(#boot_reserved_hosts == 0)
-  then
-    -- the boot configuration has expired, load fresh config
-    ngx.log(ngx.INFO, 'boot time configuration has expired')
-    -- ngx.socket.tcp is not available at the init or init_worker phases,
-    -- it needs to be scheduled (with delay = 0)
-    schedule(0, handler, configuration)
-  elseif(interval > 0)
-  then
-    ngx.log(ngx.DEBUG, 'schedule new configuration loading')
-    schedule(interval, handler, configuration)
+  if interval > 0 then
+      -- Check whether the reserved boot configuration is fresh or stale.
+      -- If it is stale, refresh configuration
+      -- When a worker process is (re-)spawned,
+      -- it will start working with fresh (according the ttl semantics) configuration
+      local boot_reserved_hosts = configuration:find_by_host(boot_reserved_domain, false)
+      ngx.log(ngx.DEBUG, 'schedule new configuration loading')
+      local curr_interval = #boot_reserved_hosts == 0 and 0 or interval
+      schedule(curr_interval, handler, configuration)
   else
     ngx.log(ngx.DEBUG, 'no scheduling for configuration loading')
   end

Verification steps

I have tested locally succesfully the following use cases:

Common initial setup for all use cases

make development
make dependencies 
cat <<EOF >config.json
{
  "services": [
    {
      "backend_version": "1",
      "proxy": {
        "hosts": [
          "example.com"
        ],
        "api_backend": "https://echo-api.3scale.net",
        "backend": {
          "endpoint": "http://127.0.0.1:8081",
          "host": "backend"
        },
        "proxy_rules": [
          {
            "http_method": "GET",
            "pattern": "/",
            "metric_system_name": "hits",
            "delta": 1,
            "parameters": [],
            "querystring_parameters": {}
          }
        ],
        "policy_chain": [
          {
            "name": "apicast.policy.apicast"
          }
        ]
      }
    }
  ]
}
EOF

You should see the (debug) log line about no scheduling for configuration loading:

2024/06/11 21:51:42 [debug] 249417#249417: *1 configuration_loader.lua:217: init_worker(): no scheduling for configuration loading

Even if you kill the worker process, the re-spawned worked process does not load config and uses the configuration at boot time

docker exec -ti -u root apicast_build_0-development-1 /bin/bash
bash-4.4# ls /proc | grep '[0-9]' | xargs -I {} sh -c '[ -a /proc/{}/cmdline ] && grep -q -P "nginx: worker process" /proc/{}/cmdline && echo killing pid {} && kill {}'

APICAst logs confirm the re-spawned worked process does not load config

2024/06/11 21:54:26 [notice] 249403#249403: signal 17 (SIGCHLD) received from 249417
2024/06/11 21:54:26 [notice] 249403#249403: worker process 249417 exited with code 0
2024/06/11 21:54:26 [notice] 249403#249403: start worker process 249558
2024/06/11 21:54:26 [notice] 249403#249403: signal 29 (SIGIO) received
2024/06/11 21:54:26 [debug] 249558#249558: *2 executor.lua:26: init_worker(): executor phase: init_worker
2024/06/11 21:54:26 [debug] 249558#249558: *2 policy_chain.lua:214: init_worker(): policy chain execute phase: init_worker, policy: Load Configuration, i: 1
2024/06/11 21:54:26 [debug] 249558#249558: *2 configuration_loader.lua:217: init_worker(): no scheduling for configuration loading

You should see the (debug) log line about schedule new configuration loading:

2024/06/11 21:55:48 [debug] 249600#249600: *1 configuration_loader.lua:213: init_worker(): schedule new configuration loading

And every 30 secs,

2024/06/11 21:56:48 [info] 249600#249600: *3 configuration_loader.lua:194: auto updating configuration, context: ngx.timer

The functionality implemented in https://github.com/3scale/APIcast/pull/1399/files still applies:

coderbydesign commented 3 months ago

Thanks for the eyes, @eguzki! I'm going to apply this patch locally and test for sanity as well, but then am happy to update this PR with the proposed change. I was actually curious if that entire conditional could be reworked, but didn't want to change too much without all the historical context. What you've explained makes total sense.

coderbydesign commented 3 months ago

Thanks for your patience on this, @eguzki. When you're able, see the latest update based on your suggestions.

eguzki commented 3 months ago

changes looks look to me.

Re-running the test as we have some (hateful) flaky tests

eguzki commented 3 months ago

@tkan145 we might want to backport this fix to 3scale-2.15-stable branch