3scale / APIcast

3scale API Gateway

[THREESCALE-9537] Configure batcher policy storage #1452

Closed tkan145 closed 6 months ago

tkan145 commented 7 months ago

What

Fixes: https://issues.redhat.com/browse/THREESCALE-9537

Dev notes

1. What shared dict value should we increase?

The 3scale batcher policy uses a few different shared dict caches:

lua_shared_dict cached_auths 20m;
lua_shared_dict batched_reports 20m;
lua_shared_dict batched_reports_locks 1m;

and api_keys if the Caching policy is included in the chain:

lua_shared_dict api_keys 30m;

First, let's run a small test to see how many entries 1m of shared cache can hold.

lua_shared_dict test 1m;
location = /t {
    content_by_lua_block {
        local rep = string.rep
        local dt = ngx.shared.test
        local val = rep("v", 15)  -- 15-byte value
        local key = rep("k", 32)  -- 32-byte credential used in the key
        local i = 0
        while i < 200000 do
            -- safe_add() fails with "no memory" instead of evicting older entries
            local ok, err = dt:safe_add("service_id:_" .. i .. ",user_key:" .. key .. ",metric:hits", val)
            if not ok then
                break
            end
            i = i + 1
        end
        ngx.say(i, " key/value pairs inserted")
    }
}

The reason to use safe_add() here is to avoid the automatic eviction of least-recently-used items upon memory shortage that set() performs.
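
For reference, here is a minimal sketch of the difference in behavior when the dict is full, using the same test dict as above (standard ngx.shared.DICT semantics):

local dt = ngx.shared.test

-- set() never fails for lack of space: it evicts least-recently-used
-- entries to make room and reports that via the third return value.
local ok, err, forcible = dt:set("some_key", "some_value")
if forcible then
    ngx.log(ngx.WARN, "set() evicted other entries to make room")
end

-- safe_add() refuses to evict anything: when the dict is full it
-- returns nil plus the error string "no memory".
local ok2, err2 = dt:safe_add("another_key", "another_value")
if not ok2 and err2 == "no memory" then
    ngx.log(ngx.ERR, "shared dict is full")
end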

Querying /t gives the following response body on my Linux x86_64 system. NOTE: you may get a different value, as this depends on the underlying architecture.

4033 key/value pairs inserted.

So a 1m store can hold 4033 key/value pairs with 57-byte keys and 15-byte values. In reality, the available space also depends on memory fragmentation, but since these key/value pairs are consistent in size, we should have no problem. More details here

Changing the "dict" store to 10m gives

40673 key/value pairs inserted.

So a 10m store can hold 40673 pairs. Growth is linear, as expected.
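
Rather than re-running the fill test, the usage of a shared dict can also be inspected directly on newer OpenResty builds. This is only a sketch: capacity() needs lua-resty-core, and free_space() additionally needs an nginx core >= 1.11.7.

local reports = ngx.shared.batched_reports

-- Total size of the zone, as declared by the lua_shared_dict directive (bytes).
local total = reports:capacity()

-- Free page space left in the zone. This counts whole free pages only,
-- so it can report 0 while small inserts still succeed.
local free = reports:free_space()

ngx.say("batched_reports: ", total, " bytes total, ", free, " bytes of free pages")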

| Name | Method | When full | TTL | Growth |
|------|--------|-----------|-----|--------|
| cached_auths | set/get | evicts old/expired keys | 10 | 1 for each new transaction |
| batched_reports | safe_add/incr | returns an error | | 1 for each transaction that returns 200; if the key exists, the existing key is updated. All keys are flushed after the batch_report_seconds timeout (default 10s) |
| batched_reports_locks | set/delete | evicts old/expired keys | None | 1 for each transaction, but the lock is released in the same function |
| api_keys | get/set/delete | evicts old/expired keys | None | |

So we can see that all of them grow equally, but because batched_reports uses safe_add and holds reports for 10 seconds, it is the only one that will return a "no memory" error.

A possible workaround is to set batch_report_seconds to a lower value.

2. Do I need to increase the batcher policy storage?

Let's do a small test and increase the key size:

| Shared dict | Size | Key format | Credential length (bytes) | Key/value pairs |
|-------------|------|------------|---------------------------|-----------------|
| batched_reports | 20m | service_id:,user_key:,metric: | 60 | 81409 |
| | | | 120 | 81409 |
| | | | 142 | 40705 |
| | | | 400 | 20400 |

With a key of ~400 bytes and the default report window of 10s, completely filling batched_reports would require 20400/10 = 2040 req/sec. It's very unlikely that a single gateway will be hit with that much traffic.
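
The same back-of-the-envelope arithmetic for a couple of other combinations, using the pair counts measured above (plain Lua, worst case where every request creates a new key):

-- Requests/sec needed to completely fill batched_reports before it is flushed.
local function req_per_sec_to_fill(pairs_held, batch_report_seconds)
    return pairs_held / batch_report_seconds
end

print(req_per_sec_to_fill(20400, 10))  -- ~400-byte keys, default 10s window  -> 2040 req/s
print(req_per_sec_to_fill(20400, 5))   -- same keys, batch_report_seconds = 5 -> 4080 req/s
print(req_per_sec_to_fill(81409, 10))  -- 60-byte credentials, default window -> ~8141 req/s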

@eguzki do you know what is the highest load a single gateway can handle?

Verification Steps

Filling the storage is a bit tricky, so I just checked that the configuration file is rendered with the correct value.

$ grep -nr "batched_reports" /tmp

/tmp/lua_PRpxLW:67:lua_shared_dict batched_reports 40m;
/tmp/lua_PRpxLW:68:lua_shared_dict batched_reports_locks 1m;
tkan145 commented 6 months ago

Also fix the test steps.

> Question: The shared dict is shared for all the 3scale products? So if I have 20m, is it shared between me and other 3scale users?

Yes, the shared dict is shared between workers and across all 3scale products, and perhaps between users as well.

> Maybe I would add some documentation with your tests, saying what you can get for the default values of the policy and the new env var for several key sizes. Then the same for half/double of the default value of batch_report_seconds. Same thing for half/double of the default value of the new env var.

Where do you think that doc should live? Inside the top-level doc or inside the policy?

eguzki commented 6 months ago

> Inside the top-level doc or inside the policy?

I would say in the specific readme for the batcher policy: https://github.com/3scale/APIcast/blob/master/gateway/src/apicast/policy/3scale_batcher/README.md

tkan145 commented 6 months ago

Thanks @dfennessy. I will need your approval also