Open ruslandoga opened 10 months ago
I've also tried profiling the ETS backend and noticed that the Hammer.Backend.ETS workers don't actually perform any work, so poolboy seems to be pure overhead. When profiling a poolboy process during a k6
run, I get the following from eprof:
****** Process <0.462.0> -- 100.00 % of profiled time ***
FUNCTION CALLS % TIME [uS / CALLS]
-------- ----- ------- ---- [----------]
gen_server:try_handle_info/3 1596 0.12 465 [ 0.29]
queue:filter/2 1590 0.16 657 [ 0.41]
queue:filter_f/2 3322 0.17 664 [ 0.20]
lists:split/2 3187 0.17 692 [ 0.22]
poolboy:'-handle_info/2-fun-0-'/2 6425 0.20 800 [ 0.12]
poolboy:handle_info/2 1596 0.23 934 [ 0.59]
lists:reverse/2 6374 0.27 1065 [ 0.17]
lists:split/3 11967 0.32 1291 [ 0.11]
queue:filter_r/2 6283 0.45 1806 [ 0.29]
queue:in/2 16622 0.58 2323 [ 0.14]
gen_server:reply/2 44368 1.17 4687 [ 0.11]
gen_server:try_dispatch/3 45958 1.21 4841 [ 0.11]
queue:out/1 44368 1.46 5834 [ 0.13]
gen_server:handle_common_reply/8 62580 1.60 6420 [ 0.10]
gen_server:try_handle_call/4 44372 1.86 7459 [ 0.17]
ets:match/2 1596 1.99 7965 [ 4.99]
gen_server:try_handle_cast/3 44362 2.15 8615 [ 0.19]
poolboy:handle_checkin/2 44368 2.82 11300 [ 0.25]
ets:lookup/2 44362 3.21 12871 [ 0.29]
poolboy:handle_call/3 44372 3.26 13034 [ 0.29]
ets:delete/2 44368 3.50 14031 [ 0.32]
gen_server:decode_msg/9 90330 3.60 14419 [ 0.16]
gen_server:handle_msg/6 90330 4.27 17103 [ 0.19]
poolboy:handle_cast/2 44362 4.70 18809 [ 0.42]
ets:insert/2 44368 5.95 23828 [ 0.54]
erlang:monitor/2 44372 7.51 30054 [ 0.68]
erlang:demonitor/1 44362 14.33 57358 [ 1.29]
gen:reply/2 44368 16.02 64132 [ 1.45]
gen_server:loop/7 90330 16.72 66924 [ 0.74]
--------------------------------- ------- ------- ------ [----------]
Total: 1016858 100.00% 400381 [ 0.39]
Note that time spent in :ets calls is a small fraction of the total time spent, which I assume goes to :poolboy / :gen internals.
Would there be any interest in optimizing the ETS backend? I'd be happy to work on that. The first step would be (I think) to add benchmarks and then compare the performance with and without poolboy to check if my guesses are right.
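As a starting point, something like the Benchee comparison below (a rough, hypothetical sketch; the file name and key are made up, and Benchee won't reproduce the overload itself since it waits on each call) should at least show the per-call cost difference between going through poolboy and touching ETS directly:

# hypothetical bench/poolboy_vs_ets.exs; assumes Hammer is started with the ETS backend
# and that benchee is available in the bench env
table = :ets.new(:direct_buckets, [:public, :set, write_concurrency: true])

Benchee.run(
  %{
    "check_rate through poolboy" => fn ->
      Hammer.check_rate("bench:key", 60_000, 1_000_000)
    end,
    "bare :ets.update_counter" => fn ->
      # a counter bump with no process in the hot path
      :ets.update_counter(table, "bench:key", {2, 1}, {"bench:key", 0})
    end
  },
  parallel: 50
)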
I've been able to reproduce the problem outside of Plausible with this script
1..1_000_000
|> Task.async_stream(
  fn i ->
    if rem(i, 666) == 0 do
      # sample, since IO.inspect on every call is too slow and lets Poolboy clear the queue
      {t, r} =
        :timer.tc(fn ->
          Hammer.check_rate("ingest:site:loadtest.site", 60_000, 1_000_000)
        end)

      IO.inspect(t, label: i)
      r
    else
      # the "actual overload" happens in this "fast" branch
      Hammer.check_rate("ingest:site:loadtest.site", 60_000, 1_000_000)
    end
  end,
  ordered: false,
  max_concurrency: 500
)
|> Stream.run()
You might need to play with the max_concurrency: 500 and rem(i, 666) numbers to make Poolboy overload gradually, like with the k6 benchmark. On my system, these values recreate the Plausible/k6 situation:
MIX_ENV=bench mix run bench/ets.exs
Note that this can't be reproduced in Benchee since it waits for each request to complete which allows Poolboy to keep the queue empty.
Hello!
Thanks for the detailed report. I took over the project and there is a bunch of work I want to find time for eventually. I have not run into this issue yet, but yeah, the perf cost is either poolboy or ETS, since we bypass the GenServer in the latest version and talk to ETS directly.
That being said, poolboy is configurable. Have you tried tweaking it and making the pool bigger just to see how it behaves?
Removing poolboy might be tricky because we do need it for solutions like Redis and other backends, so having a pool in front is helpful. We could consider removing the pool and seeing what happens. One thing, though:
ETS is not shared across VMs, so I'm not sure how many people use it, because you will not be able to do rate limiting correctly with multiple pods (or it will have to be approximate, with a limit that depends on the number of pods).
I'll see if I can spend some time on this in the coming weeks, but if you have an MR or some ideas, feel free to submit them as well.
👋
That being said, poolboy is configurable. Have you tried tweaking it and making the pool bigger just to see how it behaves?
That would just delay the inevitable, I think. As https://github.com/ExHammer/hammer/issues/71#issuecomment-1822702957 shows, the overload would happen anyway once Poolboy stops keeping up; the pool size largely doesn't matter unless it matches the number of clients.
Removing poolboy might be tricky because we do need it for solutions like Redis and other backends, so having a pool in front is helpful. We could consider removing the pool and seeing what happens.
The pooling strategy could be moved to the adapter, the Ecto way.
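A rough sketch of what I mean (module and option names are made up, this is not the current Hammer API): each backend owns its own pool, or none, the way an Ecto adapter owns its connection pool:

# hypothetical: the backend decides whether it needs a pool at all
defmodule MyHammer.Backend.Redis do
  # Redis needs connections, so this backend starts its own poolboy under the app supervisor
  def child_spec(opts) do
    pool_size = Keyword.get(opts, :pool_size, 10)

    :poolboy.child_spec(
      __MODULE__,
      [name: {:local, __MODULE__}, worker_module: MyHammer.Backend.Redis.Worker, size: pool_size],
      opts
    )
  end
end

defmodule MyHammer.Backend.ETS do
  # ETS needs no pool: one process owns a public table and callers write to it directly
  def child_spec(_opts) do
    %{
      id: __MODULE__,
      start:
        {Agent, :start_link,
         [fn -> :ets.new(:my_hammer_buckets, [:named_table, :public, write_concurrency: true]) end]}
    }
  end
end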
ETS is not shared across VMs, so I'm not sure how many people use it
you will not be able to do rate limiting correctly with multiple pods
It could actually be worked around with sticky sessions or Phoenix PubSub.
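For the PubSub idea, a hypothetical sketch (module, table, and topic names are made up): each node counts hits in its own public ETS table and broadcasts them, so the other nodes can fold them in; the resulting counts are approximate rather than transactional.

defmodule MyApp.RateLimit.Replicator do
  use GenServer

  @topic "rate_limit:hits"

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # called from the request path: count locally, then tell the other nodes
  def record_hit(key) do
    count = :ets.update_counter(:rate_limit_buckets, key, {2, 1}, {key, 0})
    Phoenix.PubSub.broadcast(MyApp.PubSub, @topic, {:hit, key, node()})
    count
  end

  @impl true
  def init(_opts) do
    :ets.new(:rate_limit_buckets, [:named_table, :public, write_concurrency: true])
    Phoenix.PubSub.subscribe(MyApp.PubSub, @topic)
    {:ok, %{}}
  end

  # ignore our own broadcasts; they were already counted in record_hit/1
  @impl true
  def handle_info({:hit, _key, from}, state) when from == node(), do: {:noreply, state}

  def handle_info({:hit, key, _from}, state) do
    # fold in hits reported by other nodes
    :ets.update_counter(:rate_limit_buckets, key, {2, 1}, {key, 0})
    {:noreply, state}
  end
end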
I'll see if I can spend some time on this in the coming weeks, but if you have an MR or some ideas, feel free to submit them as well.
I'm happy to work on this since it seems to directly affect Plausible; I just wanted to check with the project maintainers first :)
I've started exploring various approaches and the current fastest one is PlugAttack: https://github.com/ruslandoga/rate_limit
Atomics seem promising as well. Maybe :hammer could provide two backends by default: ETS (based off of PlugAttack) and atomics.
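For the atomics idea, a minimal sketch of a fixed-window counter (module and function names are made up; this is not an existing Hammer backend) where the hot path is a single :atomics.add_get/3, with no process or ETS lookup involved:

# hypothetical fixed-window counter on :atomics; index 1 holds the count, index 2 the window start (ms)
defmodule AtomicsWindowCounter do
  def new, do: :atomics.new(2, signed: false)

  # returns {:allow, count} or {:deny, count} for the current window
  def hit(ref, window_ms, limit, now_ms \\ System.system_time(:millisecond)) do
    window_start = now_ms - rem(now_ms, window_ms)

    # roll the window over; a race here can drop a few hits, which is fine for a sketch
    if :atomics.get(ref, 2) != window_start do
      :atomics.put(ref, 2, window_start)
      :atomics.put(ref, 1, 0)
    end

    count = :atomics.add_get(ref, 1, 1)
    if count <= limit, do: {:allow, count}, else: {:deny, count}
  end
end

The ref would still need to live somewhere reachable (e.g. :persistent_term) and you'd need one per bucket key, which is where it gets less straightforward than ETS.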
I am not too familiar with atomics yet, but it sounds like a good idea.
I am not against moving the pooling strategy; we will need to cut a major version for it and update all of the backends then.
@ruslandoga are you still working on splitting the MR? There are changes I saw you had that I was going to tackle, but if you are not, then I'll start working on them.
Hi @epinault! I'm currently not working on this. I do plan to return to it this month though.
Sorry for disappearing and not keeping my promise :) I'm going to continue working on my PRs later this week... in August... later this year!
Describe the bug
I've been benchmarking hot paths in https://github.com/plausible/analytics and noticed that the ExHammer check here becomes slower over time, from 11us at the beginning of the k6 benchmark (it hits the same site every time, so :hammer_ets_buckets has a single key throughout the benchmark) to over 100ms at the end. And in some requests (when user-agents are cached) that check_rate call becomes the largest slice in the request flamechart.