Closed bryanjos closed 3 months ago
Hey @bryanjos can you post a benchee
script or similar somewhere that compares this? The select
should be using only the table key, so in theory the performance of both what .select
and your iterative approach should be both nLog(m)
. That does assume :ets
is smart enough though, maybe it isn't.
It'd be good to see some specific examples to demonstrate this.
@benwilson512 I'll try to put that together.
A teammate made this change so I might not have the full picture. I think the issue stems from select
using the match spec it's using combined with the fact that the registry has duplicate keys.
There's a not at the bottom of the docs for select that I think summarizes the issue Note that for large registries with many partitions this will be costly as it builds the result by concatenating all the partitions.
@benwilson512 I put together a benchee script and recorded the output in this gist. The proposed change is definitely faster, but does use more memory
in practice, i believe lookup
should be O(log(n))
, and select
should be O(n)
, both multiplied by the number of registry partitions. it's basically a full table scan vs an index tree traversal lookup. the truly pathological worst case for a lookup
on a duplicate_bag
would be when all tuples in the table have the same key, which is when it'd become O(n)
.
it'd be a neat PR into the erlang codebase to (compile-time?) warn that the match spec [{:==, :"$1", id}]
is bad mojo.
for what it's worth, this was a service killer for us, it brought down entire nodes, whatever is going on under the covers with select
seems to entirely lock up the VM at a not-unreasonable level of concurrency, it wouldn't even service disterl for rpc health checks. i'm guessing there's something in ets that doesn't yield back to the scheduler (yay bifs). :(
hey @koudelka sorry to hear that it caused such issues. I'm happy to move forward with the PR. There is still a linear part to this PR, because have to call lookup
n
times, once for each returned doc ID. However probably the reason this is still faster is that n
is still wildly less than the total size of the registry, and when we do select
we I guess have to scan the entire registry.
I have a few "code golf" items I'll remark on and then we can head towards a merge.
@benwilson512 Looks like the made a small distance in the right direction. I'll update the PR to use a list as the accumulator and pipe into Map.new
Operating System: macOS
CPU Information: Apple M1 Pro
Number of Available Cores: 10
Available memory: 16 GB
Elixir 1.17.0
Erlang 27.0
JIT enabled: true
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 2 s
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 42 s
Benchmarking Using MapSet and lookup ...
Benchmarking Using list accumulator ...
Benchmarking Using select ...
Calculating statistics...
Formatting results...
Name ips average deviation median 99th %
Using list accumulator 9.25 108.11 ms ±6.78% 104.94 ms 149.45 ms
Using MapSet and lookup 9.05 110.49 ms ±3.31% 109.03 ms 126.19 ms
Using select 0.0125 79835.89 ms ±0.00% 79835.89 ms 79835.89 ms
Comparison:
Using list accumulator 9.25
Using MapSet and lookup 9.05 - 1.02x slower +2.38 ms
Using select 0.0125 - 738.49x slower +79727.79 ms
Memory usage statistics:
Name average deviation median 99th %
Using list accumulator 152.54 MB ±0.01% 152.53 MB 152.60 MB
Using MapSet and lookup 158.87 MB ±0.00% 158.87 MB 158.87 MB
Using select 31.18 MB ±0.00% 31.18 MB 31.18 MB
Comparison:
Using list accumulator 152.53 MB
Using MapSet and lookup 158.87 MB - 1.04x memory usage +6.33 MB
Using select 31.18 MB - 0.20x memory usage -121.36031 MB
I am very puzzled at the 5x difference in memory performance between what we have today and the proposed changes. Let me see if I can poke at this and figure out what the cause is.
I'm thinking it's probably the use of the data structures and size of the examples. There probably is something that could tweak that could be made to balance out space and time
Related to #1329, we noticed that looking in the registry would take a long time, making the unbounded concurrency issue worse as processes would wait for the lookups here to finish. This removes the table scan and replaces it with a MapSet instead.