
Compiler Performance Regression in SSA Sink Optimization #5808

Open davisp opened 2 years ago

davisp commented 2 years ago

Describe the bug

There's a performance regression in the Erlang compiler that started in Erlang/OTP 22 and still exists on master today. From what I can tell, the issue is that the SSA sink optimization for get_tuple_element maintains unnecessary references to tuples on the stack. In some very specific circumstances this leads to code taking significantly more time and RAM to execute.

Given that the behavior is still technically correct, I'm not 100% certain whether I'd call this a bug or just a performance regression. The scenario is fairly difficult to trigger on purpose, but it can certainly have a sizable impact on performance. There were real-world reports from CouchDB users of operations that took less than a second on Erlang 20 failing to complete on Erlang 23 after chewing through tens of gigabytes of RAM.

After narrowing down a minimal test case (included inline below), I believe the issue is visible in this difference between the Erlang assembly generated without and with the SSA sink optimization:

@@ -357,19 +358,19 @@
   {label,37}.
     {test,is_tuple,{f,39},[{x,1}]}.
     {test,test_arity,{f,39},[{x,1},3]}.
-    {allocate,2,2}.
-    {get_tuple_element,{x,1},0,{x,2}}.
-    {get_tuple_element,{x,1},1,{y,1}}.
-    {get_tuple_element,{x,1},2,{y,0}}.
-    {move,{x,2},{x,1}}.
+    {allocate,1,2}.
+    {move,{x,1},{y,0}}.
+    {get_tuple_element,{x,1},0,{x,1}}.
     {line,[{location,"stack_test.erl",39}]}.

If I'm reading that correctly, the unoptimized version stores two of the three tuple elements on the stack and passes the remaining element directly to the body-recursive call. In the optimized version we instead store a single reference to the entire tuple and pass its first element to the recursive call. My theory is that this extra reference to the tuple on the stack prevents garbage collection from clearing out many of the intermediate versions of that first tuple element. And if, as in our particular test case, that value grows large, the code can end up using significantly more RAM with the optimization.
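
To make that concrete, the source-level shape that appears to correspond to label 37 boils down to something like this (a distilled, illustrative sketch of the reproducer below; the names here are mine, not taken from the module):

%% Illustrative sketch only: Acc is a 3-tuple and only its first element
%% changes across the call. With the sink optimization, the whole Acc tuple
%% is kept in a stack slot across visit/2, so the previous Seen value it
%% still points at can't be garbage collected while visit/2 runs. Without
%% the optimization, only Keep1 and Keep2 are kept on the stack and the old
%% Seen can be collected as soon as visit/2 is done with it.
step(Child, Acc) ->
    {Seen, Keep1, Keep2} = Acc,
    {NewSeen, _, _} = visit(Child, Seen),
    {NewSeen, Keep1, Keep2}.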

I should mention that hitting this particular edge case is fairly difficult to engineer. I was unable to come up with a test case from scratch and ended up having to reduce an example that was found in the wild. Even just shortening the tuple to two elements prevents the sink optimization from keeping a reference to the tuple on the stack.

To Reproduce

I've included a reproducing module at the bottom of this issue. To see the bug in action you just need to compile with and without the SSA sink optimization, like so:

$ erlc stack_test.erl && erl -noshell -eval 'io:format("~p~n", [stack_test:run()]), init:stop().'
{4693,6203.939636230469}
$ erlc +dssappt +no_ssa_opt_sink stack_test.erl && erl -noshell -eval 'io:format("~p~n", [stack_test:run()]), init:stop().'
{907,213.61334228515625}

On my ancient MacBook Pro the optimized version takes roughly 5 seconds and 6 GiB of RAM to execute. Disabling the optimization reduces the execution time to roughly 1 second and the memory usage to about 200 MiB.

Expected behavior

Use less RAM.

Affected versions

The issue was introduced in 6bee2ac7d11668888d93ec4f93730bcae3e5fa79 and still exists on master as of 3f45eead8cbcc7226d8a3cd8da9002eb7ef5515e.

Additional context


-module(stack_test).

-export([
    run/0
]).

run() ->
    run(7500, 0.0008).

run(Depth, BranchChance) ->
    rand:seed(exrop, {1647,841737,351137}),

    Tree = node(Depth, BranchChance),

    erlang:garbage_collect(),

    MPid = spawn_mem_sampler(self(), 500),

    T1 = erlang:monotonic_time(),
    visit(Tree, sets:new()),    
    T2 = erlang:monotonic_time(),

    Max = get_max_mem(MPid),
    unlink(MPid),
    exit(MPid, kill),

    {deltaT(T2, T1), Max}.

visit([], Seen) ->
    {sets:add_element(rand:uniform(), Seen), ignore1, ignore2};
visit(Children, Seen0) ->
    Seen1 = sets:add_element(rand:uniform(), Seen0),
    lists:foldl(fun(Child, Acc) ->
        {SeenAcc, Ignore1, Ignore2} = Acc,
        {NewSeenAcc, _, _} = visit(Child, SeenAcc),
        {NewSeenAcc, Ignore1, Ignore2}
    end, {Seen1, ignore1, ignore2}, Children).

node(0, _) ->
    [];
node(Depth, BranchChance) ->
    case rand:uniform() < BranchChance of
        true ->
            [
                node(Depth - 1, BranchChance),
                node(Depth - 1, BranchChance)
            ];
        false ->
            [node(Depth - 1, BranchChance)]
    end.

deltaT(T0, T1) ->
    erlang:convert_time_unit(T0 - T1, native, millisecond).

spawn_mem_sampler(Pid, DtMsec) ->
    spawn_link(fun() -> mem_sampler(Pid, DtMsec, mem_mb(Pid)) end).

mem_sampler(Pid, DtMsec, Max0) ->
    timer:sleep(DtMsec),
    Max = max(mem_mb(Pid), Max0),
    receive
        {get_mem, From} ->
            From ! {mem_max, Max}
    after 0 ->
            ok
    end,
    mem_sampler(Pid, DtMsec, Max).

mem_mb(Pid) ->
    {memory, Words} = erlang:process_info(Pid, memory),
    Bytes = Words * erlang:system_info(wordsize),
    Bytes / (1024 * 1024).

get_max_mem(Pid) ->
    Pid ! {get_mem, self()},
    receive {mem_max, Max} -> Max end.

This is the full diff of the Erlang assembly generated without the optimization vs. with the optimization (both generated with erlc 23.3.4.11):

--- stack_test.no-sink.S    2022-03-22 11:05:41.000000000 -0500
+++ stack_test.opt.S    2022-03-22 11:05:33.000000000 -0500
@@ -240,9 +240,9 @@
     {loop_rec,{f,20},{x,0}}.
     {test,is_tagged_tuple,{f,19},[{x,0},2,{atom,get_mem}]}.
     {test_heap,3,1}.
-    {get_tuple_element,{x,0},1,{x,0}}.
     remove_message.
     {put_tuple2,{x,1},{list,[{atom,mem_max},{y,0}]}}.
+    {get_tuple_element,{x,0},1,{x,0}}.
     {line,[{location,"stack_test.erl",71}]}.
     send.
     {jump,{f,21}}.
@@ -266,13 +266,14 @@
     {move,{atom,memory},{x,1}}.
     {line,[{location,"stack_test.erl",79}]}.
     {call_ext,2,{extfunc,erlang,process_info,2}}.
+    {move,{x,0},{y,0}}.
     {test,is_tagged_tuple,{f,24},[{x,0},2,{atom,memory}]}.
-    {get_tuple_element,{x,0},1,{y,0}}.
     {move,{atom,wordsize},{x,0}}.
     {line,[{location,"stack_test.erl",80}]}.
     {call_ext,1,{extfunc,erlang,system_info,1}}.
+    {get_tuple_element,{y,0},1,{x,1}}.
     {line,[{location,"stack_test.erl",80}]}.
-    {gc_bif,'*',{f,0},1,[{y,0},{x,0}],{x,0}}.
+    {gc_bif,'*',{f,0},2,[{x,1},{x,0}],{x,0}}.
     fclearerror.
     {line,[{location,"stack_test.erl",81}]}.
     {fconv,{x,0},{fr,0}}.
@@ -357,19 +358,19 @@
   {label,37}.
     {test,is_tuple,{f,39},[{x,1}]}.
     {test,test_arity,{f,39},[{x,1},3]}.
-    {allocate,2,2}.
-    {get_tuple_element,{x,1},0,{x,2}}.
-    {get_tuple_element,{x,1},1,{y,1}}.
-    {get_tuple_element,{x,1},2,{y,0}}.
-    {move,{x,2},{x,1}}.
+    {allocate,1,2}.
+    {move,{x,1},{y,0}}.
+    {get_tuple_element,{x,1},0,{x,1}}.
     {line,[{location,"stack_test.erl",39}]}.
     {call,2,{f,6}}.
     {test,is_tuple,{f,38},[{x,0}]}.
     {test,test_arity,{f,38},[{x,0},3]}.
     {test_heap,4,1}.
+    {get_tuple_element,{y,0},1,{x,1}}.
+    {get_tuple_element,{y,0},2,{x,2}}.
     {get_tuple_element,{x,0},0,{x,0}}.
-    {put_tuple2,{x,0},{list,[{x,0},{y,1},{y,0}]}}.
-    {deallocate,2}.
+    {put_tuple2,{x,0},{list,[{x,0},{x,1},{x,2}]}}.
+    {deallocate,1}.
     return.
   {label,38}.
     {line,[{location,"stack_test.erl",39}]}.
davisp commented 2 years ago

I should also mention that in CouchDB we patched around this issue by replacing an instance of sets with a map, which ended up being faster across the board regardless of this optimization issue.
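
(For reference, the shape of that workaround was roughly the following. This is a hedged sketch rather than the actual CouchDB patch, with Key standing in for whatever value is being tracked:)

%% Hypothetical sketch of the workaround, not the actual CouchDB change:
%% track seen keys in a map instead of a sets set.
add_seen_with_sets(Key, Seen) ->
    sets:add_element(Key, Seen).

add_seen_with_map(Key, Seen) ->
    Seen#{Key => true}.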

davisp commented 2 years ago

Reading through the sink optimization code more today, this looks to be basically a duplicate of #4092 (fixed in 720afbc021e272d70c083075fa860e8751a13fcc), where it was noticed that lists:mapfoldl/3 was considerably slower on OTP 22 because the accumulator was being kept alive.

I think the more complete fix would be to not retain any reference to a tuple past the last reference to one of its elements. This is only my second day looking at the SSA optimization code, so I have absolutely no idea how complicated that logic might be. I'll try to take a poke at it tomorrow and see if anything shakes out.

davisp commented 2 years ago

Here's a patch that appears to fix the issue:

diff --git a/lib/compiler/src/beam_ssa_opt.erl b/lib/compiler/src/beam_ssa_opt.erl
index da9ebc0118..1e3110dc80 100644
--- a/lib/compiler/src/beam_ssa_opt.erl
+++ b/lib/compiler/src/beam_ssa_opt.erl
@@ -39,8 +39,8 @@

 -include("beam_ssa_opt.hrl").

--import(lists, [all/2,append/1,droplast/1,duplicate/2,flatten/1,foldl/3,
-                keyfind/3,last/1,mapfoldl/3,member/2,
+-import(lists, [all/2,any/2,append/1,droplast/1,duplicate/2,flatten/1,
+                foldl/3,keyfind/3,last/1,mapfoldl/3,member/2,
                 partition/2,reverse/1,reverse/2,
                 splitwith/2,sort/1,takewhile/2,unzip/1]).

@@ -2415,8 +2415,17 @@ filter_deflocs([{Tuple,DefLoc0}|DLs], Preds, Blocks) ->
                           map_get(FromTo, WillGC)
                   end, Paths),

-    %% Avoid potentially harmful sinks.
-    DefLocGC = filter_gc_deflocs(DefLocGC0, Tuple, First, Preds, Blocks),
+    %% If Tuple isn't on the stack, check if any of our
+    %% sunk get_tuple_element results will be denied the chance
+    %% to be GC'ed
+    DelayedGC = not is_on_stack(First, Tuple, Blocks) andalso
+        any(fun({_,{Var,{_,To}}}) ->
+                   not is_on_stack(To, Var, Blocks)
+            end, DefLocNoGC),
+
+    %% Sink anything as far as possible without delaying GC
+    %% for any extracted terms
+    DefLocGC = filter_gc_deflocs(DefLocGC0, Preds, Blocks, DelayedGC),

     %% Construct the complete list of sink operations.
     DefLoc1 = DefLocGC ++ DefLocNoGC,
@@ -2424,48 +2433,34 @@ filter_deflocs([{Tuple,DefLoc0}|DLs], Preds, Blocks) ->
         filter_deflocs(DLs, Preds, Blocks);
 filter_deflocs([], _, _) -> [].

-%% Use an heuristic to avoid harmful sinking in lists:mapfold/3 and
-%% similar functions.
-filter_gc_deflocs(DefLocGC, Tuple, First, Preds, Blocks) ->
+%% Avoid delaying garbage collection by keeping a reference
+%% to a tuple on the stack when one or more of its elements
+%% might be GC'ed.
+filter_gc_deflocs(DefLocGC, _Preds, _Blocks, false) ->
+    DefLocGC;
+filter_gc_deflocs(DefLocGC, Preds, Blocks, true) ->
     case DefLocGC of
         [] ->
             [];
-        [{_,{_,{From,To}}}] ->
-            %% There is only one get_tuple_element instruction that
-            %% can be sunk. That means that we may not gain any slots
-            %% by sinking it.
-            case is_on_stack(First, Tuple, Blocks) of
-                true ->
-                    %% The tuple itself must be stored in a stack slot
-                    %% (because it will be used later) in addition to
-                    %% the tuple element being extracted. It is
-                    %% probably a win to sink this instruction.
-                    DefLocGC;
+        [{_,{_,{From,To}}} | Rest] ->
+            case will_gc(From, To, Preds, Blocks, false) of
                 false ->
-                    case will_gc(From, To, Preds, Blocks, false) of
-                        false ->
-                            %% There is no risk for recursive calls,
-                            %% so it should be safe to
-                            %% sink. Theoretically, we shouldn't win
-                            %% any stack slots, but in practice it
-                            %% seems that sinking makes it more likely
-                            %% that the stack slot for a dying value
-                            %% can be immediately reused for another
-                            %% value.
-                            DefLocGC;
-                        true ->
-                            %% This function could be involved in a
-                            %% recursive call. Since there is no
-                            %% obvious reduction in the number of
-                            %% stack slots, we will not sink.
-                            []
-                    end
-            end;
-        [_,_|_] ->
-            %% More than one get_tuple_element instruction can be
-            %% sunk. Sinking will almost certainly reduce the number
-            %% of stack slots.
-            DefLocGC
+                    %% There is no risk for recursive calls,
+                    %% so it should be safe to
+                    %% sink. Theoretically, we shouldn't win
+                    %% any stack slots, but in practice it
+                    %% seems that sinking makes it more likely
+                    %% that the stack slot for a dying value
+                    %% can be immediately reused for another
+                    %% value.
+                    [DefLocGC | filter_gc_deflocs(Rest, Preds, Blocks, true)];
+                true ->
+                    %% This function could be involved in a
+                    %% recursive call. Since there is no
+                    %% obvious reduction in the number of
+                    %% stack slots, we will not sink.
+                    filter_gc_deflocs(Rest, Preds, Blocks, true)
+            end
     end.

 find_paths_to_check([{_,{_,To}}=Move|T], First) ->
davisp commented 2 years ago

I should note that I haven't got a clue about the proper checklist for submitting a PR to the compiler, of all places. If there's a process write-up somewhere, could you point me at it? That's also assuming that no one looks at this and sees something obviously wrong with the logic.

frej commented 2 years ago

Last year I did some work on sinking. The then-accepted wisdom was that in almost all cases it is preferable to avoid increasing the size of the stack frame by deferring the deconstruction of a tuple to later in the program.

With the proposed patch you will, for this particular example, do the right thing. But in most cases you will deconstruct the tuple early and put its elements in the stack frame. For example, since you remove the is_on_stack check, you will no longer sink get_tuple_element instructions past a call that has already forced the tuple into the stack frame, thus needlessly increasing the size of the stack frame.
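
Concretely (a purely illustrative sketch; do_call/1 and use/3 are made-up helpers): when the tuple is used again after the call, it needs a stack slot across the call regardless, so sinking the extraction past the call avoids spending a second slot on the extracted element.

%% Illustrative only. Tuple is used after do_call/1, so it must be kept in
%% a stack slot across the call no matter what. Sinking the extraction of A
%% to after the call means only Tuple occupies a slot during the call;
%% refusing to sink keeps both Tuple and A in stack slots, which is the
%% needless growth of the stack frame described above.
example(Tuple, X) ->
    {A, _} = Tuple,
    Res = do_call(X),
    use(Tuple, A, Res).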

As you have noticed, triggering this particular deficiency of the instruction scheduling is pretty hard, so I suspect that the patch will make things worse for the common case. Have you run any benchmarks to check that it doesn't?

In this particular case you could work around the problem by splitting the accumulator into a nested tuple like this:

visit([], Seen) ->
    {sets:add_element(rand:uniform(), Seen), {ignore1, ignore2}};
visit(Children, Seen0) ->
    Seen1 = sets:add_element(rand:uniform(), Seen0),
    lists:foldl(fun(Child, Acc) ->
        {SeenAcc, {Ignore1, Ignore2}} = Acc,
        {NewSeenAcc, {_, _}} = visit(Child, SeenAcc),
        {NewSeenAcc, {Ignore1, Ignore2}}
    end, {Seen1, {ignore1, ignore2}}, Children).

It will produce additional instructions, and cost you an extra word on the heap, but compared to the cost of retaining a much larger live set in your particular application, it produces a speedup equal to disabling ssa_opt_sink.

Solving this properly would require us to estimate the size of the retained garbage and trade it off against the increased stack size when deconstructing the tuple early.

davisp commented 2 years ago

@frej Good catch on accidentally increasing the stack frame sizes when the tuple is on the stack. I was mostly concerned with the GC aspect when trying to fix this edge case, but you're absolutely right that it would increase stack frame sizes in those cases.

I'm definitely aware that there's a lot of subtlety around changing these bits. Trying to work through the logic the other day I was constantly reminding myself that I have absolutely no clue how often various combinations like this actually exist in the wild and that I might be royally messing up a bunch of optimizations where GC doesn't matter.

In terms of benchmarking, I've done little more than compile a few different test cases to show that it mostly behaves as intended. I haven't even convinced myself that the logic is correct yet, as I don't have any idea how to run the compiler test suite. Beyond that, I was hoping there was at least a standard way to run some benchmarks to collect performance data. I don't have any good ideas off the top of my head for how I'd pick a representative selection of cases that would be affected by this change.

A couple of other ideas also came to mind. The first was to create some sort of option, similar to +bin_opt_info, that would list places developers might want to look at for improving performance. It seems like detecting and reporting when delayed GC is a possibility would be safer than attempting to be even more clever about moving those get_tuple_element instructions.

The other idea was to prevent the GC delay by zeroing out any element that could benefit from GC. But that would mean introducing new variables into the SSA code, which seemed like a rather bad idea, not to mention it's well beyond my current understanding of the compiler internals.

Either way, that patch is mostly me trying to understand how these things fit together rather than a serious proposal for inclusion. I'm assuming the folks who stare at this code regularly will have a much better understanding of all the nuance and will know whether something like this is even a possibility.

bjorng commented 2 years ago

@davisp Thanks for your bug report.

I should note that I haven't got a clue on the proper checklists for submitting a PR to the compiler of all places. If there's a process write up somewhere could you point me at it?

You can find contribution guidelines here:

https://github.com/erlang/otp/blob/master/CONTRIBUTING.md

There are links on that page to further tips on development and testing.

For compiler development, there is an incredibly useful script in the repository called scripts/diffable. If you run it without any arguments, it will print information about how to use it along with examples.

Regarding your reported bug, @jhogberg and I have discussed it briefly and reached the conclusion that there is probably no easy fix. The solution seems to be to do a more thorough analysis of the entire module and only do the optimization if it can be seen that the optimization can't do any significant harm. We hope that the optimization will still be applied in most of the places where it is applied today. We will not attempt to fix this bug for OTP 25.

davisp commented 2 years ago

@bjorng Awesome, thanks for the update. I'll take a look at that script on Monday.

Yesterday I managed to write a (somewhat hacky) patch that adds a +ssa_opt_info argument to erlc and prints warnings when it detects delayed garbage collection due to the sink optimization. Setting that flag when compiling CouchDB lists 232 places where it might be an issue. It's a fairly coarse warning, since it only gives the function where the issue might be, and so far all of my spot checking has turned up false alarms.

I'll keep poking at those warnings and let you know if I spot anything else interesting.