erlang / otp

Erlang/OTP
http://erlang.org
Apache License 2.0
11.4k stars 2.95k forks source link

Internal compiler error in ssa_opt_type_continue #7509

Open frej opened 1 year ago

frej commented 1 year ago

Describe the bug

Erlc can sometimes crash with a back-trace like this:

Sub pass ssa_opt_type_continue
Function: '-f3/1-lc$^0/1-3-'/1
test362604.erl: internal error in pass beam_ssa_opt:
exception error: no function clause matching beam_types:join([]) 
  in function  beam_ssa_type:join_arg_types/3 (beam_ssa_type.erl, line 449)
  in call from beam_ssa_type:opt_continue/4 (beam_ssa_type.erl, line 432)
  in call from beam_ssa_opt:ssa_opt_type_continue/1 (beam_ssa_opt.erl, line 449)
  in call from compile:run_sub_passes_1/3 (compile.erl, line 424)
  in call from beam_ssa_opt:phase/4 (beam_ssa_opt.erl, line 116)
  in call from beam_ssa_opt:fixpoint/6 (beam_ssa_opt.erl, line 99)
  in call from beam_ssa_opt:run_phases/3 (beam_ssa_opt.erl, line 85)

To Reproduce Unless the patch below is applied, the crash appears to be dependent on the phase of the moon :) With the patch applied, the reproducer (found with Erlfuzz, thanks @RobinMorisset ) triggers the error reliably.

diff --git a/lib/compiler/src/beam_ssa_type.erl b/lib/compiler/src/beam_ssa_type.erl
index d492ed4b50..9d8b965bd3 100644
--- a/lib/compiler/src/beam_ssa_type.erl
+++ b/lib/compiler/src/beam_ssa_type.erl
@@ -379,7 +379,7 @@ init_sig_st(StMap, FuncDb) ->
              wl=wl_defer_list(Roots, wl_new()) }.

 init_sig_roots(FuncDb) ->
-    [Id || Id := #func_info{exported=true} <- FuncDb].
+    [Id || Id := #func_info{exported=true} <- maps:iterator(FuncDb, ordered)].

 init_sig_args([Root | Roots], StMap, Acc) ->
     #opt_st{args=Args0} = map_get(Root, StMap),

Affected versions At least maint (fe0571ecad3dd01609f2bd237a38816912fbd23f)

Additional context

As far as I can tell, beam_ssa_type:signatures_1/2 can produce a different #sig_st{} and FuncDb depending on the order of the initial worklist in the #sig_st{}. This, in turn, leads to number of list-comprehension functions being eliminated from optimization when the compilation succeeds, but kept around when the crash occurs.

RobinMorisset commented 1 year ago

Thanks for the mention! This looks like it might be a duplicate of #7478 ?

frej commented 1 year ago

This looks like it might be a duplicate of #7478 ?

Looks very likely, I wonder what I did wrong when I searched for beam_types:join among the issues yesterday and didn't find this one?

Permuting the order of functions returned by init_sig_roots(FuncDb) does not fix the problem in #7478 , probably because we have only one exported function which calls the list comprehension.

jhogberg commented 1 year ago

Crashing on beam_ssa_type:join_arg_types/3 in this manner means that previously unreachable code has suddenly become reachable, which is not supposed to happen (other way around is fine). The cause is nearly always some silly small thing that accidentally widens a type.

In the absence of bugs like these, the result will be equivalent regardless of work order, so I prefer to leave it undefined to help shake more bugs out. I’ll look deeper into it today. :)

jhogberg commented 1 year ago

It looks like this has the same root cause as #7342 / #6599 :(

frej commented 1 year ago

I couldn't keep away from scratching this itch, #7512 fixes this crash and the crashes in #7478.

I took a quick look at the reproducers in #7342 and #6599 but for those, #7512 does nothing. So I don't think I agree that this is the same root case as the latter issues :)

jhogberg commented 1 year ago

Like I said before, this kind of crash happens when another bug suddenly makes previously unreachable code reachable. That bug is #7342/#6599 in this case (failure to propagate type information between tests on different variables relating in some way to the same value).

7512 fixes the crash by sweeping the bug under the rug, which I don't want to do in the general case because the code being unreachable at first could have been an error. It happens to be safe in this particular case, but we can't say that it will always be so.

frej commented 1 year ago

7512 fixes the crash by sweeping the bug under the rug, which I don't want to do in the general case because the code being unreachable at first could have been an error. It happens to be safe in this particular case, but we can't say that it will always be so.

My take is the exact opposite. The crashes in the reproducers of #7478 and #7512 are because an unreachable function is, on entry to beam_ssa_type:opt_start/2, considered reachable, is still considered reachable by beam_ssa_type:opt_continue/4 but lacks the meta data a reachable function should have in #func_info.arg_types in the function info database (as that is only updated when a call is encountered).

My analysis is that on entry to opt_start the function we'll crash in (call it f), is considered reachable . Running signatures/2 doesn't change that. opt_start_1 traverses all functions in the module and calls opt_function which will use the type knowledge we have to traverse the function's CFG. If we reach a local call, opt_local_call() will update the argument types for the callee in the function info database. In the reproducers, opt_local_call() is never called with f as a callee and hence arg_types for it will never be updated.

Unless we prune functions which become unreachable when we use type knowledge beam_ssa_type:opt_continue/4 will crash as soon as it calls beam_ssa_type:join_arg_types/3. From trace printouts I can see that at no point in time f has anything but a list of empty dictionaries in #func_info.arg_types, so this is clearly not a case of previously unreachable code becoming reachable.

The only bug I can see being swept under the rug with #7512, is if something is wrong with the type analysis so that it considers reachable code unreachable. But in that case, what stops us from crashing when it is right and a function which looks reachable turns out not to be, during beam_ssa_type:opt_start/2?

jhogberg commented 1 year ago

You're looking at the wrong part of the code, the initial problem occurs much earlier during signatures/2 where a function that used to be reachable under a narrow argument type ("argument 2 only be 'false'") suddenly becomes unreachable with a wider argument type ("argument 2 can be 'false' | 'ok'") in the middle of the analysis. Note that at this stage the invariant is inverted: previously reachable code cannot be made unreachable.

This confuses the heck out of opt_start/2 which treats it as reachable during pruning and unreachable during its analysis (which adds the argument types used by opt_continue/4).

Pruning the functions once more as in #7512 hides the bug in signatures/2.

frej commented 1 year ago

I see, that makes sense. I'll withdraw #7512.