erlang / otp

Erlang/OTP
http://erlang.org
Apache License 2.0
11.4k stars 2.95k forks source link

ERL-376: Unpredictable HIPE failures under NetBSD/amd64 #3541

Closed OTP-Maintainer closed 3 years ago

OTP-Maintainer commented 7 years ago

Original reporter: richn Affected versions: OTP-19.0, OTP-19.3 Fixed in version: OTP-20.0 Component: hipe Migrated from: https://bugs.erlang.org/browse/ERL-376


Up through OTP 18.3, when starting the Erlang VM on my NetBSD/amd64 machine, I would occasionally get the warning:

{code}
hipe_bifs_alloc_data_2: erts_alloc(2) returned 0x7f7ff48b1124 which is not 8-byte aligned
{code}

I recently tried to upgrade to OTP 19.2 and found this error is now fatal and causes a crash dump. The reason is a HIPE NOTSUP error due to the memory not being aligned properly (despite it working in previous versions.)

I tracked it down to {{hipe_bifs_alloc_data_2()}} in {{erts/emulator/hipe/hipe_bif0.c}}. This function uses {{erts_alloc()}} to grab memory, but fails if it isn't 8-byte aligned. What's puzzling is that the runtime includes {{erts_sys_alloc_align()}} which can enforce the required alignment instead of crashing.

As of now, 64-bit NetBSD systems cannot reliably use HIPE after OTP 18.3.
OTP-Maintainer commented 7 years ago

kostis said:

I think there is a simple explanation for this.  First of all, AFAIK at least, the issue with seeing this warning was never reported before this post. This is not very surprising given that there are very few users of Erlang/OTP, let alone HiPE, on NetBSD.  It's definitely _not_ a platform we test.  Second, the {{erts_sys_aligned_alloc()}} (you are most likely referring to this function) was added to the system only about 3-4 years ago, long after the {{hipe_bifs0}} code was written.  So nobody updated that code.

What happens if you substitute the call to {{erts_alloc()}} with {{erts_sys_aligned_alloc()}} ?
OTP-Maintainer commented 7 years ago

richn said:

I was going to do just that, and submit a patch, but ... {{erts_alloc()}} takes a parameter that {{erts_sys_aligned_alloc()}} doesn't. Also, the return value looks like it needs {{erts_sys_aligned_free()}} to properly free it. Since {{hipe_bifs0()}}'s return value has the same type as other functions', it's not obvious to know when to call {{erts_free()}} or {{erts_sys_aligned_free()}}. That's when I thought I'd let HIPE experts take a crack at it. Sorry.
OTP-Maintainer commented 7 years ago

richn said:

@kostis :

{quote}First of all, AFAIK at least, the issue with seeing this warning was never reported before this post. This is not very surprising given that there are very few users of Erlang/OTP, let alone HiPE, on NetBSD. It's definitely not a platform we test.{quote}

Understandable. I am actually surprised that the memory allocator on 64-bit NetBSD still aligns on 4-byte boundaries. I'll ask that community if it's intended. (I haven't seen this problem on my 32-bit NetBSD systems.)

Regardless, if the code requires an alignment and there's a way to specify the alignment, crashing is not the correct solution.
OTP-Maintainer commented 7 years ago

sverker said:

The warning says {{erts_alloc()}} was called for only 2 bytes and I guess that's why NetBSD
deems it ok to not return 8 byte aligned.

The call to hipe_bifs_alloc_data_2 is made from hipe_unified_loader.erl to get memory for constant terms.
The value 2 is written in the beam file by the hipe compiler and that is not enough to store the smallest of terms,
which is kind of fishy.

One fix/workaround for this is to change the HIPE allocations to use erts own allocators
which are using {{mmap()}} (by default if available) and guarantees 8 byte alignment for all
allocations:

{noformat}
diff --git a/erts/emulator/beam/erl_alloc.types b/erts/emulator/beam/erl_alloc.types
index 6e8710e..e756fc6 100644
--- a/erts/emulator/beam/erl_alloc.types
+++ b/erts/emulator/beam/erl_alloc.types
@@ -346,7 +346,7 @@ type        SL_MPATHS       SHORT_LIVED     SYSTEM          sl_migration_paths
 +if hipe

 # Currently most hipe code use this type.
-type   HIPE            SYSTEM          SYSTEM          hipe_data
+type   HIPE            LONG_LIVED      SYSTEM          hipe_data

 +if exec_alloc
 type   HIPE_EXEC       EXEC            CODE            hipe_code
{noformat}

I will do this fix for master as I see no reason why HIPE should use
malloc/free for its allocations which are all long lived.
OTP-Maintainer commented 7 years ago

kostis said:

Sounds a good idea.  Perhaps Rich can check the fix once you have done this change.  (Although this does not solve his immediate problem which is to use a 19.x on his machine.)
OTP-Maintainer commented 7 years ago

sverker said:

Here is another smaller fix based on the assumption that only small allocations may not be aligned.

{noformat}
diff --git a/erts/emulator/hipe/hipe_bif0.c b/erts/emulator/hipe/hipe_bif0.c
index c0ebc34..827ec67 100644
--- a/erts/emulator/hipe/hipe_bif0.c
+++ b/erts/emulator/hipe/hipe_bif0.c
@@ -434,6 +434,8 @@ BIF_RETTYPE hipe_bifs_alloc_data_2(BIF_ALIST_2)
     nrbytes = unsigned_val(BIF_ARG_2);
     if (nrbytes == 0)
        BIF_RET(make_small(0));
+    if (nrbytes < align)
+        nrbytes = align;
     block = erts_alloc(ERTS_ALC_T_HIPE, nrbytes);
     if ((unsigned long)block & (align-1)) {
        fprintf(stderr, "%s: erts_alloc(%lu) returned %p which is not %lu-byte aligned\r\n",

{noformat}

We just released 19.3 which is last planned service release for 19.
I could put the above patch in the pipe for 19.3.1.

@[~kostis] I still think it's strange the hipe compiled beam file wants 2 bytes for constant data.
I looked at hipe_unified_loader.er and if I understood it correctly the smallest unit
it writes to that allocated data is 4 bytes.

@[~RichN] Do you know which loaded module that lead to that crash and can you publish its Erlang source.
OTP-Maintainer commented 7 years ago

richn said:

@sverker
{quote}We just released 19.3 which is last planned service release for 19. I could put the above patch in the pipe for 19.3.1.{quote}

We're running well with OTP 18.3 and can wait until OTP 20 to upgrade the system. I just didn't want NetBSD to be orphaned on 18.x. :)

{quote}Do you know which loaded module that lead to that crash and can you publish its Erlang source.{quote}

I need to rebuild 19.3 to recreate the problem. I think I remember which source file caused it, but I don't know where in the file because I just looked at the stack trace which was deep in the HiPE code (which is how I found {{hipe_bif0()}}).
OTP-Maintainer commented 7 years ago

kostis said:

We spent some time looking at this today.  We think that the initial behavior reported is very weird and needs to be investigated because it may show some other, deeper, problem.

First of all, these allocations are not really for _constants_ per se.  Instead, at least one place where they occur is whenever _jump tables_ are created.  For example, in the following program:
{code:erlang}
t(X) ->
  case X of
    1 -> one;
    2 -> two;
    3 -> three;
    4 -> four;
    5 -> five;
    6 -> six;
    7 -> seven;
    8 -> eight;
    9 -> nine
  end.
{code}
which will create a jump table for the switch and will result in a call to hipe_bifs_alloc_data_2() with nrbytes = 72.

On a system configured with --enable-native-libs we see calls to hipe_bifs_alloc_data_2 of this form:
{noformat}
hipe_bifs_alloc_data_2: erts_alloc(688) call
hipe_bifs_alloc_data_2: erts_alloc(61) call
hipe_bifs_alloc_data_2: erts_alloc(42) call
hipe_bifs_alloc_data_2: erts_alloc(1616) call
hipe_bifs_alloc_data_2: erts_alloc(976) call
hipe_bifs_alloc_data_2: erts_alloc(576) call
...
hipe_bifs_alloc_data_2: erts_alloc(25512) call
hipe_bifs_alloc_data_2: erts_alloc(96) call
hipe_bifs_alloc_data_2: erts_alloc(96) call
hipe_bifs_alloc_data_2: erts_alloc(12) call
hipe_bifs_alloc_data_2: erts_alloc(121) call
{noformat}
So in all cases the numbers are much bigger than just 2.

Is it possible to get some test case that shows how the 2 is generated?  In the meantime, we'll try to see whether there are calls to this function in other circumstances that we are not aware of.
OTP-Maintainer commented 7 years ago

kostis said:

We investigated further and found out how sub-word constants can (and should be) created when compiling a file to native code.
An example is in the code:
{code:erlang}
t(X) -> <<X, "42">>.
{code}

A pull request (#1386 -- here: https://github.com/erlang/otp/pull/1386), containing really minimal changes at this point, most likely fixes the issue, but this needs to be confirmed by somebody with access to a NetBSD/amd64 machine.  Also, we need to check that the changes do not cause any issues in other platforms (esp. ARM).
OTP-Maintainer commented 7 years ago

richn said:

@sverker
{quote}Do you know which loaded module that lead to that crash and can you publish its Erlang source.{quote}

Here's a sample module that dumps the contents of binaries in several formats:

{code:erlang}
-module(dump).

-export([binary/1, binary/2, binary_swap/1, binary_swap/2,
         binary_short/1, binary_short/2]).

%% Modes of display
-define(NORMAL, 'Normal').
-define(BYTE_SWAP, 'Swap').
-define(SHORT, 'Short').

hc(C) -> element((C band 16#f) + 1, {$0, $1, $2,$3, $4, $5, $6, $7,
                                     $8, $9,$a, $b, $c, $d, $e, $f}).

dump_char(Ch) when Ch < 16#20 orelse Ch > 16#7e -> $.;
dump_char(Ch) -> Ch.

hex(Mode, Tmp) ->
    case Mode of
        ?NORMAL ->
            << <<(hc(X bsr 4)), (hc(X)), 16#20>> ||
                <<X>> <= Tmp >>;
        ?BYTE_SWAP ->
            << <<(hc(Y bsr 4)), (hc(Y)), 16#20, (hc(X bsr 4)), (hc(X)), 16#20>> ||
                <<X,Y>> <= Tmp >>;
        ?SHORT ->
            << <<(hc(Y bsr 4)), (hc(Y)), (hc(X bsr 4)), (hc(X)), 16#20>> ||
                <<X,Y>> <= Tmp >>
    end.

dump_row(_, _, _, <<>>) ->
    [];
dump_row(Hdr, Offset, Mode, Tmp) ->

    Hex = hex(Mode, Tmp),
    [Hdr,
     <<(hc(Offset bsr 12)), (hc(Offset bsr 8)), (hc(Offset bsr 4)), (hc(Offset)),
       ": ">>,
     Hex,
     binary:copy(<<16#20>>, 49 - size(Hex)),
     << <<(dump_char(X))>> || <<X>> <= Tmp >>,
     16#0A].

binary_(Hdr, Offset, Mode, <<Chunk:16/binary, Rest/binary>>) ->
    [dump_row(Hdr, Offset, Mode, Chunk) | binary_(Hdr, Offset + 16, Mode, Rest)];
binary_(Hdr, Offset, Mode, Rest) ->
    dump_row(Hdr, Offset, Mode, Rest).

-spec binary(string() | binary()) -> iolist().

binary(X) -> binary(X, 0).

-spec binary(string() | binary(), non_neg_integer()) -> iolist().

binary(List, Indent) when is_integer(Indent), is_list(List) ->
    binary_(binary:copy(<<16#20>>, Indent), 0, ?NORMAL, list_to_binary(List));
binary(Bin, Indent) when is_integer(Indent), is_binary(Bin) ->
    binary_(binary:copy(<<16#20>>, Indent), 0, ?NORMAL, Bin).

-spec binary_swap(string() | binary()) -> iolist().

binary_swap(X) -> binary_swap(X, 0).

-spec binary_swap(string() | binary(), non_neg_integer()) -> iolist().

binary_swap(List, Indent) when is_integer(Indent), is_list(List) ->
    binary_(binary:copy(<<16#20>>, Indent), 0, ?BYTE_SWAP, list_to_binary(List));
binary_swap(Bin, Indent) when is_integer(Indent), is_binary(Bin) ->
    binary_(binary:copy(<<16#20>>, Indent), 0, ?BYTE_SWAP, Bin).

-spec binary_short(string() | binary()) -> iolist().

binary_short(X) -> binary_short(X, 0).

-spec binary_short(string() | binary(), non_neg_integer()) -> iolist().

binary_short(List, Indent) when is_integer(Indent), is_list(List) ->
    binary_(binary:copy(<<16#20>>, Indent), 0, ?SHORT, list_to_binary(List));
binary_short(Bin, Indent) when is_integer(Indent), is_binary(Bin) ->
    binary_(binary:copy(<<16#20>>, Indent), 0, ?SHORT, Bin).
{code}

When I compile this with OTP 19.2 on NetBSD/amd64, I get this:

{noformat}
$ erlc +native dump.erl
hipe_bifs_alloc_data_2: erts_alloc(496) returned 0x7f7fade01210 which is not 128-byte aligned

=ERROR REPORT==== 22-Mar-2017::13:52:27 ===
Native loading of /usr/local/erlang-otp-19.2/lib/erlang/lib/hipe-3.15.3/ebin/hipe.beam failed:
{'EXIT', {notsup, [{hipe_bifs, alloc_data, [8, 496], []},
                   {hipe_unified_loader, create_data_segment, 4, [{file, "hipe_unified_loader.erl"}, {line, 707}]},
                   {hipe_unified_loader, load_common, 5, [{file, "hipe_unified_loader.erl"}, {line, 220}]},
                   {hipe_unified_loader, load_native_code, 3, [{file, "hipe_unified_loader.erl"},  {line, 111}]},
                   {code_server, try_load_module_2, 6, [{file, "code_server.erl"},  {line, 1131}]},
                   {code_server, loop, 1, [{file, "code_server.erl"}, {line, 154}]}]}}
dump.erl: internal error in native_compile;
crash reason: undef

  in function  hipe:compile/4
     called as hipe:compile(dump,[], <<70,79,82, ... ,26,9,20>>, [])
  in call from compile:native_compile_1/1 (compile.erl, line 1405)
  in call from compile:'-internal_comp/4-anonymous-1-'/2 (compile.erl, line 321)
  in call from compile:fold_comp/3 (compile.erl, line 347)
  in call from compile:internal_comp/4 (compile.erl, line 331)
  in call from compile:'-do_compile/2-anonymous-0-'/2 (compile.erl, line 179)
{noformat}

This is different behavior from when I opened this issue. Somehow I actually built these libraries and the crash occurred when I tried to run the code in the VM. Now I can't even compile it.

I'm building your *git* branch to see if your changes fix the problem.
OTP-Maintainer commented 7 years ago

richn said:

I built your pull-request on GitHub and things are looking very good. I'm able to compile all files without any crashes and when I start up our Erlang systems (which use boot scripts), I don't get any memory alignment warnings or errors.

Thanks for fixing it!
OTP-Maintainer commented 7 years ago

sverker said:

Fix merged to master for OTP-20 at
https://github.com/erlang/otp/commit/173cc4865cf0183242904283ea68626b5900ff08