erlang / otp

Erlang/OTP
http://erlang.org
Apache License 2.0
11.3k stars 2.94k forks source link

Segmentation Fault on String.replace/4 #7492

Closed sekiyama58 closed 1 year ago

sekiyama58 commented 1 year ago

Describe the bug Applying Elixir's String.replace/4 to some specific concatenations strings causes segmentation faults.

To Reproduce Following Dockerfile reproduces the segmentation fault:

FROM hexpm/elixir:1.15.2-erlang-26.0.2-alpine-3.18.2
WORKDIR /app

RUN <<EOF
cat <<EOS >mix.exs
defmodule Test.MixProject do
  use Mix.Project
  def project do
    [
      app: :test,
      version: "0.1.0",
      releases: [test: [include_executables_for: [:unix]]]
    ]
  end
  def application do
    [mod: {Test.Application, []}]
  end
end
EOS
EOF

RUN mkdir lib
RUN <<EOF
cat <<EOS >lib/application.ex
defmodule Test.Application do
  use Application
  def start(_type, _args) do
    "123#{:test}---#{123456789012}------------#{:abc}------------------------#{20000000}"
    |> IO.inspect()
    |> String.replace(~r/\d+/, "?")
    |> IO.puts()
    {:error, :test_only}
  end
end
EOS
EOF

RUN mix release --path release
RUN release/bin/test start

Result:

$ docker build .
...
 > [7/7] RUN release/bin/test start:
#0 0.613 "123test---123456789012------------abc------------------------20000000"
------
Dockerfile:41
--------------------
  39 |
  40 |     RUN mix release --path release
  41 | >>> RUN release/bin/test start
  42 |
--------------------
ERROR: failed to solve: process "/bin/sh -c release/bin/test start" did not complete successfully: exit code: 139

↑ Crashed after IO.inspect() with exit code 139 (segmentation fault)

It might be memory corruption?

Affected versions Confirmed with

I am also experiencing the segmentation fault with debian:bookworm images with similar cases, but it seems more complex conditions are needed to cause segmentation fault in debian. I also couldn't reproduced this with IEx, so the Dockerfile uses mix release.

jhogberg commented 1 year ago

Thanks for your report! Please share the .beam file of the Test.Application module, or better yet the Erlang code generated by the Elixir compiler for said module.

sekiyama58 commented 1 year ago

Hi @jhogberg, please clone from here: https://gist.github.com/sekiyama58/0fc434064c8a92e57eb5f501fc4ba963

jhogberg commented 1 year ago

Thanks! I was hoping that the error could be seen in the reproducer module alone, but that doesn't seem to be the case and the error is most likely further down. Just as a sanity check, does that debian:bookworm image you mentioned use musl?

Either way I think I'm going to need help from the Elixir folks on this, we need a reproducer in pure Erlang that doesn't rely on Docker.

cc @josevalim

sekiyama58 commented 1 year ago

Just as a sanity check, does that debian:bookworm image you mentioned use musl?

No, Debian is based on glibc. In Debian, the same String.replace causes a SEGV after the module initialization in my production code, but I could not reproduce it in simplified code like the Dockerfile.

Also, it seems to happen only on aarch64 platform, but not in x86_64 platform. Might be arm JIT-related??

starbelly commented 1 year ago

FWIW here is going to be the the meat and the potatoes I believe. I suppose it's possible something is choking upstream, but unlikely, but I cover that none the less.

We start with the calling site, elixir expands "123#{:test}---#{123456789012}------------#{:abc}------------------------#{20000000}" into a binary like so (but omitting protocols):

    <<"123",
        case test of
            Arg1 when is_binary(_@1) ->
                Arg1;
            Arg1 ->
                atom_to_binary(Arg1)
        end/binary,
        "---",
        case 123456789012 of
            Arg2 when is_binary(Arg2) ->
                Arg2;
            Arg2 ->
                integer_to_binary(Arg2)
        end/binary,
        "------------",
        case abc of
            Arg3 when is_binary(Arg3) ->
                Arg3;
            Arg3 ->
                atom_to_binary(Arg3)
        end/binary,
        "------------------------",
        case 20000000 of
            Arg4 when is_binary(Arg4) ->
                Arg4;
            Arg4 ->
                integer_to_binary(Arg4)
        end/binary>>.

An elixir Regex struct gets built up as well, and passed along to String.replace/4. Not that it matters too much, but here's what that looks like :

#{
re_pattern=>{re_pattern,0,0,0,<<69,82,67,80,73,0,0,0,0,0,0,0,17,0,0,0,255,255,255,255,255,255,255,255,127,0,0,0,0,0,0,0,0,0,64,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,131,0,5,43,127,120,0,5,0>>},
opts=><<>>,
source=><<127,43>>,
'__struct__'=>'Elixir.Regex',
re_version=>{<<"8.44 2020-02-12">>,little}
}

Anyhow, the binary, the struct, the replacement, etc. get passed to String.replace/4, nothing interesting happens here, we just go straight to Regex.replace/4.

There's two paths to go down within this function :

  1. The compiled re is used when the version in the map matches the version returned by a function in the Regex module (i.e., {re:version(), erlang:system_info(endian)}.) .
  2. In the case that there is not a match, the regex (binary) is passed to re:run/3.

The code I show below reflects what happens in the case that there is no version mismatch. I also kept this simplified. But I'd be interested to know if Jose has a neat way to translate it all in one go (in-lining it all).

The only thing that comes into question for me is the re:run/3, the replacement in this case will be straight Erlang, excluding iolist_to_binary/1 ofc. I'd say a more likely suspect especially in regard to this being reproduced on Debian (although in docker) would be a PCRE issue.

Now, what's interesting to me here is the example provided to reproduce this issue. I'd be curious to know if this is actually how part of the app is started up. Maybe related updated on_load functionality introduced in OTP 26? If that were the case, I'd expect to be able to reproduce this locally, and I can not.

Any way, this was good fun and I hope this all helps 😄

-module(eh).

-export([eh/0]).

eh() ->
    S = <<"123test---123456789012------------abc------------------------20000000">>,
    %% Or just re:compile(...),
    Re =
        {re_pattern, 0, 0, 0,
            <<69, 82, 67, 80, 73, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 255, 255, 255, 255, 255, 255,
                255, 255, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 64, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 131, 0, 5, 95, 7, 120, 0, 5, 0>>},
    Opts = [{capture, all, index}, global],
    {match, List} = re:run(S, Re, Opts),
    Res = apply_list(S, [<<"?">>], List),
    iolist_to_binary(Res).

apply_list(Str, Replacment, List) ->
    apply_list(Str, Str, 0, Replacment, List).

apply_list(_, <<>>, _, _, []) ->
    [];
apply_list(_, Str, _, _, []) ->
    Str;
apply_list(
    Whole,
    Str,
    Pos,
    Replacment,
    [[{Mpos, _} | _] | _] = List
) when
    Mpos > Pos
->
    Len = Mpos - Pos,
    <<Untouched:Len/binary, Rest/binary>> = Str,
    [
        Untouched
        | apply_list(Whole, Rest, Mpos, Replacment, List)
    ];
apply_list(
    Whole,
    Str,
    Pos,
    Replacment,
    [[{Pos, Len} | _] = Head | Tail]
) ->
    <<_:Len/binary, Rest/binary>> = Str,
    NewData = apply_replace(Whole, Replacment, Head),
    [
        NewData
        | apply_list(
            Whole,
            Rest,
            Pos + Len,
            Replacment,
            Tail
        )
    ].

apply_replace(Str, {Fun, Arity}, Indexes) ->
    apply(Fun, get_indexes(Str, Indexes, Arity));
apply_replace(_, [Bin], _) when is_binary(Bin) ->
    Bin;
apply_replace(Str, Repl, Indexes) ->
    Indexes2 = list_to_tuple(Indexes),
    lists:map(
        fun(Part) ->
            case is_binary(Part) of
                true ->
                    Part;
                false ->
                    case
                        Part >=
                            tuple_size(Indexes2)
                    of
                        true ->
                            <<>>;
                        false ->
                            case true of
                                true ->
                                    get_index(
                                        Str,
                                        element(
                                            Part +
                                                1,
                                            Indexes2
                                        )
                                    );
                                false ->
                                    error(cond_clause)
                            end
                    end
            end
        end,
        Repl
    ).

get_index(_Str, {Pos, _Len}) when Pos < 0 ->
    <<>>;
get_index(Str, {Pos, Len}) ->
    <<_:Pos/binary, Res:Len/binary, _/binary>> = Str,
    Res.

get_indexes(_Str, _, 0) ->
    [];
get_indexes(Str, [], Arity) ->
    [<<>> | get_indexes(Str, [], Arity - 1)];
get_indexes(Str, [H | T], Arity) ->
    [
        get_index(Str, H)
        | get_indexes(Str, T, Arity - 1)
    ].
jhogberg commented 1 year ago

Thanks, I'm not getting much closer to the problem though -- I can't reproduce it with the minimized example :-(

As with https://github.com/erlang/otp/issues/7292 the lack of symbols in these docker images is making this really difficult. I'd really appreciate it if someone could try to reproduce this with a similar setup but with a debug-compiled emulator, or at the very least one with symbols intact.

sekiyama58 commented 1 year ago

I have rebuilt the elixir container with removing strip commands and pushed the image as sekiyama58/elixir-test:1.15.2-erlang-26.0.2-alpine-3.18-arm64. Changing the FROM line to this image, building the test Dockerfile without last RUN line, running docker run --ulimit core=-1 --privileged --rm -it <built image> /bin/sh and release/bin/test start generates the core dump on SEGV. With gdb launched by apk add gdb; gdb /usr/local/lib/erlang/erts-14.0.2/bin/beam.smp core , following stack trace was generated:

Thread 32 (LWP 41):
... (same empty call stacks) ...

Thread 4 (LWP 7):
#0  0x0000ffff85cffcc8 in ?? () from /lib/ld-musl-aarch64.so.1
#1  0x0000ffffd0246484 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 3 (LWP 70):
#0  0x0000ffff85cffccc in ?? () from /lib/ld-musl-aarch64.so.1
#1  0x0000ffff4569f430 in ?? ()
#2  0x0000000500000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 2 (LWP 68):
#0  0x0000ffff85cf1a24 in sched_yield () from /lib/ld-musl-aarch64.so.1
#1  0x0000aaaacbb0d888 in erts_thr_yield () at beam/erl_threads.h:2548
#2  sched_spin_wait (spincount=<optimized out>, ssi=0xffff44a0c900) at beam/erl_process.c:3063
#3  scheduler_wait (fcalls=fcalls@entry=0xffff3d68e95c, esdp=esdp@entry=0xffff44b0ef40, rq=rq@entry=0xffff44a0bb00) at beam/erl_process.c:3559
#4  0x0000aaaacbb19158 in erts_schedule (esdp=<optimized out>, esdp@entry=0xffff44b0ef40, p=<optimized out>, p@entry=0xffff44ec13a8, calls=<optimized out>, calls@entry=1) at beam/erl_process.c:9767
#5  0x0000aaaacbbaec60 in erts_dirty_process_main (esdp=esdp@entry=0xffff44b0ef40) at beam/beam_common.c:174
#6  0x0000aaaacbb03370 in sched_dirty_io_thread_func (vesdp=0xffff44b0ef40) at beam/erl_process.c:8771
#7  0x0000aaaacbdbcb34 in thr_wrapper (vtwd=0xffffd02463b0) at pthread/ethread.c:116
#8  0x0000ffff85d01494 in ?? () from /lib/ld-musl-aarch64.so.1
#9  0x0000ffff3d63b000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 1 (LWP 50):
#0  erts_emasculate_writable_binary (pb=pb@entry=0xffff3a9e2b60) at beam/erl_bits.c:1845
#1  0x0000aaaacbcfab68 in re_run (p=0xffff3d4152a8, arg1=281471665187682, arg2=281471852944978, arg3=<optimized out>, first=<optimized out>) at beam/erl_bif_re.c:1311
#2  0x0000ffff425858cc in ?? ()
#3  0x0000ffff42b06d60 in ?? ()
Backtrace stopped: not enough registers or memory available to unwind further

↑ Thread 1 caused SEGV.

Other things I have tried:

(gdb) p *pb
$7 = {thing_word = 676, size = 66, next = 0x2f6131747365742d, val = 0x2d2d2d2d2d2d2d32,
  bytes = 0x2d2d2d2d2d2d2d2d <error: Cannot access memory at address 0x2d2d2d2d2d2d2d2d>, flags = 0}

It's obviously broken ..

(gdb) up
#1  0x0000aaaab1c6ab68 in re_run (p=0xffff462328f0, arg1=281471669179778, arg2=281471872278242, arg3=<optimized out>,
    first=<optimized out>) at beam/erl_bif_re.c:1311

(gdb) p *p
$9 = {common = {id = 81155, refc = {atmc = {value = 1}, sint = 1}, tracer = 59, trace_flags = 0, timer = {value = 0},
    u = {alive = {started_interval = 0, reg = 0x0, links = 0xffff3d376900, lt_monitors = 0x0, monitors = 0x0},
      release = {later = 0, func = 0x0, data = 0xffff3d376900, next = 0x0}}}, htop = 0xffff3adb6ce0,
  stop = 0xffff3adb7030, fcalls = 3983, freason = 1024, fvalue = 59, heap = 0xffff3adb6a48, hend = 0xffff3adb7190,
  abandoned_heap = 0x0, heap_sz = 233, min_heap_size = 233, min_vheap_size = 46422, max_heap_size = 3, arity = 3,
  arg_reg = 0xffff462329b8, max_arg_reg = 6, def_arg_reg = {281471669179778, 281471872278242, 281471669201946, 0, 0,
    4000}, i = 0xffff437f51b0, catches = 3, rcount = 0, schedule_count = 0, reds = 753, flags = 4194304,
  group_leader = 81027, ftrace = 59, next = 0x0, uniq = 0, sig_qs = {first = 0x0, last = 0xffff46232a30,
    save = 0xffff46232a30, cont = 0x0, cont_last = 0xffff46232a48, nmsigs = {next = 0x0, last = 0x0},
    recv_mrk_blk = 0x0, len = 0, flags = 2}, bif_timers = 0x0, dictionary = 0x0, seq_trace_clock = 0,
  seq_trace_lastcnt = 0, seq_trace_token = 59, u = {real_proc = 0x2e70b, terminate = 0x2e70b, initial = {
      module = 190219, function = 87499, arity = 4}}, current = 0xffff407bcfc0, parent = 81027, static_flags = 0,
  high_water = 0xffff3adb6b70, old_hend = 0xffff3adb20b8, old_htop = 0xffff3adb1c30, old_heap = 0xffff3adb14f8,
  gen_gcs = 7, max_gen_gcs = 65535, off_heap = {first = 0xffff3adb6ad0, overhead = 16}, wrt_bins = 0x0, mbuf = 0x0,
  live_hf_end = 0xfffffffffffffff8, msg_frag = 0x0, mbuf_sz = 0, psd = {value = 0}, bin_vheap_sz = 46422,
  bin_old_vheap_sz = 46422, bin_old_vheap = 0, sys_task_qs = 0x0, dirty_sys_tasks = 0x0, state = {value = 41002},
  dirty_state = {value = 0}, sig_inq_contention_counter = 0, sig_inq = {first = 0x0, last = 0xffff46232b78, len = 0,
    nmsigs = {next = 0x0, last = 0x0}}, sig_inq_buffers = {value = 0}, trace_msg_q = 0x0, lock = {flags = {value = 1},
    queue = {0x0, 0x0, 0x0, 0x0, 0x0}}, scheduler_data = 0xffff45cafcc0, run_queue = {value = 281471852459008}}

(gdb) x/16x arg1-2
0xffff3adb1580: 0x000002a4      0x00000000      0x00000042      0x00000000
0xffff3adb1590: 0x7365742d      0x2f613174      0x2d2d2d32      0x2d2d2d2d
0xffff3adb15a0: 0x2d2d2d2d      0x2d2d2d2d      0x00000000      0x00000000
0xffff3adb15b0: 0x2d2d2d2d      0x2d2d2d2d      0x2d2d2d2d      0x2d2d2d2d

(gdb) x/16x arg2-2
0xffff46f61ee0: 0x00000140      0x00000000      0x0000898b      0x00000000
0xffff46f61ef0: 0x0000000f      0x00000000      0x0000000f      0x00000000
0xffff46f61f00: 0x0000000f      0x00000000      0x46f61f12      0x0000ffff
0xffff46f61f10: 0x00000160      0x00000000      0x00000049      0x00000000

Hope this gives some clues ...

jhogberg commented 1 year ago

Thanks, that's very helpful! We've somehow created an over-long heap binary.

Can you give https://github.com/jhogberg/otp/tree/john/erts/shady-segfault/GH-7492 a whirl? It should crash with an illegal instruction error if the problem is where I think it is.

Edit: if it crashes in this way, please run x/5i $pc-8 :)

sekiyama58 commented 1 year ago

As you said, it crashed with an illegal instruction.

This is the instructions around udf:

Program terminated with signal SIGILL, Illegal instruction.
#0  0x0000ffff77e30ef8 in ?? ()
[Current thread is 1 (LWP 50)]
(gdb) bt
#0  0x0000ffff77e30ef8 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) x/15i $pc-32
   0xffff77e30ed8:  mov x0, x25
   0xffff77e30edc:  mov x2, x0
   0xffff77e30ee0:  bl  0xffff77e3128c
   0xffff77e30ee4:  mov x3, #0x2014                 // #8212
   0xffff77e30ee8:  b.mi    0xffff77e30e88  // b.first
   0xffff77e30eec:  add x7, x7, x0
   0xffff77e30ef0:  cmp x7, #0x130
   0xffff77e30ef4:  b.ls    0xffff77e30efc  // b.plast
=> 0xffff77e30ef8:  udf #48813
   0xffff77e30efc:  lsr x7, x7, #3
   0xffff77e30f00:  add x10, x7, #0x7
   0xffff77e30f04:  and x10, x10, #0xfffffffffffffff8
   0xffff77e30f08:  add x8, x10, #0x30
   0xffff77e30f0c:  add x2, x23, x8
   0xffff77e30f10:  cmp x2, x20
jhogberg commented 1 year ago

Thank you, that's very interesting.

Can you source the $ERL_TOP/erts/etc/unix/etp-commands file and re-run x/15i $pc-32? That should tell us which module and function we're in. Sharing the .beam file of that module would be great too.

Edit: info registers would also be nice.

frej commented 1 year ago

Perhaps unrelated, but did you notice that the back-trace mentions /lib/ld-musl-aarch64.so.1 which doesn't look like glibc (as claimed by @sekiyama58) to me, @jhogberg?

jhogberg commented 1 year ago

Yes, I think it's pretty clear that the error is independent of which libc you're using (other than causing slightly different "weather" that just so happens to expose the bug more often). As far as I can tell it's due to an underestimation in the size of the resulting binary, the only remaining question now is why that happens.

sekiyama58 commented 1 year ago

(Note: I'm testing on alpine (musl) based image, not on debian (libc) based image, because it is easy to reproduce this.)

%---------------------------------------------------------------------------
% Use etp-help for a command overview and general help.
%
% To use the Erlang support module, the environment variable ROOTDIR
% must be set to the toplevel installation directory of Erlang/OTP,
% so the etp-commands file becomes:
%     $ROOTDIR/erts/etc/unix/etp-commands
% Also, erl and erlc must be in the path.
%---------------------------------------------------------------------------
etp-set-max-depth 20
etp-set-max-string-length 100
--------------- System Information ---------------
OTP release: 26
ERTS version: 14.0.2
--Type <RET> for more, q to quit, c to continue without paging--
Arch: aarch64-unknown-linux-musl
Endianness: Little
Word size: 64-bit
BeamAsm support: yes
SMP support: yes
Thread support: yes
Kernel poll: Supported and used
Debug compiled: no
Lock checking: no
Lock counting: no
Node name: test@9ed6b554a15d
Number of schedulers: 8
Number of async-threads: 1
--------------------------------------------------
(gdb) bt
#0  0x0000ffff50a40f38 in 'Elixir.Test.Application':start/2 () at lib/application.ex:4
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) x/15i $pc-32
   0xffff50a40f18 <'Elixir.Test.Application':start/2+164>:      mov     x0, x25
   0xffff50a40f1c <'Elixir.Test.Application':start/2+168>:      mov     x2, x0
   0xffff50a40f20 <'Elixir.Test.Application':start/2+172>:
    bl  0xffff50a412cc <Elixir.Test.Application::codeFooter+64>
   0xffff50a40f24 <'Elixir.Test.Application':start/2+176>:      mov     x3, #0x2014                     // #8212
   0xffff50a40f28 <'Elixir.Test.Application':start/2+180>:
    b.mi        0xffff50a40ec8 <'Elixir.Test.Application':start/2+84>  // b.first
   0xffff50a40f2c <'Elixir.Test.Application':start/2+184>:      add     x7, x7, x0
   0xffff50a40f30 <'Elixir.Test.Application':start/2+188>:      cmp     x7, #0x130
   0xffff50a40f34 <'Elixir.Test.Application':start/2+192>:
    b.ls        0xffff50a40f3c <'Elixir.Test.Application':start/2+200>  // b.plast
=> 0xffff50a40f38 <'Elixir.Test.Application':start/2+196>:      udf     #48813
   0xffff50a40f3c <'Elixir.Test.Application':start/2+200>:      lsr     x7, x7, #3
   0xffff50a40f40 <'Elixir.Test.Application':start/2+204>:      add     x10, x7, #0x7
   0xffff50a40f44 <'Elixir.Test.Application':start/2+208>:      and     x10, x10, #0xfffffffffffffff8
   0xffff50a40f48 <'Elixir.Test.Application':start/2+212>:      add     x8, x10, #0x30
   0xffff50a40f4c <'Elixir.Test.Application':start/2+216>:      add     x2, x23, x8
   0xffff50a40f50 <'Elixir.Test.Application':start/2+220>:      cmp     x2, x20
(gdb) info reg
x0             0x40                64
x1             0x0                 0
x2             0xffff4979a192      281471914451346
x3             0x2014              8212
x4             0xffff4ba3b858      281471950764120
x5             0xffff4979a1a8      281471914451368
x6             0x3030303030303032  3472328296227680306
x7             0x228               552
x8             0x8                 8
x9             0x0                 0
x10            0x63                99
x11            0xffff4aa64000      281471934152704
x12            0x155cb             87499
x13            0xffff4979a039      281471914451001
x14            0xffff4fbd1578      281472019535224
x15            0x0                 0
x16            0x3                 3
x17            0xffff933035c0      281473151153600
x18            0xffff9339f458      281473151792216
x19            0xffff4ba3b8c0      281471950764224
x20            0xffff4979a720      281471914452768
x21            0xffff4aa69d10      281471934176528
x22            0xf7a               3962
x23            0xffff4979a1a8      281471914451368
x24            0x1                 1
x25            0xffff4979a192      281471914451346
x26            0x3474b             214859
x27            0x13bcb             80843
x28            0xa18b              41355
x29            0xffff4ba3b8c0      281471950764224
x30            0xffff50a40f24      281472034672420
sp             0xffff4ba3b8c0      0xffff4ba3b8c0
pc             0xffff50a40f38      0xffff50a40f38 <'Elixir.Test.Application':start/2+196>
cpsr           0x20001000          [ EL=0 BTYPE=0 SSBS C ]
fpsr           0x10                [ IXC ]
fpcr           0x0                 [ Len=0 Stride=0 RMode=0 ]
tpidr          0xffff4ba3fbc0      0xffff4ba3fbc0
pauth_dmask    0x7f000000000000    35747322042253312
pauth_cmask    0x7f000000000000    35747322042253312

Elixir.Test.Application.beam is the same one as I shared in https://github.com/erlang/otp/issues/7492#issuecomment-1630727989 If you're interested in, more code dump is in the gist.

sekiyama58 commented 1 year ago

BTW, I needed to commented out the following line of etp-commands since it caused an error. (Maybe because I'm using arm machine?)

   5317   set disassembly-flavor intel
Error in sourced command file:
No symbol "disassembly" in current context.
jhogberg commented 1 year ago

Thanks, this is really weird as it differs wildly from what I see on my machine after loading the exact same module.

Can you set ERL_AFLAGS to +JDdump true and extract the generated Elixir.Test.Application.asm file?

sekiyama58 commented 1 year ago

@jhogberg Added Elixir.Test.Application.asm to the ~gist~. EDIT: sorry, it was a wrong file, the correct one is here

jhogberg commented 1 year ago

Thanks, I've added another commit with a few extra asserts that should help nail down the issue.

sekiyama58 commented 1 year ago

It seems the mix release with the added code generates broken _build/dev/rel/test/bin/test:

/app # xxd _build/dev/rel/test/bin/test
00000000: 2321 2f62 696e 2f73 680a 7365 7420 2d65  #!/bin/sh.set -e
00000010: 0a0a 5345 4c46 3d24 2872 6561 646c 696e  ..SELF=$(readlin
00000020: 6b20 2224 3022 207c 7c20 7472 7565 290a  k "$0" || true).
00000030: 6966 205b 202d 7a20 2224 5345 4c46 2220  if [ -z "$SELF"
00000040: 5d3b 2074 6865 6e20 5345 4c46 3d22 2430  ]; then SELF="$0
00000050: 223b 2066 690a 5245 4c45 4153 455f 524f  "; fi.RELEASE_RO
00000060: 4f54 3d22 2428 4344 5041 5448 3d27 2720  OT="$(CDPATH=''
00000070: 6364 2022 2428 6469 726e 616d 6520 2224  cd "$(dirname "$
00000080: 5345 4c46 2229 2f2e 2e22 2026 2620 7077  SELF")/.." && pw
00000090: 6420 2d50 2922 0a65 7870 6f72 7420 5245  d -P)".export RE
000000a0: 4c45 4153 455f 524f 4f54 0a52 454c 4541  LEASE_ROOT.RELEA
000000b0: 5345 5f4e 414d 453d 2224 7b52 454c 4541  SE_NAME="${RELEA
000000c0: 5345 5f4e 414d 453a 2d22 7465 7374 2d6d  SE_NAME:-"test-m
000000d0: 6f64 6520 2452 454c 4541 5345 5f4d 4f44  ode $RELEASE_MOD
000000e0: 4500 0700 0700 0700 0700 0700 0700 0700  E...............
000000f0: 0700 0700 0d00 0d00 0e00 0e00 0e00 0d00  ................
00000100: 0d00 0800 0700 0800 0800 0800 0700 0800  ................
...
/app # release/bin/test
release/bin/test: line 84: syntax error: unterminated quoted string
sekiyama58 commented 1 year ago

Copying previously mix release'd beams to the erlang image with your added code generates the Elixir.Test.Application.asm with some indifferent addresses changes.

etp-commands didn't work this time:

(gdb) source /OTP/subdir/erts/etc/unix/etp-commands
/OTP/subdir/erts/etc/unix/etp-commands:5341: Error in sourced command file:
Cannot access memory at address 0x3898e8
(gdb) x/15i $pc-32
   0xffff46778e58:      mov     x0, x25
   0xffff46778e5c:      mov     x2, x0
   0xffff46778e60:      bl      0xffff467791fc
   0xffff46778e64:      mov     x3, #0x2014                     // #8212
   0xffff46778e68:      b.mi    0xffff46778e08  // b.first
   0xffff46778e6c:      add     x7, x7, x0
   0xffff46778e70:      cmp     x7, #0x130
   0xffff46778e74:      b.ls    0xffff46778e7c  // b.plast
=> 0xffff46778e78:      udf     #48813
   0xffff46778e7c:      lsr     x7, x7, #3
   0xffff46778e80:      add     x10, x7, #0x7
   0xffff46778e84:      and     x10, x10, #0xfffffffffffffff8
   0xffff46778e88:      add     x8, x10, #0x30
   0xffff46778e8c:      add     x2, x23, x8
   0xffff46778e90:      cmp     x2, x20
jhogberg commented 1 year ago

:-/

I've pushed another commit that kills speculative size calculation altogether, if it still crashes after that then I'm starting to suspect that the compiler is broken. What does gcc --version (or clang --version) say?

sekiyama58 commented 1 year ago

Hmm, during make docs DOC_TARGETS=chunks I got a lot of messages like below and it didn't finish.

# /OTP/subdir/bin/aarch64-unknown-linux-musl/inet_gethost[3441]: WARNING:Unknown operation requested from erlang (255), message discarded
# /OTP/subdir/bin/aarch64-unknown-linux-musl/inet_gethost[3442]: WARNING:Unknown operation requested from erlang (0), message discarded

I have skipped the doc build but Elixir runtime failed to startup .

elixir -v
/usr/local/lib/erlang/erts-14.0.2/bin/inet_gethost[49]: WARNING:Unknown operation requested from erlang (255), message discarded.
Erlang/OTP 26 [erts-14.0.2] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [jit]

Runtime terminating during boot ({badarg,[{io,put_chars,[standard_io,[_]],[{_}]},{Elixir.Kernel.CLI,parse_argv,2,[{_},{_}]},{Elixir.Kernel.CLI,main,1,[{_},{_}]},{elixir,start_cli,0,[{_},{_}]},{init,start_em,1,[]},{init,do_boot,3,[]}]})

↑ I think it's just trying to print "Elixir 1.15.2 (compiled with Erlang/OTP 26)"

Alpine 3.18 gcc version is:

gcc (Alpine 12.2.1_git20220924-r10) 12.2.1 20220924
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I have also tried with Alpine 3.16, but got the same result.... gcc version is:

gcc (Alpine 11.2.1_git20220219) 11.2.1 20220219
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
jhogberg commented 1 year ago

What hardware are you running this on?

sekiyama58 commented 1 year ago

Running both on M1 Mac and on AWS Graviton 2 instances.

sekiyama58 commented 1 year ago

These are scripts I'm using for tests: https://gist.github.com/sekiyama58/08b41e2c9acb13a55460c38e429eed33

jhogberg commented 1 year ago

Thank you. Does the error disappear if you change -O2 to -O0 in CFLAGS?

sekiyama58 commented 1 year ago

Nope... ARG CFLAGS="-g -O0 -fstack-clash-protection ${CF_PROTECTION} ${PIE_CFLAGS}" still crash. Removing -fstack-clash-protection doesn't change the result either.

$ elixir -v
/usr/local/lib/erlang/erts-14.0.2/bin/inet_gethost[49]: WARNING:Unknown operation requested from erlang (255), message discarded.
Erlang/OTP 26 [erts-14.0.2] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [jit]

1.15.2h qg

↑ broken version string printed :-(

$ mix release --path release:                                                                                
#0 0.352 /usr/local/lib/erlang/erts-14.0.2/bin/inet_gethost[48]: WARNING:Unknown operation requested from erlang (0), message discarded.                                                                                                        
#0 8.789 Runtime terminating during boot ({badarg,[{io,put_chars,[standard_error,[_]],[{_}]},{Elixir.Kernel.CLI,exec_fun,2,[{_},{_}]},{Elixir.Kernel.CLI,run,1,[{_},{_}]},{elixir,start_cli,0,[{_},{_}]},{init,start_em,1,[]},{init,do_boot,3,[]}]})
#0 8.799 
#0 8.799 Crash dump is being written to: erl_crash.dump...done
sekiyama58 commented 1 year ago

Shouldn't this be estimated_num_bits += (ERL_ONHEAP_BIN_LIMIT + 1) * 8; ? https://github.com/jhogberg/otp/compare/5681b20..cf23f283a755e27c56b746b3fbf72acb18d11c69#diff-ab61c77258e07dcc74ed2c0ba81b6aaabfcc2428236b9f15a9f91fb6e2c707abR2323

jhogberg commented 1 year ago

Nope... ARG CFLAGS="-g -O0 -fstack-clash-protection ${CF_PROTECTION} ${PIE_CFLAGS}" still crash.

Thanks, can you disassemble BeamModuleAssembler::emit_i_bs_create_bin?

Also, if you could build and run the debug emulator too, that'd be great. After building OTP, run make emulator TYPE=debug and then add -emu_type debug to ERL_AFLAGS

Shouldn't this be estimated_num_bits += (ERL_ONHEAP_BIN_LIMIT + 1) * 8; ? https://github.com/jhogberg/otp/compare/5681b20..cf23f283a755e27c56b746b3fbf72acb18d11c69#diff-ab61c77258e07dcc74ed2c0ba81b6aaabfcc2428236b9f15a9f91fb6e2c707abR2323

Both += and = work here, as we only check whether it's <= (ERL_ONHEAP_BIN_LIMIT) later on.

sekiyama58 commented 1 year ago

disassembled: https://gist.github.com/sekiyama58/2bcf7e360bff00a3e26c70fa03402d50

sekiyama58 commented 1 year ago

With ERL_AFLAGS=-emu_type debug, erl didn't boot:

$ elixir -v
beam/jit/arm/instr_bs.cpp:3179:emit_i_bs_create_bin() Assertion failed: seg.effectiveSize >= 0
Aborted
$ erl
beam/jit/arm/instr_bs.cpp:3179:emit_i_bs_create_bin() Assertion failed: seg.effectiveSize >= 0
Aborted

$ BINDIR=/usr/local/lib/erlang/erts-14.0.2/bin gdb /usr/local/lib/erlang/erts-14.0.2/bin/erlexec
...
beam/jit/arm/instr_bs.cpp:3179:emit_i_bs_create_bin() Assertion failed: seg.effectiveSize >= 0

Thread 1 "beam.debug.smp" received signal SIGABRT, Aborted.
0x0000ffffa30de594 in sigsetjmp () from /lib/ld-musl-aarch64.so.1
(gdb) bt
#0  0x0000ffffa30de594 in sigsetjmp () from /lib/ld-musl-aarch64.so.1
#1  0x0000ffffa30de728 in raise () from /lib/ld-musl-aarch64.so.1
#2  0x0000ffffa30aecd8 in abort () from /lib/ld-musl-aarch64.so.1
#3  0x0000aaaae0c2d240 in erl_assert_error (expr=0xaaaae0d45638 "seg.effectiveSize >= 0", func=0xaaaae0d45468 "emit_i_bs_create_bin", file=0xaaaae0d45058 "beam/jit/arm/instr_bs.cpp", line=3179) at sys/unix/sys.c:959
#4  0x0000aaaae0951f70 in BeamModuleAssembler::emit_i_bs_create_bin (this=this@entry=0xffff5d96a680, Fail=..., Alloc=..., Live0=..., Dst=..., args=...) at beam/jit/arm/instr_bs.cpp:3179
#5  0x0000aaaae0925d1c in BeamModuleAssembler::emit (this=0xffff5d96a680, specific_op=<optimized out>, args=...) at aarch64-unknown-linux-musl/debug/jit/beamasm_emit.h:357
#6  0x0000aaaae0932bb8 in beamasm_emit (instance=<optimized out>, specific_op=<optimized out>, op=<optimized out>) at beam/jit/beam_jit_main.cpp:480
#7  0x0000aaaae0917c00 in beam_load_emit_op (stp=0xffff624cc200, tmp_op=0xffff62d385b0) at beam/jit/asm_load.c:569
#8  0x0000aaaae09c15bc in load_code (stp=0xffff624cc200) at beam/beam_load.c:596
#9  0x0000aaaae09c07c4 in erts_prepare_loading (magic=0xffff624cc1d0, c_p=0x0, group_leader=59, modp=0xfffff7341810, code=0xaaaae0d048c0 <preloaded_init> "FOR1", unloaded_size=25388) at beam/beam_load.c:193
#10 0x0000aaaae09c04ec in erts_preload_module (c_p=0x0, c_p_locks=0, group_leader=59, modp=0xfffff7341810, code=0xaaaae0d048c0 <preloaded_init> "FOR1", size=25388) at beam/beam_load.c:112
#11 0x0000aaaae09fa298 in load_preloaded () at beam/erl_init.c:584
#12 0x0000aaaae0a00bcc in erl_start (argc=15, argv=0xfffff7341ae8) at beam/erl_init.c:2496
#13 0x0000aaaae08eb950 in main (argc=15, argv=0xfffff7341ae8) at sys/unix/erl_main.c:30
(gdb) up 4
#4  0x0000aaaae0951f70 in BeamModuleAssembler::emit_i_bs_create_bin (this=this@entry=0xffff5d96a680, Fail=..., Alloc=..., Live0=..., Dst=..., args=...) at beam/jit/arm/instr_bs.cpp:3179
3179                ASSERT(seg.effectiveSize >= 0);
(gdb) @
Undefined command: "".  Try "help".
(gdb) list
3174            case am_string: {
3175                ArgBytePtr string_ptr(
3176                        ArgVal(ArgVal::BytePtr, seg.src.as<ArgWord>().get()));
3177    
3178                comment("insert string");
3179                ASSERT(seg.effectiveSize >= 0);
3180                mov_imm(ARG3, seg.effectiveSize / 8);
3181                mov_arg(ARG2, string_ptr);
3182                load_erl_bits_state(ARG1);
3183    
(gdb) p seg
$1 = {type = 61259, unit = 8, flags = 0, src = {<BeamOpArg> = {type = 0, val = 0}, <No data fields>}, size = {<BeamOpArg> = {type = 73, val = 495}, <No data fields>}, error_info = 0, effectiveSize = -1, 
  action = BscSegment::action::DIRECT}
sekiyama58 commented 1 year ago

So,

        if (seg.size.isSmall() && seg.unit != 0) {
            Uint unsigned_size = seg.size.as<ArgSmall>().getUnsigned();

            if ((unsigned_size >> (sizeof(Eterm) - 1) * 8) != 0) {
            } else {
                /* This multiplication cannot overflow. */
                Uint seg_size = seg.unit * unsigned_size;
                seg.effectiveSize = seg_size;
                num_bits += seg_size;
            }
        }

was needed. With adding this, the problem is disappeared!

$ release/bin/test start
"123test---123456789012------------abc------------------------20000000"
?test---?------------abc------------------------?
...
jhogberg commented 1 year ago

That only fixes things by disabling size estimation, in a manner similar to the top commit of my branch. I'm surprised that making your changed worked, but mine did not.

I'm also very surprised that the seg.effectiveSize assertions fired, as the STORE action that asserts is used if and only if seg.effectiveSize > 0. Something seems very weird with your setup and I don't have a clue why, I'll have to sleep on it.

sekiyama58 commented 1 year ago
        if (seg.size.isSmall() && seg.unit != 0) {
            Uint unsigned_size = seg.size.as<ArgSmall>().getUnsigned();

            if ((unsigned_size >> (sizeof(Eterm) - 1) * 8) != 0) {
                /* Suppress creation of heap binary. */
                estimated_num_bits += (ERL_ONHEAP_BIN_LIMIT + 1) * 8;
            } else {
                /* This multiplication cannot overflow. */
                Uint seg_size = seg.unit * unsigned_size;
                seg.effectiveSize = seg_size;
                num_bits += seg_size; 
                estimated_num_bits += seg_size;
            }
        } else if (seg.unit > 0) {
            auto max = std::min(std::get<1>(getClampedRange(seg.size)),
                                Sint((ERL_ONHEAP_BIN_LIMIT + 1) * 8));
            // estimated_num_bits += max * seg.unit;
            estimated_num_bits += (ERL_ONHEAP_BIN_LIMIT + 1) * 8;  // <= ONLY CHANGED HERE
        } else {
            switch (seg.type) {
            case am_utf8:
            case am_utf16:
            case am_utf32:
                estimated_num_bits += 32;
                break;
            default:
                /* Suppress creation of heap binary. */
                estimated_num_bits += (ERL_ONHEAP_BIN_LIMIT + 1) * 8;
                break;
            }

also worked.

sekiyama58 commented 1 year ago

Something seems very weird with your setup

It might be because I am always testing this with mix release which generates self-contained directory, not with mix run which launch the app in place. But I don't know exactly what difference it makes.

jhogberg commented 1 year ago

The changes you’ve made shouldn’t make a difference (or if they did, the largely equal change I made should also have made a difference).

The problem seems to be either that we’ve triggered some funny undefined behavior that the compiler takes advantage of in a bad way, or the C++ compiler you’re using is buggy. I’ll try to find out which next week.

josevalim commented 1 year ago

I may have found out what it is. @sekiyama58, does it work if you do this in your mix.exs:

  releases: [test: [include_executables_for: [:unix], strip_beams: false]]

?

josevalim commented 1 year ago

@jhogberg FWIW, we were accidentally stripping the "Meta" and "Type" beams from our releases. I am not sure if there is something you can do to detect on your side, or if this can reveal bugs on .beam files from earlier Erlang/OTP versions, but it may be the root cause and it has led to at least another bug here: https://github.com/elixir-lang/elixir/issues/12795

RobPando commented 1 year ago

I may have found out what it is. @sekiyama58, does it work if you do this in your mix.exs:

  releases: [test: [include_executables_for: [:unix], strip_beams: false]]

?

This was the fix for me!! Thanks! 🎉

sekiyama58 commented 1 year ago
  releases: [test: [include_executables_for: [:unix], strip_beams: false]]

I've confirmed that this fixes the issue, thanks!

jhogberg commented 1 year ago

Thanks, a missing type chunk could explain why things behave so strangely. Everything should work without it as they're just hints, but it's not a supported configuration so it's entirely possible that there are more bugs than usual lurking around. We'll look deeper into it after the summer vacation period. :)

josevalim commented 1 year ago

Maybe it is something that would make sense to include in the fuzzer too? :) (cc @RobinMorisset)

RobinMorisset commented 1 year ago

I agree that if there is something which is optional in the BEAM, it is a good target for fuzzing. I am not entirely sure how best to achieve it though, as erlfuzz works at the Erlang source code level, so it would have to be a separate step that randomly erases some of that information in the beam file before feeding it to the VM.

michalmuskala commented 1 year ago

This could potentially work the same as the check that verifies compiler against interpreter - run once with stripped BEAMs and once without and asserting the output is the same. I don't think we'd need to randomly remove chunks - I think it could be enough to strip all optional chunks for the second run.

bjorng commented 1 year ago

I have created #7603 to test loading and executing BEAM files without any "Type" chunk. I have chosen to do it for all test suites for the bit syntax.

They all run successfully, but if I revert the fix in #7581, I get an overrun heap and stack when executing bs_match_bin_stripped_types_SUITE.

That is not absolute proof, but does seem to indicate that the bug fixed by #7581 also caused the crash reported in this issue. Therefore, I'll close this issue.