GC dumps core under OpenBSD

dancrossnyc commented 11 months ago

I'm afraid I haven't had time to really investigate this, but I wanted to jot it down before I forgot about it.

The most recent MLton, compiled with a copy of MLton from 2022, dumps core under OpenBSD; it seems to take a segfault in the GC (which is C code). Here's a quick debug session:

: samudra; lldb build/lib/mlton/mlton-compile --core mlton-compile.core
(lldb) target create "build/lib/mlton/mlton-compile" --core "mlton-compile.core"
Core file '/a/cross/ports/mlton/mlton-compile.core' (x86_64) was loaded.
(lldb) bt
* thread #1, stop reason = signal SIGSEGV
  * frame #0: 0x000006e5e74f8423 mlton-compile`returnAddressToFrameIndex + 19
    frame #1: 0x000006e5e8554a69 mlton-compile`GC_collect + 457
    frame #2: 0x000006e5e8436f04 mlton-compile`L_16797 + 59
(lldb) dis
mlton-compile`returnAddressToFrameIndex:
    0x6e5e74f8410 <+0>:  endbr64
    0x6e5e74f8414 <+4>:  movq   0x1072c15(%rip), %r11     ; __retguard_2419
    0x6e5e74f841b <+11>: xorq   (%rsp), %r11
    0x6e5e74f841f <+15>: pushq  %rbp
    0x6e5e74f8420 <+16>: movq   %rsp, %rbp
->  0x6e5e74f8423 <+19>: movl   -0x4(%rdi), %eax
    0x6e5e74f8426 <+22>: popq   %rbp
    0x6e5e74f8427 <+23>: xorq   (%rsp), %r11
    0x6e5e74f842b <+27>: cmpq   0x1072bfe(%rip), %r11     ; __retguard_2419
    0x6e5e74f8432 <+34>: je     0x43543f                  ; <+47>
    0x6e5e74f8434 <+36>: int3
    0x6e5e74f8435 <+37>: int3
    0x6e5e74f8436 <+38>: int3
    0x6e5e74f8437 <+39>: int3
    0x6e5e74f8438 <+40>: int3
    0x6e5e74f8439 <+41>: int3
    0x6e5e74f843a <+42>: int3
    0x6e5e74f843b <+43>: int3
    0x6e5e74f843c <+44>: int3
    0x6e5e74f843d <+45>: int3
    0x6e5e74f843e <+46>: int3
    0x6e5e74f843f <+47>: retq
(lldb) print/x $rdi
(unsigned long) $0 = 0x000006e5e8436f24
(lldb) x/x 0x000006e5e8436f24
0x6e5e8436f24: 0xf0c58348
(lldb) x/x 0x000006e5e8436f20
0x6e5e8436f20: 0x00000718
(lldb) ^D
: samudra;

This may be a red herring, of course, since SML code doesn't use the hardware stack and so on.

MatthewFluet commented 11 months ago

Can you say more about the steps leading up to the segmentation fault?

One possibility is that you are observing an error in the first-stage bootstrap; the old MLton is used to compile the sources of the new MLton, but because the old MLton is performing the compilation, the resulting executable is using the runtime system from the old MLton. The runtime system interface is unstable, so, unfortunately, you can't just "try with the new MLton runtime".

dancrossnyc commented 11 months ago

Indeed, that is where it is dying.

MatthewFluet commented 11 months ago

Well, you can try to perform the first-stage bootstrap with -debug-runtime true, which might give a more meaningful assertion error than a segmentation fault.

MatthewFluet commented 11 months ago

> The runtime system interface is unstable, so, unfortunately, you can't just "try with the new MLton runtime".

I should revise this statement, because @ii8 has been using this approach effectively (albeit, to work around different issues); see #522 and the https://github.com/ii8/mlton-builds workflow.

dancrossnyc commented 11 months ago

Thanks! I will take a look shortly. Shortly after I wrote my last message, I broke a bone in my foot and haven't had a chance to take a look.

MatthewFluet commented 8 months ago

Should this issue remain open?

dancrossnyc commented 8 months ago

I believe so, yes, and can confirm tomorrow. Sorry, this slipped off my radar.

dancrossnyc commented 7 months ago

Sorry again for the delay, I just circled back to this.

As a first test to validate the hypothesis of interference between new and old runtime interfaces, I decided to try to compile the same version of MLton as is currently installed. So I ran git reset --hard to the commit hash corresponding to the installed version, then ran git clean -fxd to start from a clean repository, and then tried a build.

With this, I continue to see segmentation faults as mentioned in the original bug report. This suggests something more mysterious is going on: if it were the case that the older runtime was interacting poorly with the newer version of the code, one would assume that the impedance mismatch would disappear when the versions are the same (interface stability wouldn't matter since it's the same interface), and the result would work. But that's not the case. Of course, there may be some other differences; system libraries that the runtime links against, perhaps (libc, gmp, etc.). Looking at ldd against the binaries, I do see a difference in libc version.

I decided to try the debug-runtime route to see if I could get a better picture of what's actually happening. Again, this is compiling the same version that's installed, and again, I cleaned the working directory before compiling. I then ran gmake all OLD_MLTON_COMPILE_ARGS='-debug true -debug-runtime true -keep g' WITH_GMP_DIR=/usr/local TAR=gtar CC=clang. Hmm. Clang was unhappy with -Wa,g, which is set when debug is true, so I switched to using the cc wrapper, which presumably knows how to handle such things, but got the same error, so I tried with GCC:

% gmake all OLD_MLTON_COMPILE_ARGS='-debug true -debug-runtime true -keep g' WITH_GMP_DIR=/usr/local TAR=gtar CC=egcc

Same. Ok, reading between the lines a bit here, one sees that this invocation of the compiler is really coming from mlton (the old version) itself; no worries, adding -cc egcc to OLD_MLTON_COMPILE_ARGS fixed that. So,

% gmake all OLD_MLTON_COMPILE_ARGS='-cc egcc -debug true -debug-runtime true -keep g' WITH_GMP_DIR=/usr/local TAR=gtar CC=egcc

Gives me a first-stage build with debugging enabled.

But I'm back to a seg fault with the generated artifact. Curiously, the stock gdb fails here (the stack looks like nonsense, prompting me to think that the debugger believes that the seg fault happened in SML code). lldb does much better, as does gdb installed from the package collection. Let's take a look at the debugger again:

: samudra; egdb build/lib/mlton/mlton-compile
GNU gdb (GDB) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-unknown-openbsd7.5".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from build/lib/mlton/mlton-compile...
(gdb) run
Starting program: /a/cross/ports/mlton/build/lib/mlton/mlton-compile

Program received signal SIGSEGV, Segmentation fault.
returnAddressToFrameIndex (ra=8676845349028) at /opt/local/lib/mlton/include/amd64-main.h:43
43        return *((GC_frameIndex*)(ra - sizeof(GC_frameIndex)));
(gdb) where
#0  returnAddressToFrameIndex (ra=8676845349028) at /opt/local/lib/mlton/include/amd64-main.h:43
#1  0x000007e43c5b87e1 in getFrameIndexFromReturnAddress (s=0x7e43c73e000 <gcState>, ra=8676845349028) at ./gc/frame.c:13
#2  0x000007e43c5b877a in getFrameInfoFromReturnAddress (s=0x7e43c73e000 <gcState>, ra=8676845349028) at ./gc/frame.c:36
#3  0x000007e43c5b854e in foreachObjptrInObject (s=0x7e43c73e000 <gcState>, p=0xf800000008 "", f=0x7f2b3f13b770, skipWeaks=false)
    at ./gc/foreach.c:142
#4  0x000007e43c5b745a in foreachObjptrInRange (s=0x7e43c73e000 <gcState>, front=0xf800000000 "\001", back=0x7e43c73e000 <gcState>,
    f=0x7f2b3f13b770, skipWeaks=false) at ./gc/foreach.c:190
#5  0x000007e43c5b6e5b in invariantForGC (s=0x7e43c73e000 <gcState>) at ./gc/invariant.c:120
#6  0x000007e43c5ab736 in enter (s=0x7e43c73e000 <gcState>) at ./gc/enter_leave.c:23
#7  0x000007e43c5ac4dd in GC_collect (s=0x7e43c73e000 <gcState>, bytesRequested=0, force=false) at ./gc/garbage-collection.c:222
#8  0x000007e43c48e884 in L_16796 () at mlton-compile.1.s:22
#9  0x00007f2b3f13b880 in ?? ()
#10 0x000007e43c5c0f61 in MLton_init (argc=1073741824, argv=0x7e43c4a18f4 <L_425245>, s=0xf8000035b0) at platform.c:20
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) disas
Dump of assembler code for function returnAddressToFrameIndex:
   0x000007e43b5534b0 <+0>:     endbr64
=> 0x000007e43b5534b4 <+4>:     mov    -0x4(%rdi),%eax
   0x000007e43b5534b7 <+7>:     retq
End of assembler dump.
(gdb) print/x ra
$1 = 0x7e43c48e8a4
(gdb) print/x $rdi
$2 = 0x7e43c48e8a4
(gdb) x/x $rdi
0x7e43c48e8a4 <L_16795>:        0xf0c58348
(gdb) x/x $rdi-4
0x7e43c48e8a0 <L_16796+87>:     0x00000718
(gdb) quit
A debugging session is active.

        Inferior 1 [process 37787] will be killed.

Quit anyway? (y or n) y
: samudra;

This looks pretty normal; clearly the movl can succeed, above. And oh, hey, waitaminute...what's this endbr64 doing here? Could we be doing indirect branch tracking? Hmm.... Perhaps we are: https://undeadly.org/cgi?action=article;sid=20230714121907

Perhaps OpenBSD traps the #CP and reflects it as a SIGSEGV? The old MLton does not have any endbr64 instructions, but the new one sure does:

: samudra; objdump -d /opt/local/lib/mlton/mlton-compile | grep endbr64 | wc -l
       0
: samudra; objdump -d build/lib/mlton/mlton-compile | grep endbr64 | wc -l
     255
: samudra;

Let's try building with the link option to remove IBT enforcement, as specified in the undeadly.org article and see what happens. Hmm, same error; this is a red herring. Indeed, looking at the OpenBSD kernel code, #CP is reflected into SIGILL, not SIGSEGV: trap sources for SIGSEGV are stack faults, page faults, and GPFs. I can't see how any of these are at play, unless the text segment is currently mapped non-executable or something. To see which trap is being generated, I need a debug kernel.

Hmm, before doing that, looking at what @ii8 was doing as mentioned above, I wonder: what happens if I try to use the C codegen option? That seems to get me a MLton that runs, but now the segfault has shifted to mllex (and presumably other binaries built by the new compiler). Curiously, the fault is in the same function as earlier; this suggests that if I use the C codegen I might get working binaries, but still doesn't explain why other sorts of binaries dump core. And indeed, running with -codegen c does seem to produce a working mllex binary (and mlyacc etc. too).

I've got to get back to my day job now, but I'll try to circle back in a bit.

dancrossnyc commented 7 months ago

I found the issue.

The text segment coming out of the compiler is execute-only (https://marc.info/?l=openbsd-tech&m=167374666324119&w=2) and we're taking a page fault trying to read an address from said text segment (note that the faulting instruction all along has been a load from a seemingly innocuous address). Of course, this doesn't fault in the debugger since the debugger must (necessarily) remap the text segment to be readable. Btw, it helped that every program faulted in a similar way, meaning I could test with something trivial, like a "hello world".
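The failure mode described here can be reproduced outside MLton. What follows is only a minimal sketch, assuming a POSIX system: probe_exec_only_read is a hypothetical helper, not part of MLton or OpenBSD. It maps a page, reduces it to PROT_EXEC only, and has a child process load one byte from it. Whether that load faults depends on the operating system and hardware; on platforms that cannot enforce execute-only mappings the read simply succeeds.

```c
#define _DEFAULT_SOURCE 1 /* expose MAP_ANON on glibc under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: report what happens when a process loads a byte
 * from a page mapped with PROT_EXEC only.
 * Returns 1 if the load succeeded, 0 if the child died on a signal,
 * and -1 if the probe itself could not be set up.
 */
int probe_exec_only_read(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    unsigned char *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANON, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    p[0] = 0xc3; /* x86 `ret`; any byte would do, it is only read */

    /* Drop read and write access, keeping execute only. */
    if (mprotect(p, (size_t)page, PROT_EXEC) != 0) {
        munmap(p, (size_t)page);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        munmap(p, (size_t)page);
        return -1;
    }
    if (pid == 0) {
        /* Child: a plain data load from the execute-only page. */
        volatile unsigned char byte = *(volatile unsigned char *)p;
        (void)byte;
        _exit(0);
    }

    int status = 0;
    int result = -1;
    if (waitpid(pid, &status, 0) == pid) {
        if (WIFSIGNALED(status))
            result = 0; /* the load faulted, e.g. with SIGSEGV */
        else if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            result = 1; /* the page was readable after all */
    }
    munmap(p, (size_t)page);
    return result;
}
```

A result of 0 corresponds to the behavior seen in this issue: the load in returnAddressToFrameIndex is an ordinary data read, so it faults as soon as the page it targets is execute-only.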

I went ahead and built a debugging kernel (really, just a normal kernel but with a debugging print statement added in the SIGSEGV delivery path) that let me a) confirm that SML programs compiled with MLton are indeed taking page faults, and b) capture the faulting address. Here's an example of the output from a faulting SML program, as extracted from dmesg:

trap 6 code 25 rip c975ac9ff64 cs 23 rflags 10203 cr2 c975aca3a20 cpl 0
curproc 0xffff80006bc63ab8
pid 8197

Note the PC and %cr2. A quick debugging session gives a few more clues:
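The code field is worth decoding too. The sketch below assumes that the debugging print emits code in hexadecimal (the other fields on that line clearly are), which is an assumption about a locally added print statement and not something the output states. The bit layout is the standard x86 page-fault error code; describe_pf_code is a hypothetical helper written for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Bits of the x86 page-fault error code pushed by the CPU. */
enum {
    PF_PRESENT  = 1u << 0, /* page was present: a protection fault */
    PF_WRITE    = 1u << 1, /* access was a write (clear means read) */
    PF_USER     = 1u << 2, /* access came from user mode */
    PF_RESERVED = 1u << 3, /* reserved bit set in a paging entry */
    PF_IFETCH   = 1u << 4, /* access was an instruction fetch */
    PF_PKEY     = 1u << 5  /* access was blocked by a protection key */
};

/* Hypothetical helper: render an error code as text into buf.
 * Returns the length snprintf would have written. */
int describe_pf_code(uint32_t code, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%s %s %s%s%s",
        (code & PF_USER) ? "user" : "kernel",
        (code & PF_IFETCH) ? "ifetch"
                           : (code & PF_WRITE) ? "write" : "read",
        (code & PF_PRESENT) ? "present" : "not-present",
        (code & PF_PKEY) ? " pkey" : "",
        (code & PF_RESERVED) ? " rsvd" : "");
}
```

Under that assumption, describe_pf_code(0x25, ...) yields "user read present pkey": a user-mode data read of a mapped page, refused by a protection key. That would be consistent with a read from an execute-only text segment and not with a missing mapping.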

(gdb) bt
#0  0x00000c975ac9ff64 in returnAddressToFrameIndex ()
#1  0x00000c975acad229 in GC_collect ()
#2  0x00000c975aca3a04 in L_444 ()
#3  0x000071bcee63b158 in ?? ()
#4  0x00000c99f4bd2a70 in ?? () from /usr/libexec/ld.so
#5  0x0000000000000000 in ?? ()
(gdb) print $rdi-4
$5 = 13844202797600
(gdb) print/x $rdi-4
$6 = 0xc975aca3a20
(gdb) bt
#0  0x00000c975ac9ff64 in returnAddressToFrameIndex ()
#1  0x00000c975acad229 in GC_collect ()
#2  0x00000c975aca3a04 in L_444 ()
#3  0x000071bcee63b158 in ?? ()
#4  0x00000c99f4bd2a70 in ?? () from /usr/libexec/ld.so
#5  0x0000000000000000 in ?? ()
(gdb) print/x $rdi-4
$7 = 0xc975aca3a20
(gdb) x/x 0xc975aca3a20
0xc975aca3a20 <L_444+87>:       0x0000001e
(gdb) disas
Dump of assembler code for function returnAddressToFrameIndex:
   0x00000c975ac9ff60 <+0>:     endbr64
=> 0x00000c975ac9ff64 <+4>:     mov    -0x4(%rdi),%eax
   0x00000c975ac9ff67 <+7>:     retq
End of assembler dump.
(gdb)

By lowering the kern.securelevel sysctl on this single-user machine, I was able to use procmap to read mappings of the faulting program. The output is a bit voluminous, but the relevant line is here:

00000c975ac9f000-00000c975acb6fff      96k 0000000000007000 --x--Ip- (rwx) 1/0/0 04:09 35815570 - /a/cross/hello [0xfffffd8e3356fcf8]

Note the "--x"; this denotes a segment that is "xonly". Clearly, the faulting address we identified fell under this region.
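The containment check can be made mechanical. This is a sketch only: parse_mapping and in_xonly_mapping are hypothetical helpers, and the parser assumes a line shaped exactly like the one quoted above (start-end, size, offset, then the protection string); it does not claim to handle the procmap output format in general.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/* One parsed mapping: an inclusive address range plus the first three
 * protection characters (for example "--x"). */
struct mapping {
    uint64_t start;
    uint64_t end;   /* inclusive, as printed */
    char prot[4];
};

/* Hypothetical parser for a line of the form
 * "<start>-<end> <size> <offset> <prot>...". Returns 1 on success. */
int parse_mapping(const char *line, struct mapping *m)
{
    assert(line != NULL && m != NULL);
    /* %*s skips the size and offset columns; %3c grabs "rwx"-style bits. */
    int n = sscanf(line, "%" SCNx64 "-%" SCNx64 " %*s %*s %3c",
                   &m->start, &m->end, m->prot);
    m->prot[3] = '\0';
    return n == 3;
}

/* True when addr lies inside a mapping whose protection is exactly "--x". */
int in_xonly_mapping(const struct mapping *m, uint64_t addr)
{
    assert(m != NULL);
    return addr >= m->start && addr <= m->end &&
           m->prot[0] == '-' && m->prot[1] == '-' && m->prot[2] == 'x';
}
```

Fed the line above, the range parses as 0xc975ac9f000 through 0xc975acb6fff (0x18000 bytes, matching the 96k column), and in_xonly_mapping is true for the %cr2 value 0xc975aca3a20, which is the same number gdb printed in decimal as 13844202797600.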

Ok. So how do we fix it? The linker supports a --no-execute-only option to disable the xonly behavior; let's try building with that:

: samudra; mlton -link-opt -Wl,--no-execute-only hello.sml
: samudra; ./hello
Hello, World!
: samudra;

Success. Huzzah!

MatthewFluet commented 7 months ago

Yes, the native codegens store GC information in the text segment, in the memory immediately preceding a return address. The C and LLVM codegens use a different mechanism, and so wouldn't trigger that behavior.
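To make that layout concrete, here is a small model of the lookup performed by returnAddressToFrameIndex in the backtraces above. It is a sketch under stated assumptions, not the runtime's code: frame_index_t stands in for GC_frameIndex and is assumed to be 32 bits wide (the faulting instruction is a 4-byte movl), and the text segment is modeled as a plain byte array instead of real executable memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t frame_index_t; /* stand-in for the runtime's GC_frameIndex */

/*
 * Model of the lookup: `text` is a byte image of generated code and
 * `ra_off` is the offset of a return address within it. The frame index
 * is the little-endian 32-bit word stored just before that offset.
 */
frame_index_t frame_index_before(const uint8_t *text, size_t ra_off)
{
    assert(text != NULL && ra_off >= sizeof(frame_index_t));
    const uint8_t *p = text + ra_off - sizeof(frame_index_t);
    return (frame_index_t)p[0] | ((frame_index_t)p[1] << 8) |
           ((frame_index_t)p[2] << 16) | ((frame_index_t)p[3] << 24);
}
```

With the bytes seen in the first debug session (18 07 00 00 immediately before the return address, then 48 83 c5 f0 at it), the function returns 0x718. In the real program the same 4-byte read targets the text segment itself, which is why it depends on that segment being readable.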

dancrossnyc commented 7 months ago

> Yes, the native codegens store GC information in the text segment, in the memory immediately preceding a return address. The C and LLVM codegens use a different mechanism, and so wouldn't trigger that behavior.

It may be worth investigating how to move that information into, say, a read-only data segment. But that feels like a much bigger lift, so I just sent a PR that disables the "execute only" behavior via -target-link-opt in mlton-script.

MLton / mlton

GC dumps core under OpenBSD #538