Closed dancrossnyc closed 7 months ago
Can you say more about the steps leading up to the segmentation fault?
One possibility is that you are observing an error in the first-stage bootstrap; the old MLton is used to compile the sources of the new MLton, but because the old MLton is performing the compilation, the resulting executable is using the runtime system from the old MLton. The runtime system interface is unstable, so, unfortunately, you can't just "try with the new MLton runtime".
Indeed, that is where it is dying.
Well, you can try to perform the first-stage bootstrap with -debug-runtime true, which might give a more meaningful assertion error than a segmentation fault.
> The runtime system interface is unstable, so, unfortunately, you can't just "try with the new MLton runtime".
I should revise this statement, because @ii8 has been using this approach effectively (albeit, to work around different issues); see #522 and the https://github.com/ii8/mlton-builds workflow.
Thanks! I will take a look shortly. Shortly after I wrote my last message, I broke a bone in my foot and haven't had a chance to take a look.
Should this issue remain open?
I believe so, yes, and can confirm tomorrow. Sorry, this slipped off my radar.
Sorry again for the delay, I just circled back to this.
As a first test to validate the hypothesis of interference between new and old runtime interfaces, I decided to try to compile the same version of MLton as is currently installed. So I ran git reset --hard to the commit hash corresponding to the installed version, ran git clean -fxd to start from a clean repository, and then tried a build.
With this, I continue to see segmentation faults as mentioned in the original bug report. This suggests something more mysterious is going on; if it were the case that the older runtime was interacting poorly with the newer version of the code, one would assume that with identical versions the impedance mismatch would disappear (interface stability wouldn't matter since it's the same interface) and the result would work, but that's not the case. Of course, there may be some other differences; system libraries that the runtime links against, perhaps (libc, gmp, etc.). Looking at ldd against the binaries, I do see a difference in libc version.
I decided to try the debug-runtime route to see if I could get a better picture of what's actually happening. Again, this is compiling the same version that's installed, and again, I cleaned the working directory before compiling. I then ran gmake all OLD_MLTON_COMPILE_ARGS='-debug true -debug-runtime true -keep g' WITH_GMP_DIR=/usr/local TAR=gtar CC=clang. Hmm. Clang was unhappy with -Wa,g, which is set when debug is true, so I switched to using the cc wrapper, which presumably knows how to handle such things, but got the same error, so I tried with GCC:
% gmake all OLD_MLTON_COMPILE_ARGS='-debug true -debug-runtime true -keep g' WITH_GMP_DIR=/usr/local TAR=gtar CC=egcc
Same. OK, reading through the lines a bit here, one sees that this invocation of the compiler is really coming from mlton (the old version) itself; no worries, adding -cc egcc to OLD_MLTON_COMPILE_ARGS fixed that. So,
% gmake all OLD_MLTON_COMPILE_ARGS='-cc egcc -debug true -debug-runtime true -keep g' WITH_GMP_DIR=/usr/local TAR=gtar CC=egcc
Gives me a first-stage build with debugging enabled.
But I'm back to a seg fault with the generated artifact. Curiously, the stock gdb fails here (the stack looks like nonsense, prompting me to think that the debugger believes that the seg fault happened in SML code). lldb does much better, as does gdb installed from the package collection. Let's take a look at the debugger again:
: samudra; egdb build/lib/mlton/mlton-compile
GNU gdb (GDB) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-unknown-openbsd7.5".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from build/lib/mlton/mlton-compile...
(gdb) run
Starting program: /a/cross/ports/mlton/build/lib/mlton/mlton-compile
Program received signal SIGSEGV, Segmentation fault.
returnAddressToFrameIndex (ra=8676845349028) at /opt/local/lib/mlton/include/amd64-main.h:43
43 return *((GC_frameIndex*)(ra - sizeof(GC_frameIndex)));
(gdb) where
#0 returnAddressToFrameIndex (ra=8676845349028) at /opt/local/lib/mlton/include/amd64-main.h:43
#1 0x000007e43c5b87e1 in getFrameIndexFromReturnAddress (s=0x7e43c73e000 <gcState>, ra=8676845349028) at ./gc/frame.c:13
#2 0x000007e43c5b877a in getFrameInfoFromReturnAddress (s=0x7e43c73e000 <gcState>, ra=8676845349028) at ./gc/frame.c:36
#3 0x000007e43c5b854e in foreachObjptrInObject (s=0x7e43c73e000 <gcState>, p=0xf800000008 "", f=0x7f2b3f13b770, skipWeaks=false)
at ./gc/foreach.c:142
#4 0x000007e43c5b745a in foreachObjptrInRange (s=0x7e43c73e000 <gcState>, front=0xf800000000 "\001", back=0x7e43c73e000 <gcState>,
f=0x7f2b3f13b770, skipWeaks=false) at ./gc/foreach.c:190
#5 0x000007e43c5b6e5b in invariantForGC (s=0x7e43c73e000 <gcState>) at ./gc/invariant.c:120
#6 0x000007e43c5ab736 in enter (s=0x7e43c73e000 <gcState>) at ./gc/enter_leave.c:23
#7 0x000007e43c5ac4dd in GC_collect (s=0x7e43c73e000 <gcState>, bytesRequested=0, force=false) at ./gc/garbage-collection.c:222
#8 0x000007e43c48e884 in L_16796 () at mlton-compile.1.s:22
#9 0x00007f2b3f13b880 in ?? ()
#10 0x000007e43c5c0f61 in MLton_init (argc=1073741824, argv=0x7e43c4a18f4 <L_425245>, s=0xf8000035b0) at platform.c:20
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) disas
Dump of assembler code for function returnAddressToFrameIndex:
0x000007e43b5534b0 <+0>: endbr64
=> 0x000007e43b5534b4 <+4>: mov -0x4(%rdi),%eax
0x000007e43b5534b7 <+7>: retq
End of assembler dump.
(gdb) print/x ra
$1 = 0x7e43c48e8a4
(gdb) print/x $rdi
$2 = 0x7e43c48e8a4
(gdb) x/x $rdi
0x7e43c48e8a4 <L_16795>: 0xf0c58348
(gdb) x/x $rdi-4
0x7e43c48e8a0 <L_16796+87>: 0x00000718
(gdb) quit
A debugging session is active.
Inferior 1 [process 37787] will be killed.
Quit anyway? (y or n) y
: samudra;
This looks pretty normal; clearly the movl can succeed, above. And oh, hey, waitaminute...what's this endbr64 doing here? Could we be doing indirect branch tracking? Hmm.... Perhaps we are: https://undeadly.org/cgi?action=article;sid=20230714121907
Perhaps OpenBSD traps the #CP and reflects it as a SIGSEGV? The old MLton does not have any endbr64 instructions, but the new one sure does:
: samudra; objdump -d /opt/local/lib/mlton/mlton-compile | grep endbr64 | wc -l
0
: samudra; objdump -d build/lib/mlton/mlton-compile | grep endbr64 | wc -l
255
: samudra;
Let's try building with the link option to remove IBT enforcement, as specified in the undeadly.org article, and see what happens. Hmm, same error; this is a red herring. Indeed, looking at the OpenBSD kernel code, #CP is reflected into SIGILL, not SIGSEGV: trap sources for SIGSEGV are stack faults, page faults, and GPFs. I can't see how any of these are at play, unless the text segment is currently mapped non-executable or something. To see which trap is being generated, I need a debug kernel.
Hmm, before doing that, looking at what @ii8 was doing as mentioned above, I wonder: what happens if I try to use the c codegen option? That seems to get me a MLton that runs, but now the segfault has shifted to mllex (and presumably other binaries built by the new compiler). Curiously, the fault is in the same function as earlier; this suggests that if I use the C codegen I might get working binaries, but still doesn't explain why other sorts of binaries dump core. And indeed, running with -codegen c does seem to produce a working mllex binary (and mlyacc etc. too).
I've got to get back to my day job now, but I'll try to circle back in a bit.
I found the issue.
The text segment coming out of the compiler is execute-only (https://marc.info/?l=openbsd-tech&m=167374666324119&w=2) and we're taking a page fault trying to read an address from said text segment (note that the faulting instruction all along has been a load from a seemingly innocuous address). Of course, this doesn't fault in the debugger since the debugger must (necessarily) remap the text segment to be readable. Btw, it helped that every program faulted in a similar way, meaning I could test with something trivial, like a "hello world".
I went ahead and built a debugging kernel (really, just a normal kernel but with a debugging print statement added in the SIGSEGV delivery path) that let me a) confirm that SML programs compiled with MLton are indeed taking page faults, and b) capture the faulting address. Here's an example of the output from a faulting SML program, as extracted from dmesg:
trap 6 code 25 rip c975ac9ff64 cs 23 rflags 10203 cr2 c975aca3a20 cpl 0
curproc 0xffff80006bc63ab8
pid 8197
Note the PC and %cr2. A quick debugging session gives a few more clues:
(gdb) bt
#0 0x00000c975ac9ff64 in returnAddressToFrameIndex ()
#1 0x00000c975acad229 in GC_collect ()
#2 0x00000c975aca3a04 in L_444 ()
#3 0x000071bcee63b158 in ?? ()
#4 0x00000c99f4bd2a70 in ?? () from /usr/libexec/ld.so
#5 0x0000000000000000 in ?? ()
(gdb) print $rdi-4
$5 = 13844202797600
(gdb) print/x $rdi-4
$6 = 0xc975aca3a20
(gdb) bt
#0 0x00000c975ac9ff64 in returnAddressToFrameIndex ()
#1 0x00000c975acad229 in GC_collect ()
#2 0x00000c975aca3a04 in L_444 ()
#3 0x000071bcee63b158 in ?? ()
#4 0x00000c99f4bd2a70 in ?? () from /usr/libexec/ld.so
#5 0x0000000000000000 in ?? ()
(gdb) print/x $rdi-4
$7 = 0xc975aca3a20
(gdb) x/x 0xc975aca3a20
0xc975aca3a20 <L_444+87>: 0x0000001e
(gdb) disas
Dump of assembler code for function returnAddressToFrameIndex:
0x00000c975ac9ff60 <+0>: endbr64
=> 0x00000c975ac9ff64 <+4>: mov -0x4(%rdi),%eax
0x00000c975ac9ff67 <+7>: retq
End of assembler dump.
(gdb)
By lowering the kern.securelevel sysctl on this single-user machine, I was able to use procmap to read mappings of the faulting program. The output is a bit voluminous, but the relevant line is here:
00000c975ac9f000-00000c975acb6fff 96k 0000000000007000 --x--Ip- (rwx) 1/0/0 04:09 35815570 - /a/cross/hello [0xfffffd8e3356fcf8]
Note the "--x"; this denotes a segment that is "xonly". Clearly, the faulting address we identified fell under this region.
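For anyone who wants to double-check that claim, here is a minimal sketch of the arithmetic. The constants are copied from the dmesg line, the gdb session, and the procmap line above; the helper names (in_mapping, mapping_size_kib) are made up for illustration and are not part of any tool.

```python
# Values copied from the dmesg, gdb, and procmap output in this thread.
MAP_START = 0x00000C975AC9F000  # start of the "--x" mapping
MAP_END = 0x00000C975ACB6FFF    # end of the mapping (inclusive, as printed)
RIP = 0xC975AC9FF64             # faulting instruction address (rip)
CR2 = 0xC975ACA3A20             # faulting data address (%cr2, i.e. $rdi-4)


def in_mapping(addr, start=MAP_START, end=MAP_END):
    """Return True if addr lies inside the inclusive range [start, end]."""
    return start <= addr <= end


def mapping_size_kib(start=MAP_START, end=MAP_END):
    """Size of the inclusive range in KiB."""
    return (end - start + 1) // 1024


# Both the instruction doing the load and the address it reads from
# sit inside the same execute-only mapping.
print("rip in mapping:", in_mapping(RIP))
print("cr2 in mapping:", in_mapping(CR2))
print("mapping size (KiB):", mapping_size_kib())
```

The computed size agrees with the 96k that procmap reports, and the decimal value gdb printed for $rdi-4 (13844202797600) is the same address as %cr2.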
OK. So how do we fix it? The linker supports a --no-execute-only option to disable the xonly behavior; let's try building with that:
: samudra; mlton -link-opt -Wl,--no-execute-only hello.sml
: samudra; ./hello
Hello, World!
: samudra;
Success. Huzzah!
Yes, the native codegens store GC information in the text segment, in the memory immediately preceding a return address. The C and LLVM codegens use a different mechanism, and so wouldn't trigger that behavior.
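To make that layout concrete, here is a rough model of the lookup that returnAddressToFrameIndex performs (amd64-main.h line 43 in the backtrace above). It is only a sketch: the "text segment" is a Python bytes object, the offsets are invented, and the 4-byte little-endian width of the frame index is an assumption inferred from the 32-bit load (mov -0x4(%rdi),%eax) in the disassembly. The sample values 0x718 and 0xf0c58348 are the words gdb showed at $rdi-4 and $rdi in the first session.

```python
import struct

# Assumption: GC_frameIndex is a 4-byte little-endian integer, inferred
# from the 32-bit load in the disassembly.
FRAME_INDEX_SIZE = 4


def return_address_to_frame_index(text, ra):
    """Read the frame index stored immediately before the return address."""
    (index,) = struct.unpack_from("<I", text, ra - FRAME_INDEX_SIZE)
    return index


# Toy text segment: padding, then the frame index, then the code that
# the return address points at.
text = (
    b"\x90\x90\x90\x90"                # padding (hypothetical)
    + struct.pack("<I", 0x718)         # frame index, as seen at $rdi-4
    + struct.pack("<I", 0xF0C58348)    # code word, as seen at $rdi
)
ra = 8  # offset of the code word, i.e. the "return address" in this model

print(hex(return_address_to_frame_index(text, ra)))
```

The point of the model is that the lookup is a data read from the same region that holds code, which is exactly what an execute-only mapping forbids.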
> Yes, the native codegens store GC information in the text segment, in the memory immediately preceding a return address. The C and LLVM codegens use a different mechanism, and so wouldn't trigger that behavior.
It may be worth investigating how to move that information into, say, a read-only data segment. But that feels like a much bigger lift, so I just sent a PR that disables the "execute only" behavior via -target-link-opt in mlton-script.
I'm afraid I haven't had time to really investigate this, but I wanted to jot it down before I forgot about it.
The most recent mlton, compiled from a copy from 2022, dumps core under OpenBSD; it seems to take a segfault in the GC (which is C code). Here's a quick debug session:
This may be a red herring, of course, since SML code doesn't use the hardware stack and so on.