Segfaults with 128-bit integers on Windows

tahina-pro commented 4 years ago

With my colleague @msprotz, we observed segfaults related to 128-bit integers in some Windows OCaml programs using ocaml-stdint.

It turns out that even the test executable produced by make test/stdint_test segfaults, with no test running, as we show below.

We are working on Windows 10 version 1909 (10.0.18363.720) with Cygwin 3.1.4-1, mingw64-x86_64-gcc 9.2.0, and OCaml 4.09.0 installed via OCaml and OPaM for Windows. If you want to quickly reproduce our config, you can pull a complete Docker image pulling the latest ocaml-stdint master (0785788f425bd06635608fe111a851ec86e095bf) and building ocaml-stdint and its test suite. You can also rebuild the Docker image yourself using this Dockerfile.txt.

Then, below is what we are observing:

ContainerAdministrator@cff3016f7f7e /cygdrive/c/test/ocaml-stdint
$ gdb tests/stdint_test.exe
GNU gdb (GDB) (Cygwin 8.2.1-1) 8.2.1

(gdb) run
Starting program: /cygdrive/c/test/ocaml-stdint/tests/stdint_test.exe
[New Thread 5000.0x33c4]
[New Thread 5000.0x32b0]
[New Thread 5000.0x5110]
[New Thread 5000.0x5710]

Thread 1 received signal SIGSEGV, Segmentation fault.
copy_uint128 (i=<optimized out>) at uint128_stubs.c:188
188       Uint128_val(res) = i;
(gdb) backtrace
#0  copy_uint128 (i=<optimized out>) at uint128_stubs.c:188
#1  0x0000000000494e65 in uint128_of_int (v=<optimized out>) at uint128_conv.c:34
#2  0x000000000042f2e6 in camlStdint__entry ()
#3  0x0000000000000001 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

As a workaround, my colleague @msprotz proposed to recompile ocaml-stdint with gcc optimizations disabled, which we did in tahina-pro/ocaml-stdint@99ac1fa234db8d57c4a43837ae71fda51d8d1470, and there we no longer observe these segfaults.

rixed commented 4 years ago

Do you know if your version of mingw supports the __int128_t type? Windows docker image cannot run on Linux so I cannot repro your exact issue, but I also have a segfault when running the tests without HAVE_INT128, so maybe that's related?

tahina-pro commented 4 years ago

Yes, in a container running from the Docker image, I can successfully compile and run a __int128_t function that computes 2^(2^n):

docker run -i -t tahina/20200318ocamlstdint

ContainerAdministrator@52e7a61f5bdc /cygdrive/c/test/ocaml-stdint
$ cat <<EOF >a.c
#include <stdint.h>
#include <assert.h>
__int128_t two_pow_two_pow_minus_one(int count) {
  __int128_t res = 2;
  for (int i = 0; i < count; ++i) {
    res *= res;
  }
  assert (res > 0);
  return res - 1;
}
EOF

$ x86_64-w64-mingw32-gcc.exe -Wall -c a.c

$ cat <<EOF >test.c
#include <inttypes.h>
#include <stdio.h>
__int128_t two_pow_two_pow_minus_one(int count);
int main(void) {
  uint64_t res = (uint64_t) (two_pow_two_pow_minus_one(6));
  printf("%" PRIu64 "\n", res);
  return 0;
}
EOF

$ x86_64-w64-mingw32-gcc.exe -Wall -c test.c

$ x86_64-w64-mingw32-gcc.exe -Wall -o test.exe a.o test.o

$ ./test.exe
18446744073709551615

$ echo $?
0

hcarty commented 4 years ago

If it helps - using an older 4.08.0 installation, also using fdopen's opam distribution, I don't get a segfault. On 4.10.0 I do get it.

rseymour commented 4 years ago

Tried compiling the above test C progs with -O3 and they compiled/ran without error. My prior test w/ an ifdef check also worked.

msprotz commented 4 years ago

I should mention that the error started happening after we upgraded GCC (specifically: mingw) to version 9. We were previously on version 7 and there was no trouble.

rseymour commented 4 years ago

Thread 1 received signal SIGSEGV, Segmentation fault.
0x000000000044e87d in copy_int128 (i=<optimized out>) at int128_stubs.c:146
146       Int128_val(res) = i;

This gdb output seems to be saying that GCC optimized out the variable i in the copy_int128() function. I tried throwing a volatile next to it in the args to the function, but it didn't seem to stop it. Feels like gcc at -O2 shouldn't optimize out the right hand side of an assignment, but who knows.

rseymour commented 4 years ago

I found perl folks wrap code with a lot of guards to protect it from new GCC... https://github.com/perl11/cperl/commit/c0b59c82b053a3434cf3c20d36d1a3406421bc3d (this was a failed(?) commit to add even more -O0 style guards to the existing ones)

rgrinberg commented 3 years ago

@rseymour do you think we should such guards as well?

rseymour commented 3 years ago

We didn't end up going with them as the other fix worked, but I feel like it's the "right" way to do it until GCC fixes itself.

msprotz commented 3 years ago

Just coming back to this issue after a long while...

Are we positive that this is a GCC bug? I can't tell from the previous discussion what the "misbehavior" might be. The fact that we observe the issue at O2 and above doesn't necessarily reflect a GCC issue; it could be that the bindings code does something illegal per the C standard, and that GCC only becomes "smart" at O2.

Maybe the issue is this: when HAVE_INT128 is absent, some unspecified issue observable only at -O2 and above causes a segfault.

Then there are two fixes:

make HAVE_INT128 enabled on windows/mingw
fix the underlying issue.

Is this an accurate summary?

Would you be open to the first fix?

FYI, we're still running our entire infrastructure with the workaround outlined in this bug: we compile the test program, see if it segfaults, and if it does, recompile ocaml-stdint with -O0 to prevent the error from happening.

Thanks!

Jonathan