SWI-Prolog / issues

Test failures on Alpine Linux #103

Closed brebs-gh closed 2 years ago

brebs-gh commented 2 years ago

Hi, the new (draft) swi-prolog package for Alpine Linux (which uses Musl) shows some architecture-specific tests failing:

(In that merge request link, click on the green tick near "merge request pipeline" to show a drop-down of the architectures, then click on each architecture for the build log.)

The mqi:mqi test is mentioned on Discourse.

JanWielemaker commented 2 years ago

Please pass --output-on-failure to ctest so we get some idea about what is going wrong. I got an update from @EricZinda that might address the mqi failure.

brebs-gh commented 2 years ago

The ctest details can be seen on the arch-specific build logs, e.g. mqi on x86 is missing python3.

python3 is now added.

JanWielemaker commented 2 years ago

Created an issue for protobufs at https://github.com/SWI-Prolog/contrib-protobufs/issues/14. I guess you can deal with the python3 dependency by adding it to the build requirements? Not sure what to make of the core test failure for s390x. The absence of an error report suggests a C stack overflow. I know that one of these tests also fails when using AddressSanitizer with default stack sizes. Would it be easy to use ulimit to raise the limit significantly?

ktprograms commented 2 years ago

@JanWielemaker The download link for the SWI-Prolog source isn't working anymore (I'm using https://www.swi-prolog.org/download/devel/src/swipl-8.5.0.tar.gz) and the GitHub tarballs don't seem to have the git submodules included.

JanWielemaker commented 2 years ago

Oops. My cleanup of old source archives was a bit too radical :cry: They are back again. It might take a little while for the CDN to pick this up. eu.swi-prolog.org should have it immediately. Note there is now 8.5.1 :smile:

ktprograms commented 2 years ago

@JanWielemaker I'm doing the best I can to figure out the s390x segfault, but QEMU doesn't implement ptrace. When reading the core dump with gdb and running bt, it says Backtrace stopped: previous frame identical to this frame (corrupt stack?), so I guess you might be right about stack problems.

JanWielemaker commented 2 years ago

Thanks so far.

That is the nasty thing about stack overflows :cry: AFAIK there is no portable way to deal with these elegantly. Maybe I should give non-portable options a go ... Anyway, why not simply use ulimit to raise the limit and see whether that helps? If the crash vanishes we know this was the problem and I propose to simply raise the limit for running the tests if this is easy.

If that doesn't help, do you have an easy-to-follow recipe to get the Qemu environment that you're using running?

ktprograms commented 2 years ago

Anyway, why not simply use ulimit to raise the limit and see whether that helps?

I'm trying to (with ulimit -s 16384) but for some reason it's not changing. I think it might be a QEMU bug. I'll let you know when I figure out how to increase the stack size if that was the fix.
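
A minimal sketch for checking whether a new stack limit actually took effect, assuming a shell whose ulimit builtin accepts -S, -H and -s (bash does; other shells may differ). The function name is made up for illustration. The limit is changed inside a subshell so the calling shell keeps its own setting:

```shell
# Print the soft and hard stack limits of the current shell
# (in KiB, or "unlimited").
show_stack_limits() {
    echo "soft: $(ulimit -S -s)"
    echo "hard: $(ulimit -H -s)"
}

echo "before:"
show_stack_limits

(
    # Raising the soft limit fails if it would exceed the hard limit.
    ulimit -S -s 16384 2>/dev/null || echo "could not set soft limit to 16384"
    echo "inside subshell:"
    show_stack_limits
    # The tests would be started here so that they inherit the limit.
)
```

If the "inside subshell" values match the "before" values, the limit was not changed and the test run would still use the old stack size.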

If that doesn't help, do you have an easy-to-follow recipe to get the Qemu environment that you're using running?

I'm basically following https://wiki.alpinelinux.org/wiki/How_to_make_a_cross_architecture_chroot (which I wrote), then building the alpine package after setting up the build environment as shown in https://wiki.alpinelinux.org/wiki/Creating_an_Alpine_package.

JanWielemaker commented 2 years ago

Thanks. I'm afraid I'm a bit too busy to dive into this quickly. If nothing changes I'll give qemu a shot next week.

ktprograms commented 2 years ago

In the end I used QEMU full system emulation, and even setting ulimit -s unlimited still causes a segfault.

If you need more info (gdb, etc), I should be able to provide it.

JanWielemaker commented 2 years ago

Unexpected. Maybe we should first figure out which test is to blame. You do that as follows:

  1. Run ctest -V -R core to get the commandline executed. You can interrupt the test itself immediately as we only want the commandline.
  2. Copy/paste the commandline and remove the -q to make it more verbose. That prints the executed test files, the unit test bodies and a "." for each test that succeeded. Identify the failing test file.
  3. The test file is in ../src/Tests/core. Find the test unit and count the dots until you have the test.
  4. Run src/swipl ../src/Tests/core/<file>
  5. At the prompt, run ?- run_tests(unit:test).

Hopefully this reproduces. That should give some hints and make reproducing a lot quicker. You can then run under gdb. You may also apply SWI-Prolog/swipl-devel@a375c6d0210d4ed62299815d96b79594820578ce, which may make the crash more verbose.
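
For step 1, a sketch that prints the test command line without running the test, assuming that ctest's -N (show only) option combined with -V lists each test's command, as it does in recent CMake releases. The function name and the default build directory are hypothetical:

```shell
# Print the command line ctest would use for the "core" test.
# $1: CMake build directory (defaults to the current directory).
show_core_test_command() {
    build_dir="${1:-.}"
    if ! command -v ctest >/dev/null 2>&1; then
        echo "ctest not found" >&2
        return 1
    fi
    if [ ! -f "$build_dir/CTestTestfile.cmake" ]; then
        echo "no CTestTestfile.cmake in $build_dir" >&2
        return 1
    fi
    # -N: list only, -V: verbose, -R core: tests whose name matches "core"
    (cd "$build_dir" && ctest -N -V -R core)
}

# Example call; prints an error when run outside a build tree.
show_core_test_command "${BUILD_DIR:-build}" || true
```

If this ctest version does not print the command with -N, running ctest -V -R core and interrupting it, as described in step 1, gives the same information.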

ktprograms commented 2 years ago

The segfaulting test is setup_call_cleanup:error_choice. How can I run under gdb?

Also https://github.com/SWI-Prolog/swipl-devel/commit/a375c6d0210d4ed62299815d96b79594820578ce doesn't seem to change any output (in the CMake output sigaltstack is found)

JanWielemaker commented 2 years ago

Interesting. Thanks. Hmm. This doesn't smell like C stack exhaustion. Possibly a corruption, though that is not very likely either. For gdb, do

gdb --args swipl args ...
(gdb) run
<crash> (hopefully)
(gdb) bt all

Preferably use the debug build (see CMAKE.md), but the bug may vanish, in which case we have to live with the default build.
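
A non-interactive variant, as a sketch: gdb's -batch, -ex and --args options run the program and print backtraces without a prompt. Note that gdb has no `bt all` command; backtraces for all threads are obtained with `thread apply all bt`. The wrapper name is hypothetical:

```shell
# Run a program under gdb in batch mode and print a backtrace of
# every thread once it stops (for example on SIGSEGV).
# Usage: gdb_backtrace PROGRAM [ARGS...]
gdb_backtrace() {
    if ! command -v gdb >/dev/null 2>&1; then
        echo "gdb not found" >&2
        return 1
    fi
    gdb -batch -ex run -ex 'thread apply all bt' --args "$@"
}

# Example (not executed here): run one test unit from the build directory.
# gdb_backtrace src/swipl -g 'run_tests(unit:test)' -t halt ../src/Tests/core/<file>
```

Running the goal through -g instead of typing it at the toplevel is an assumption about what still triggers the crash; if it does not reproduce that way, the interactive session above remains the reference.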

ktprograms commented 2 years ago

I've run it 3 times, and twice I got this backtrace:

?- run_tests(setup_call_cleanup:error_choice).
% PL-Unit: setup_call_cleanup:error_choice 
Thread 1 "swipl" received signal SIGSEGV, Segmentation fault.
0x000003fffdc65c18 in do_unify___LD (__PL_ld=0x3fffde965b0 <PL_local_data>, t1=0x3fffd67e518, t2=0x276d16e5674010) at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-prims.c:276
276     deRef(t2); w2 = *t2;
(gdb) bt
#0  0x000003fffdc65c18 in do_unify___LD (
    __PL_ld=0x3fffde965b0 <PL_local_data>, t1=0x3fffd67e518, 
    t2=0x276d16e5674010)
    at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-prims.c:276
#1  0x000003fffdc6674c in raw_unify_ptrs___LD (
    __PL_ld=0x3fffde965b0 <PL_local_data>, t1=0x3fffd67e518, t2=0x3fffd685000)
    at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-prims.c:421
#2  0x000003fffdc66b4e in unify_ptrs___LD (
    __PL_ld=0x3fffde965b0 <PL_local_data>, t1=0x3fffd67e518, t2=0x3fffd685000, 
    flags=3)
    at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-prims.c:507
#3  0x000003fffdc6705c in can_unify (t1=0x3fffd67e518, t2=0x3fffd685000, 
    ex=547)
    at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-prims.c:550
#4  0x000003fffdbb7ea4 in isCaughtInOuterQuery___LD (
    __PL_ld=0x3fffde965b0 <PL_local_data>, qid=479, ball=0)
    at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-wam.c:2005
#5  0x000003fffdbb894e in exception_hook___LD (
    __PL_ld=0x3fffde965b0 <PL_local_data>, pqid=479, fr=513, catchfr_ref=0)
    at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-wam.c:2128
#6  0x000003fffdbd78f8 in PL_next_solution___LD (
    __PL_ld=0x3fffde965b0 <PL_local_data>, qid=479)
    at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-vmi.c:5147
#7  0x000003fffdbb2c24 in call_term___LD (
    __PL_ld=0x3fffde965b0 <PL_local_data>, mdef=0x3fffda19f60, goal=409)
    at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-wam.c:800
#8  0x000003fffdbb2ea8 in callCleanupHandler___LD (
    __PL_ld=0x3fffde965b0 <PL_local_data>, fr=0x3fffd685c70, 
    reason=FINISH_EXTERNAL_EXCEPT_UNDO)
    at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-wam.c:847
#9  0x000003fffdbb2ffc in frameFinished___LD (
    __PL_ld=0x3fffde965b0 <PL_local_data>, fr=0x3fffd685c70, 
    reason=FINISH_EXTERNAL_EXCEPT_UNDO)
    at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-wam.c:866
#10 0x000003fffdbb9bc2 in discardChoicesAfter___LD (
    __PL_ld=0x3fffde965b0 <PL_local_data>, fr=0x3fffd685c18, 
    reason=FINISH_EXTERNAL_EXCEPT_UNDO)
    at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-wam.c:2450
#11 0x000003fffdbba2c4 in dbg_discardChoicesAfter___LD (
    __PL_ld=0x3fffde965b0 <PL_local_data>, fr=0x3fffd685c18, 
    reason=FINISH_EXTERNAL_EXCEPT_UNDO)
    at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-wam.c:2502
#12 0x000003fffdbda10a in PL_next_solution___LD (
    __PL_ld=0x3fffde965b0 <PL_local_data>, qid=25)
    at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-vmi.c:5344
#13 0x000003fffdc7f3c2 in query_loop (goal=32261, loop=1)
    at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-pro.c:147
#14 0x000003fffdc80442 in prologToplevel (goal=32261)
    at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-pro.c:496
#15 0x000003fffddaf9e2 in PL_toplevel ()
    at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-fli.c:4558
#16 0x000002aa00000a38 in main (argc=2, argv=0x3fffffffcc8)
    at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-main.c:143

But the first time I did it I got this backtrace: (ignore the no symbol all part)

?- run_tests(setup_call_cleanup:error_choice).

Thread 1 "swipl" received signal SIGSEGV, Segmentation fault.
0x000003fffdc6d41a in PL_same_term___LD (__PL_ld=0x3fffde965b0 <PL_local_data>, T1=175, T2=0) at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-prims.c:2103
2103      deRef(t2);
(gdb) bt all
No symbol "all" in current context.
(gdb) bt
#0  0x000003fffdc6d41a in PL_same_term___LD (
    __PL_ld=0x3fffde965b0 <PL_local_data>, T1=175, T2=0)
    at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-prims.c:2103
#1  0x000003fffdbb8e4c in exception_hook___LD (
    __PL_ld=0x3fffde965b0 <PL_local_data>, pqid=25, fr=149, catchfr_ref=0)
    at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-wam.c:2168
#2  0x000003fffdbd78f8 in PL_next_solution___LD (
    __PL_ld=0x3fffde965b0 <PL_local_data>, qid=25)
    at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-vmi.c:5147
#3  0x000003fffdc7f3c2 in query_loop (goal=32261, loop=1)
    at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-pro.c:147
#4  0x000003fffdc80442 in prologToplevel (goal=32261)
    at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-pro.c:496
#5  0x000003fffddaf9e2 in PL_toplevel ()
    at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-fli.c:4558
#6  0x000002aa00000a38 in main (argc=2, argv=0x3fffffffcc8)
    at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-main.c:143

It appears that the second backtrace (the short one, ending at frame #6) happens when I start typing before the prompt shows up (but then delete that input and type in the segfaulting query).

JanWielemaker commented 2 years ago

Thanks. Seems some term is corrupted during exception handling. I guess there are two ways out: one is to add more safety tests and hope this shows up on other platforms too, the other is for me to install the Qemu VM. At least it is not a stack overflow.

JanWielemaker commented 2 years ago

I understand you use a full VM now. I found https://wiki.qemu.org/Documentation/Platforms/S390X. qemu-system-s390x is in Ubuntu. Which kernel image did you use?

ktprograms commented 2 years ago

I downloaded https://dl-cdn.alpinelinux.org/alpine/v3.14/releases/s390x/alpine-standard-3.14.2-s390x.iso, and ran it with this command line:

qemu-system-s390x \
-smp cpus=1,sockets=1,cores=1,threads=1 \
-machine s390-ccw-virtio-6.1 \
-accel tcg,tb-size=1024 \
-boot menu=on \
-m 4096 \
-device virtio-blk-ccw,drive=drive0,bootindex=0 \
-drive if=none,media=cdrom,file=./alpine-standard-3.14.2-s390x.iso,id=drive0 \
-device virtio-net-ccw,mac=F2:B4:FB:A4:6A:93,netdev=net0 \
-netdev user,id=net0,hostfwd=tcp::2223-:22 \
-rtc base=localtime \
-serial mon:stdio \
-display none

Log in as root; there is no password. You will then need to build the swi-prolog package (either manually or from the APKBUILD).

You will need to run setup-alpine first. Don't select any disk to install to, and the system will run from RAM. Then, in the /etc/apk/repositories file, comment out the line that contains v3.14.2/main and uncomment the 3 lines that contain edge. After that, run apk upgrade, since some of the dependencies are only available in edge.
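
The repository edit can be sketched with sed. The sample file below is hypothetical (the example.invalid URLs and the exact layout are assumptions, not the contents of the real ISO); the function only prints the result, so nothing is overwritten:

```shell
# Print FILE with the v3.14.2/main line commented out and the
# commented-out edge lines enabled.
switch_to_edge() {
    sed -e '/v3\.14\.2\/main/ s/^/#/' \
        -e '/edge/ s/^#[[:space:]]*//' "$1"
}

# Hypothetical sample standing in for /etc/apk/repositories.
sample=$(mktemp)
cat > "$sample" <<'EOF'
http://example.invalid/alpine/v3.14.2/main
#http://example.invalid/alpine/edge/main
#http://example.invalid/alpine/edge/community
#http://example.invalid/alpine/edge/testing
EOF

switch_to_edge "$sample"
rm -f "$sample"
```

After checking the printed result, the same function could be pointed at the real file and its output written back before running apk upgrade.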

If you want to use the APKBUILD, you need to follow https://wiki.alpinelinux.org/wiki/Creating_an_Alpine_package, but clone https://gitlab.alpinelinux.org/ktprograms/aports and checkout the swi-prolog branch. (Then run abuild -r in the testing/swi-prolog folder).

If you want to manually build it, this will install the needed dependencies (Alpine has some different names): apk add alpine-sdk cmake db-dev gmp-dev libarchive-dev libedit-dev libunwind-dev libxext-dev libice-dev libjpeg-turbo-dev libxinerama-dev libxft-dev libxpm-dev libxt-dev ncurses-dev openssl-dev ossp-uuid-dev pcre-dev readline-dev samurai unixodbc-dev yaml-dev zlib-dev

The APKBUILD has information on build commands used and other stuff if you need it.

ktprograms commented 2 years ago

Seems some term is corrupted during exception handling

I just wonder why t2 is deRef'ed and then its pointer is immediately assigned to w2.

JanWielemaker commented 2 years ago

I just wonder why t2 is deRef'ed and then its pointer is immediately assigned to w2.

That is normal in (SWI-)Prolog. A term may be a reference link (which happens if two variables are unified), so to get at the term we first need to dereference it. SWI-Prolog uses pointers to terms, so next we need to get the value of the pointer. Most Prolog systems implement variables as a self reference. That makes some stuff easier. In SWI-Prolog's way though, we can put annotations on the variables. That is practical for many of the term analysis primitives.

ktprograms commented 2 years ago

Oh, I see. Dereference as in follow the pointer chain, not deallocate the memory. Never mind then.

JanWielemaker commented 2 years ago

I can confirm that it reproduces using qemu. So far, so good :smile:

JanWielemaker commented 2 years ago

The issue is fixed with SWI-Prolog/swipl-devel@e887b987d54c03992106ed8caac88a1609c77e28. It is now also clear why it just segfaults: s390x is not supported by the glibc stack unwinding API. The bug is platform independent and it is quite a miracle that it took so long to find a platform for it to surface ...

SWI-Prolog/swipl-devel@d769fa39708d17974f948c9ffc1b684f24ee8675 also fixes an s390x issue, although that is not of much practical value.

Thanks for your patience.

brebs-gh commented 2 years ago

The one test failure remaining with swi-prolog 8.5.2 on Alpine is on x86 (32-bit) architecture:

70/73 Test #70: bdb:bdb ..........................***Exception: SegFault  0.09 sec
% PL-Unit: bdb 
SWI-Prolog [thread 1 (main) at Sat Nov 13 15:39:01 2021]: received fatal signal 11 (segv)

JanWielemaker commented 2 years ago

I'm afraid this will be a "Won't fix". I can reproduce it inside an i386/alpine Docker container. It crashes inside the BDB DB open call. Unfortunately GDB doesn't produce a backtrace for this, so it is really hard to say anything sensible. Looks more like a BDB bug than a SWI-Prolog bug. I propose to drop bdb for this target. It isn't a very important package anyway.

ktprograms commented 2 years ago

Ok, thanks for the info.

@brebs-gh I think the way to go would be to remove db-dev from makedepends, then right below the makedepends="..., add this line: [ ! "$CARCH" = "x86" ] && makedepends="$makedepends db-dev"
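
The same condition written as an explicit if, as a sketch. The dependency list is shortened and hypothetical, and the function exists only so both cases can be shown; in the APKBUILD the test would act directly on $CARCH and makedepends:

```shell
# Hypothetical, shortened dependency list; the real APKBUILD has more.
base_makedepends="cmake gmp-dev zlib-dev"

# Echo the build dependencies for an architecture, leaving out
# db-dev (Berkeley DB) on 32-bit x86, where the bdb test crashes.
makedepends_for() {
    arch="$1"
    deps="$base_makedepends"
    if [ "$arch" != "x86" ]; then
        deps="$deps db-dev"
    fi
    echo "$deps"
}

makedepends_for x86     # prints: cmake gmp-dev zlib-dev
makedepends_for s390x   # prints: cmake gmp-dev zlib-dev db-dev
```

Unlike the one-line `[ ... ] && ...` form, the if statement leaves a zero exit status when the architecture is x86, which avoids surprises if the line ever ends up last in a function or in a script run with set -e.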

brebs-gh commented 2 years ago

All done, closing issue.