Closed brebs-gh closed 2 years ago
Please pass --output-on-failure to ctest so we get some idea about what is going wrong. I got an update from @EricZinda that might address the mqi failure.
The ctest details can be seen on the arch-specific build logs, e.g. mqi on x86 is missing python3.
python3 is now added.
Created an issue for protobufs at https://github.com/SWI-Prolog/contrib-protobufs/issues/14. I guess you can deal with the python3 dependency by adding this to the build requirements? Not sure what to make of the core test failure for s390x. The absence of any error report suggests a C stack overflow. I know that one of these tests also fails when using AddressSanitizer with default stack sizes. Would it be easy to use ulimit to raise the limit significantly?
@JanWielemaker The download link for the SWI-Prolog source isn't working anymore (I'm using https://www.swi-prolog.org/download/devel/src/swipl-8.5.0.tar.gz) and the GitHub tarballs don't seem to have the git submodules included.
Oops. My cleanup of old source archives was a bit too radical :cry: They are back again. It might take a little for the CDN to pick this up. eu.swi-prolog.org should have it immediately. Note there is now 8.5.1 :smile:
@JanWielemaker I'm doing the best I can to figure out the s390x segfault, but QEMU doesn't implement ptrace. When reading the core dump with gdb and running bt, it says "Backtrace stopped: previous frame identical to this frame (corrupt stack?)", so I guess you might be right about stack problems.
Thanks so far.
That is the nasty thing about stack overflows :cry: AFAIK there is no portable way to deal with these elegantly. Maybe I should give non-portable options a go ... Anyway, why not simply use ulimit to raise the limit and see whether that helps? If the crash vanishes we know this was the problem, and I propose to simply raise the limit for running the tests if this is easy.
If that doesn't help, do you have an easy-to-follow recipe to get the Qemu environment that you're using running?
Anyway, why not simply use ulimit to raise the limit and see whether that helps?
I'm trying to (with ulimit -s 16384) but for some reason it's not changing. I think it might be a QEMU bug. I'll let you know when I figure out how to increase the stack size if that was the fix.
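One way to rule out a setting that silently fails to apply is to change the limit in a subshell and print what a child process actually sees. The sketch below is only an illustration: run_with_stack is a hypothetical helper name, and it assumes a shell whose ulimit builtin accepts -s (bash, dash and BusyBox ash do).

```shell
# Hypothetical helper: run a command with a different stack limit (in KiB).
# The limit is changed inside a subshell, so the calling shell is unaffected.
run_with_stack() {
    stack_kb=$1
    shift
    (
        if ! ulimit -s "$stack_kb" 2>/dev/null; then
            echo "warning: could not set stack limit to $stack_kb" >&2
        fi
        "$@"
    )
}

# Show the limit a child process really gets:
#   run_with_stack 16384 sh -c 'ulimit -s'
# Run the failing tests under that limit:
#   run_with_stack 16384 ctest --output-on-failure -R core
```

If the printed value does not match the requested one, the limit is being ignored or capped before the tests even start.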
If that doesn't help, do you have an easy-to-follow recipe to get the Qemu environment that you're using running?
I'm basically following https://wiki.alpinelinux.org/wiki/How_to_make_a_cross_architecture_chroot (which I wrote), then building the alpine package after setting up the build environment as shown in https://wiki.alpinelinux.org/wiki/Creating_an_Alpine_package.
Thanks. I'm afraid I'm a bit too busy to dive into this quickly. If nothing changes I'll give qemu a shot next week.
In the end I used QEMU full system emulation, and even setting ulimit -s unlimited still causes a segfault.
If you need more info (gdb, etc), I should be able to provide it.
Unexpected. Maybe we should first figure out which test is to blame. You do that as follows:
1. Run ctest -V -R core to get the commandline executed. You can interrupt the test itself immediately as we only want the commandline.
2. Run that commandline without -q to make it more verbose. That prints the executed test files, the unit test bodies and a "." for each test that succeeded. Identify the failing test file.
3. Look up that file in ../src/Tests/core. Find the test unit and count the dots until you have the test.
4. Run src/swipl ../src/Tests/core/<file> and then, at the prompt:
?- run_tests(unit:test).
Hopefully this reproduces. That should give some hints and makes reproducing a lot quicker. You can then run under gdb. You may also apply SWI-Prolog/swipl-devel@a375c6d0210d4ed62299815d96b79594820578ce, which may make the crash more verbose.
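The narrowing-down procedure can be scripted roughly as follows. This is a sketch, not the project's own tooling: drop_quiet is a hypothetical helper, and the swipl command line in the comments is a placeholder, since the real one has to be copied from the ctest -V output.

```shell
# Step 1 (manual): print the command line that ctest runs, then interrupt it.
#   ctest -V -R core

# Hypothetical helper: echo a command line with every "-q" argument removed,
# so the test run prints file names, unit names and one "." per passing test.
drop_quiet() {
    out=""
    for arg in "$@"; do
        [ "$arg" = "-q" ] && continue
        out="${out:+$out }$arg"
    done
    printf '%s\n' "$out"
}

# Step 2: paste the command line from step 1 after drop_quiet, e.g.
#   drop_quiet src/swipl -q <remaining arguments from ctest -V>
# and run the printed command.
# Step 3: load only the failing file and run the single test:
#   src/swipl ../src/Tests/core/<file>
#   ?- run_tests(unit:test).
```

The helper joins arguments with single spaces, so arguments that themselves contain spaces need to be re-quoted by hand before running the printed command.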
The segfaulting test is setup_call_cleanup:error_choice. How can I run under gdb?
Also https://github.com/SWI-Prolog/swipl-devel/commit/a375c6d0210d4ed62299815d96b79594820578ce doesn't seem to change any output (in the CMake output sigaltstack is found)
Interesting. Thanks. Hmm. This doesn't smell like C stack exhaustion. Possibly a corruption, though that is not very likely either. For gdb, do
gdb --args swipl args ...
(gdb) run
<crash> (hopefully)
(gdb) bt all
Preferably use the debug build (see CMAKE.md), but the bug may vanish, in which case we have to live with the default build.
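For a crash that has to be reproduced several times, the same session can also be run non-interactively. A sketch, assuming gdb is installed: gdb_backtrace is a hypothetical wrapper, and the GDB variable exists only so a different debugger binary can be substituted. -batch, -ex and --args are standard gdb options, and thread apply all bt is the gdb command that prints a backtrace for every thread.

```shell
# Hypothetical wrapper: run a program under gdb in batch mode and print
# backtraces once it stops (for example on SIGSEGV).
gdb_backtrace() {
    "${GDB:-gdb}" -batch \
        -ex run \
        -ex bt \
        -ex "thread apply all bt" \
        --args "$@"
}

# Example (assumes the goal is passed with swipl's -g option so that no
# interactive input is needed; <file> is the failing test file):
#   gdb_backtrace src/swipl -g "run_tests(setup_call_cleanup:error_choice)" \
#       -t halt ../src/Tests/core/<file>
```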
I've run it 3 times, and twice I got this backtrace:
?- run_tests(setup_call_cleanup:error_choice).
% PL-Unit: setup_call_cleanup:error_choice
Thread 1 "swipl" received signal SIGSEGV, Segmentation fault.
0x000003fffdc65c18 in do_unify___LD (__PL_ld=0x3fffde965b0 <PL_local_data>, t1=0x3fffd67e518, t2=0x276d16e5674010) at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-prims.c:276
276 deRef(t2); w2 = *t2;
(gdb) bt
#0 0x000003fffdc65c18 in do_unify___LD (
__PL_ld=0x3fffde965b0 <PL_local_data>, t1=0x3fffd67e518,
t2=0x276d16e5674010)
at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-prims.c:276
#1 0x000003fffdc6674c in raw_unify_ptrs___LD (
__PL_ld=0x3fffde965b0 <PL_local_data>, t1=0x3fffd67e518, t2=0x3fffd685000)
at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-prims.c:421
#2 0x000003fffdc66b4e in unify_ptrs___LD (
__PL_ld=0x3fffde965b0 <PL_local_data>, t1=0x3fffd67e518, t2=0x3fffd685000,
flags=3)
at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-prims.c:507
#3 0x000003fffdc6705c in can_unify (t1=0x3fffd67e518, t2=0x3fffd685000,
ex=547)
at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-prims.c:550
#4 0x000003fffdbb7ea4 in isCaughtInOuterQuery___LD (
__PL_ld=0x3fffde965b0 <PL_local_data>, qid=479, ball=0)
at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-wam.c:2005
#5 0x000003fffdbb894e in exception_hook___LD (
__PL_ld=0x3fffde965b0 <PL_local_data>, pqid=479, fr=513, catchfr_ref=0)
at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-wam.c:2128
#6 0x000003fffdbd78f8 in PL_next_solution___LD (
__PL_ld=0x3fffde965b0 <PL_local_data>, qid=479)
at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-vmi.c:5147
--Type <RET> for more, q to quit, c to continue without paging--
#7 0x000003fffdbb2c24 in call_term___LD (
__PL_ld=0x3fffde965b0 <PL_local_data>, mdef=0x3fffda19f60, goal=409)
at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-wam.c:800
#8 0x000003fffdbb2ea8 in callCleanupHandler___LD (
__PL_ld=0x3fffde965b0 <PL_local_data>, fr=0x3fffd685c70,
reason=FINISH_EXTERNAL_EXCEPT_UNDO)
at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-wam.c:847
#9 0x000003fffdbb2ffc in frameFinished___LD (
__PL_ld=0x3fffde965b0 <PL_local_data>, fr=0x3fffd685c70,
reason=FINISH_EXTERNAL_EXCEPT_UNDO)
at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-wam.c:866
#10 0x000003fffdbb9bc2 in discardChoicesAfter___LD (
__PL_ld=0x3fffde965b0 <PL_local_data>, fr=0x3fffd685c18,
reason=FINISH_EXTERNAL_EXCEPT_UNDO)
at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-wam.c:2450
#11 0x000003fffdbba2c4 in dbg_discardChoicesAfter___LD (
__PL_ld=0x3fffde965b0 <PL_local_data>, fr=0x3fffd685c18,
reason=FINISH_EXTERNAL_EXCEPT_UNDO)
at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-wam.c:2502
#12 0x000003fffdbda10a in PL_next_solution___LD (
__PL_ld=0x3fffde965b0 <PL_local_data>, qid=25)
at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-vmi.c:5344
#13 0x000003fffdc7f3c2 in query_loop (goal=32261, loop=1)
--Type <RET> for more, q to quit, c to continue without paging--
at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-pro.c:147
#14 0x000003fffdc80442 in prologToplevel (goal=32261)
at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-pro.c:496
#15 0x000003fffddaf9e2 in PL_toplevel ()
at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-fli.c:4558
#16 0x000002aa00000a38 in main (argc=2, argv=0x3fffffffcc8)
at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-main.c:143
But the first time I did it I got this backtrace (ignore the "No symbol all" part):
?- run_tests(setup_call_cleanup:error_choice).
Thread 1 "swipl" received signal SIGSEGV, Segmentation fault.
0x000003fffdc6d41a in PL_same_term___LD (__PL_ld=0x3fffde965b0 <PL_local_data>, T1=175, T2=0) at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-prims.c:2103
2103 deRef(t2);
(gdb) bt all
No symbol "all" in current context.
(gdb) bt
#0 0x000003fffdc6d41a in PL_same_term___LD (
__PL_ld=0x3fffde965b0 <PL_local_data>, T1=175, T2=0)
at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-prims.c:2103
#1 0x000003fffdbb8e4c in exception_hook___LD (
__PL_ld=0x3fffde965b0 <PL_local_data>, pqid=25, fr=149, catchfr_ref=0)
at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-wam.c:2168
#2 0x000003fffdbd78f8 in PL_next_solution___LD (
__PL_ld=0x3fffde965b0 <PL_local_data>, qid=25)
at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-vmi.c:5147
#3 0x000003fffdc7f3c2 in query_loop (goal=32261, loop=1)
at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-pro.c:147
#4 0x000003fffdc80442 in prologToplevel (goal=32261)
at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-pro.c:496
#5 0x000003fffddaf9e2 in PL_toplevel ()
at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-fli.c:4558
#6 0x000002aa00000a38 in main (argc=2, argv=0x3fffffffcc8)
at /home/kt/aports/testing/swi-prolog/src/swipl-8.5.1/src/pl-main.c:143
It appears that the second backtrace (the shorter one) happens when I start typing before the prompt shows up (but then delete that input and type in the segfaulting goal).
Thanks. Seems some term is corrupted during exception handling. I guess there are two ways out, one is to add more safety tests and hope this will show up on other platforms too and the other is for me to install the Qemu VM. At least, it is not a stack overflow.
I understand you use a full VM now. I found https://wiki.qemu.org/Documentation/Platforms/S390X. qemu-system-s390x is in Ubuntu. Which kernel image did you use?
I downloaded https://dl-cdn.alpinelinux.org/alpine/v3.14/releases/s390x/alpine-standard-3.14.2-s390x.iso, and ran it with this command line:
qemu-system-s390x \
-smp cpus=1,sockets=1,cores=1,threads=1 \
-machine s390-ccw-virtio-6.1 \
-accel tcg,tb-size=1024 \
-boot menu=on \
-m 4096 \
-device virtio-blk-ccw,drive=drive0,bootindex=0 \
-drive if=none,media=cdrom,file=./alpine-standard-3.14.2-s390x.iso,id=drive0 \
-device virtio-net-ccw,mac=F2:B4:FB:A4:6A:93,netdev=net0 \
-netdev user,id=net0,hostfwd=tcp::2223-:22 \
-rtc base=localtime \
-serial mon:stdio \
-display none
The login is root with no password. You will then need to build the swi-prolog package (either manually or from the APKBUILD).
You will need to run setup-alpine (don't select any disk to install to, and it will run from RAM). Then in the /etc/apk/repositories file, comment out the line that contains v3.14.2/main, and uncomment the 3 lines that contain edge. After that run apk upgrade, since some of the dependencies are only available in edge.
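The repository switch can be done with sed instead of an editor. This is a sketch under assumptions: switch_to_edge is a hypothetical helper, and it relies on the file containing one repository URL per line, with disabled entries prefixed by "#". Check the real /etc/apk/repositories before running it.

```shell
# Hypothetical helper: comment out the release "main" repository and
# uncomment every line that mentions "edge" in the given repositories file.
switch_to_edge() {
    repo_file=$1
    sed -e '/v3\.14\.2\/main/ s/^#*/#/' \
        -e '/edge/ s/^#[[:space:]]*//' \
        "$repo_file" > "$repo_file.tmp" &&
        mv "$repo_file.tmp" "$repo_file"
}

# On the VM:
#   switch_to_edge /etc/apk/repositories
#   apk upgrade
```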
If you want to use the APKBUILD, you need to follow https://wiki.alpinelinux.org/wiki/Creating_an_Alpine_package, but clone https://gitlab.alpinelinux.org/ktprograms/aports and check out the swi-prolog branch. (Then run abuild -r in the testing/swi-prolog folder.)
If you want to manually build it, this will install the needed dependencies (Alpine has some different names):
apk add alpine-sdk cmake db-dev gmp-dev libarchive-dev libedit-dev libunwind-dev libxext-dev libice-dev libjpeg-turbo-dev libxinerama-dev libxft-dev libxpm-dev libxt-dev ncurses-dev openssl-dev ossp-uuid-dev pcre-dev readline-dev samurai unixodbc-dev yaml-dev zlib-dev
The APKBUILD has information on build commands used and other stuff if you need it.
Seems some term is corrupted during exception handling
I just wonder why t2 is deRef'ed and then its pointer is immediately assigned to w2
I just wonder why t2 is deRef'ed and then its pointer is immediately assigned to w2
That is normal in (SWI-)Prolog. A term may be a reference link (which happens if two variables are unified), so to get at the term we first need to dereference it. SWI-Prolog uses pointers to terms, so next we need to get the value of the pointer. Most Prolog systems implement variables as a self reference. That makes some stuff easier. In SWI-Prolog's way though, we can put annotations on the variables. That is practical for many of the term analysis primitives.
Oh, I see. Dereference as in follow the pointer chain, not deallocate the memory. Never mind then.
I can confirm that it reproduces using qemu. So far, so good :smile:
The issue is fixed with SWI-Prolog/swipl-devel@e887b987d54c03992106ed8caac88a1609c77e28. It is now also clear why it just segfaults: the s390x is not supported by glibc stack unwinding API. The bug is platform independent and it is quite a miracle that it took so long to find a platform for it to surface ...
SWI-Prolog/swipl-devel@d769fa39708d17974f948c9ffc1b684f24ee8675 also fixes an s390x issue, although that is not of much practical value.
Thanks for your patience.
I'm afraid this will be a "Won't fix". I can reproduce it inside an i386/alpine Docker container. It crashes inside the BDB DB open call. Unfortunately GDB doesn't produce a backtrace for this, and thus it is really hard to say anything sensible. Looks more like a BDB bug than a SWI-Prolog bug. I propose to drop bdb for this target. It isn't a very important package anyway.
Ok, thanks for the info.
@brebs-gh I think the way to go would be to remove db-dev from makedepends, then right below the makedepends="..." line, add this line: [ ! "$CARCH" = "x86" ] && makedepends="$makedepends db-dev"
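Written out as an if block, the same conditional looks like the sketch below. It is wrapped in a hypothetical function, makedepends_for, purely so the logic can be tried outside abuild; the two-package base list is a placeholder, not the real makedepends of the package.

```shell
# Hypothetical stand-in for the APKBUILD logic: print the build dependencies
# for a given architecture, adding db-dev everywhere except on x86.
makedepends_for() {
    carch=$1
    makedepends="cmake gmp-dev"   # placeholder list for illustration only
    if [ "$carch" != "x86" ]; then
        makedepends="$makedepends db-dev"
    fi
    printf '%s\n' "$makedepends"
}

# makedepends_for x86
# makedepends_for s390x
```

In the APKBUILD itself only the if block (or the one-line form above) is needed, with $CARCH supplied by abuild.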
All done, closing issue.
Hi, the new (draft) swi-prolog package for Alpine Linux (which uses Musl) shows some architecture-specific tests failing:
(In that merge request link, click on the green tick near "merge request pipeline" to show a drop-down of the architectures, then click on each architecture for the build log.)
The mqi:mqi test is mentioned on Discourse.