SWI-Prolog / swipl-devel

SWI-Prolog Main development repository
http://www.swi-prolog.org

SEGV in compileArgument() during build while processing docs(?) #1107

Open he32 opened 1 year ago

he32 commented 1 year ago

Trying to build swi-prolog on NetBSD/macppc 10.0_BETA fails, with the following process getting a SEGV:

../src/swipl -x /usr/pkgsrc/lang/swi-prolog-lite/work/swipl-8.0.2/build/man/pldoc2tex -- --source=/usr/pkgsrc/lang/swi-prolog-lite/work/swipl-8.0.2/man --summaries --out=lib/clpfdlib.tex --lib=clpfd --module=clpfd 'lib/clpfdlib.md'

It wasn't particularly easy to find how to configure it for a debug build, but I managed to hack it to produce a debug version eventually. The GDB session reveals that it looks like it's trying to store into unmapped memory(?)

% env LD_LIBRARY_PATH=../src gdb ../src/swipl
GNU gdb (GDB) 11.0.50.20200914-git
...
(gdb) run -x /usr/pkgsrc/lang/swi-prolog-lite/work/swipl-8.0.2/build/man/pldoc2tex -- --source=/usr/pkgsrc/lang/swi-prolog-lite/work/swipl-8.0.2/man --summaries --out=lib/clpfdlib.tex --lib=clpfd --module=clpfd 'lib/clpfdlib.md'
Starting program: /usr/pkgsrc/lang/swi-prolog-lite/work/swipl-8.0.2/build/src/swipl -x /usr/pkgsrc/lang/swi-prolog-lite/work/swipl-8.0.2/build/man/pldoc2tex -- --source=/usr/pkgsrc/lang/swi-prolog-lite/work/swipl-8.0.2/man --summaries --out=lib/clpfdlib.tex --lib=clpfd --module=clpfd 'lib/clpfdlib.md'
[New LWP 19586 of process 29115]

Thread 1 "" received signal SIGSEGV, Segmentation fault.
0xfdd5a7f0 in compileArgument (arg=<optimized out>, arg@entry=0xfc8c87c0, 
    where=where@entry=1, ci=ci@entry=0xffffc894, 
    __PL_ld=__PL_ld@entry=0xfded4bd8 <PL_local_data>)
    at /usr/pkgsrc/lang/swi-prolog-lite/work/swipl-8.0.2/src/pl-comp.c:2256
2256          Output_n(ci, c, p, n+1);
(gdb) where
#0  0xfdd5a7f0 in compileArgument (arg=<optimized out>, arg@entry=0xfc8c87c0, 
    where=where@entry=1, ci=ci@entry=0xffffc894, 
    __PL_ld=__PL_ld@entry=0xfded4bd8 <PL_local_data>)
    at /usr/pkgsrc/lang/swi-prolog-lite/work/swipl-8.0.2/src/pl-comp.c:2256
#1  0xfdd641c4 in compileClause (cp=0xffffcd7c, head=<optimized out>, 
    body=0xfc94d2c4, proc=0xfcf65ee0, module=0xfcf851e0, 
    warnings=<optimized out>, __PL_ld=0xfded4bd8 <PL_local_data>)
    at /usr/pkgsrc/lang/swi-prolog-lite/work/swipl-8.0.2/src/pl-comp.c:1645
#2  0x6c206e61 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb)  
(gdb) list
2251        } else
2252        { Word p = addressIndirect(*arg);
2253          size_t n = wsizeofInd(*p);
2254          int c = (where & A_HEAD) ? H_STRING : B_STRING;
2255    
2256          Output_n(ci, c, p, n+1);
2257          return TRUE;
2258        }
2259      }
2260    
(gdb) p c
$5 = <optimized out>
(gdb) p where
$6 = 1
(gdb) p A_HEAD
$7 = 1
(gdb) p H_STRING
$8 = H_STRING
(gdb) p n
$9 = 7186
(gdb) p ci
$10 = (compileInfo *) 0xffffc894
...
(gdb) i reg
...
r8             0xffffeffc          4294963196
...
pc             0xfdd5a7f0          0xfdd5a7f0 <compileArgument+412>
...
(gdb) x/i 0xfdd5a7f0
=> 0xfdd5a7f0 <compileArgument+412>:        stw     r7,4(r8)
(gdb) p/x 0xffffeffc+4
$12 = 0xfffff000
(gdb) x/x 0xfffff000
0xfffff000:       Cannot access memory at address 0xfffff000
(gdb) x/x 0xffffeffc
0xffffeffc:       0x4974205f
(gdb)

JanWielemaker commented 1 year ago

Thanks for the report. The first step would be to try version 9.0.4; lots of issues were fixed since. If that still crashes, try using the debug build (CMAKE_BUILD_TYPE=Debug, see CMAKE.md).

he32 commented 1 year ago

Thanks for the hint. This uncovers a new (undocumented?) dependency:

/usr/pkgsrc/lang/swi-prolog-lite/work/swipl-9.0.4/packages/sweep/sweep.c:33:10: fatal error: emacs-module.h: No such file or directory
   33 | #include <emacs-module.h>
      |          ^~~~~~~~~~~~~~~~
compilation terminated.

I'll rummage around and see where I can satisfy it.

JanWielemaker commented 1 year ago

Part of GNU emacs. If this file does not exist however, it should simply not try to compile the sweep package. See packages/sweep/CMakeLists.txt

he32 commented 1 year ago

Yes. Part of the problem is that I'm trying to fit this into NetBSD's pkgsrc. Even though I have emacs installed, and cmake possibly detects that, the actual build happens in a restricted environment where the required packages have to be "properly declared". I'm still trying to sort out what the magic is. For now I've hacked the given CMakeLists.txt file to get around this hurdle.

he32 commented 1 year ago

It looks like this bug is also present in 9.0.4, ref:

(gdb) run -f none --no-packs -x /usr/pkgsrc/lang/swi-prolog-lite/work/swipl-9.0.4/build/man/pldoc2tex -- --source=/usr/pkgsrc/lang/swi-prolog-lite/work/swipl-9.0.4/man --out=lib/clpfdlib.tex --lib=clpfd --module=clpfd --summaries lib/clpfdlib.md
Starting program: /usr/pkgsrc/lang/swi-prolog-lite/work/swipl-9.0.4/build/src/swipl -f none --no-packs -x /usr/pkgsrc/lang/swi-prolog-lite/work/swipl-9.0.4/build/man/pldoc2tex -- --source=/usr/pkgsrc/lang/swi-prolog-lite/work/swipl-9.0.4/man --out=lib/clpfdlib.tex --lib=clpfd --module=clpfd --summaries lib/clpfdlib.md
[New LWP 12763 of process 21297]

Thread 1 "" received signal SIGSEGV, Segmentation fault.
0xfdcdad44 in compileArgument___LD (
    __PL_ld=__PL_ld@entry=0xfdea4f08 <PL_local_data>, arg=<optimized out>, 
    arg@entry=0xfc597f50, where=where@entry=1, ci=ci@entry=0xffffd044)
    at /usr/pkgsrc/lang/swi-prolog-lite/work/swipl-9.0.4/src/pl-comp.c:2586
2586          Output_n(ci, c, p, n+1);
(gdb) where
#0  0xfdcdad44 in compileArgument___LD (
    __PL_ld=__PL_ld@entry=0xfdea4f08 <PL_local_data>, arg=<optimized out>, 
    arg@entry=0xfc597f50, where=where@entry=1, ci=ci@entry=0xffffd044)
    at /usr/pkgsrc/lang/swi-prolog-lite/work/swipl-9.0.4/src/pl-comp.c:2586
#1  0xfdce48e4 in compileClauseGuarded___LD (
    __PL_ld=__PL_ld@entry=0xfdea4f08 <PL_local_data>, ci=ci@entry=0xffffd044, 
    cp=cp@entry=0xffffd1dc, head=head@entry=0xfc5f7b8c, 
    body=body@entry=0xfc5f7b90, proc=proc@entry=0xfd80d230, 
    module=module@entry=0xfc9bc410, warnings=warnings@entry=741, 
    flags=<optimized out>, flags@entry=2)
    at /usr/pkgsrc/lang/swi-prolog-lite/work/swipl-9.0.4/src/pl-comp.c:1922
#2  0xfdce572c in compileClause___LD (__PL_ld=0xfdea4f08 <PL_local_data>, 
    cp=0xffffd1dc, head=0xfc5f7b8c, body=0xfc5f7b90, proc=0xfd80d230, 
    module=0xfc9bc410, warnings=741, flags=2)
    at /usr/pkgsrc/lang/swi-prolog-lite/work/swipl-9.0.4/src/pl-comp.c:1827
#3  0x6973206c in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) p ci
$1 = (compileInfo *) 0xffffd044
(gdb) p c
$2 = <optimized out>
(gdb) p p
$3 = (Word) 0xfc4f6464
(gdb) p n
$4 = 7196
(gdb) i reg
r0             0xfdcdacd0          4258114768
r1             0xffffcf10          4294954768
r2             0xfdee1008          4260237320
r3             0xffffd044          4294955076
r4             0xfdea4b00          4259990272
r5             0x1                 1
r6             0x41d8              16856
r7             0x2c20616e          740319598
r8             0xfffff000          4294963200
r9             0xfffff004          4294963204
r10            0xfc4f9304          4233073412
r11            0xfdcaa100          4257915136
r12            0x48028448          1208124488
r13            0x1828030           25329712
r14            0x6808d             426125
r15            0x4                 4
r16            0x4                 4
r17            0xfdea4f08          4259991304
r18            0xfdc77edc          4257709788
r19            0xffffd1dc          4294955484
r20            0xfdea4f28          4259991336
r21            0x1                 1
r22            0xffffd044          4294955076
r23            0xffffcf28          4294954792
r24            0x1                 1
r25            0xffffcf24          4294954788
r26            0xffffd11c          4294955292
r27            0xfdc77edc          4257709788
r28            0x7074              28788
r29            0xfc4f6464          4233061476
r30            0xfdea5730          4259993392
r31            0x1c1d              7197
pc             0xfdcdad44          0xfdcdad44 <compileArgument___LD+404>
msr            <unavailable>
cr             0x48028448          1208124488
lr             0xfdcdacd0          0xfdcdacd0 <compileArgument___LD+288>
ctr            0xa3b               2619
xer            0x20000000          536870912
fpscr          0xfff80000          -524288
vscr           <unavailable>
vrsave         <unavailable>
(gdb) x/i 0xfdcdad44
=> 0xfdcdad44 <compileArgument___LD+404>:       stw     r7,-4(r9)
(gdb) p/x 0xfffff004-4
$5 = 0xfffff000
(gdb) x/x 0xfffff000
0xfffff000:     Cannot access memory at address 0xfffff000
(gdb) x/x 0xffffefff
0xffffefff:     Cannot access memory at address 0xfffff000
(gdb) x/x 0xffffeff0
0xffffeff0:     0x20657870
(gdb) x/x 0xffffeffc
0xffffeffc:     0x0a757365
(gdb)

JanWielemaker commented 1 year ago

Hmm. Doesn't make much sense to me. This code has been in use for many years with many different compilers and has been executed under valgrind, AddressSanitizer, etc. I'd first try to build using the Debug build type. At least that should give more details (hoping the problem persists).
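The numbers in the two GDB sessions above can be cross-checked with a little arithmetic. The sketch below is only a reading of that output, not a diagnosis. It assumes 4-byte words (32-bit ppc), and the helper names are hypothetical. It computes how many bytes `Output_n(ci, c, p, n+1)` is asked to emit and how far the faulting address 0xfffff000 lies above `ci`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* WORD_BYTES is an assumption: on 32-bit PowerPC one Prolog word is
   taken to be 4 bytes. FAULT_ADDR is the address GDB could not access
   in both sessions. */
#define WORD_BYTES 4u
#define FAULT_ADDR 0xfffff000u

/* Bytes covered by n+1 words, as in Output_n(ci, c, p, n+1). */
uint32_t bytes_to_emit(uint32_t n)
{ return (n + 1u) * WORD_BYTES;
}

/* Distance in bytes from the compileInfo pointer to the faulting address. */
uint32_t room_above(uint32_t ci)
{ assert(ci <= FAULT_ADDR);
  return FAULT_ADDR - ci;
}

/* Hypothetical reporting helper: one line per GDB session. */
void report(const char *version, uint32_t ci, uint32_t n)
{ printf("%s: emit %u bytes, fault is %u bytes above ci\n",
         version, (unsigned)bytes_to_emit(n), (unsigned)room_above(ci));
}
```

Calling `report("8.0.2", 0xffffc894u, 7186u)` gives 28748 bytes to emit against 10092 bytes between `ci` and the fault. `report("9.0.4", 0xffffd044u, 7196u)` gives 28788 bytes (which matches r28 = 0x7074 in the register dump) against 8124. The traces do not show whether the store really targets a buffer inside `ci`, so this is a hint at most.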

he32 commented 1 year ago

CMAKE_BUILD_TYPE=Debug

Did that, and, annoyingly, this doesn't reproduce the problem. I think this means that this went from a swi-prolog issue to being a gcc compiler issue for this platform (32-bit ppc), with gcc version 10.3.0 (nb1 20210411).

Not sure that made it that much easier for me, but probably good news at your end.

Or... Maybe not. I built swipl with clang version 15.0.7, and got the same problem:

[ 81%] Generating lib/clpfdlib.tex
cd /usr/pkgsrc/lang/swi-prolog-lite/work/swipl-9.0.4/build/man && ../src/swipl -f none --no-packs -x /usr/pkgsrc/lang/swi-prolog-lite/work/swipl-9.0.4/build/man/pldoc2tex -- --source=/usr/pkgsrc/lang/swi-prolog-lite/work/swipl-9.0.4/man --out=lib/clpfdlib.tex --lib=clpfd --module=clpfd --summaries lib/clpfdlib.md
[1]   Segmentation fault      ../src/swipl -f none --no-packs -x /usr/pkgsrc...
*** [man/lib/clpfdlib.tex] Error code 139

Need to tweak the build to produce debug info by default, and re-do, to verify that it bombs at the same place.

JanWielemaker commented 1 year ago

Gets nasty. Yes, two compilers doing it wrong sounds unlikely. Alignment issues come to mind. On the other hand, we also have arm, which passes for 32 and 64 bits (Linux). Note that we also have Debian which passes all platforms including ppc AFAIK. Debian testing is at gcc 12 though.

I guess some good old print statements may shed some light ... Use Sdprintf(), which has the same signature as printf() (with some extensions), to print debug output.
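As a rough sketch of what such a print statement might look like just before the crashing call: the helper name `trace_output_n` and its parameters are hypothetical, and the `Sdprintf` macro below is only a stand-in so the fragment compiles outside the SWI-Prolog tree. Inside `pl-comp.c` the real `Sdprintf()` would be used and the stand-in dropped.

```c
#include <stdio.h>
#include <stddef.h>

/* Stand-in for building this fragment on its own: maps Sdprintf() to
   stderr. Remove it when the code lives inside the SWI-Prolog sources. */
#ifndef Sdprintf
#define Sdprintf(...) fprintf(stderr, __VA_ARGS__)
#endif

/* Hypothetical helper: log the arguments of the copy just before
   Output_n(ci, c, p, n+1) runs. Returns the number of characters
   printed, as printf() does. */
int trace_output_n(void *ci, void *p, size_t n)
{ return Sdprintf("Output_n: ci=%p p=%p n=%lu words\n",
                  ci, p, (unsigned long)n);
}
```

Placed right before line 2586 of `pl-comp.c` as `trace_output_n(ci, p, n);`, this would show whether `n` and `p` look sane on the last call before the SEGV.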

Please keep me posted.

he32 commented 1 year ago

The debug build with clang bombed in a slightly different place:

[ 76%] Build home/library/clp/INDEX.pl

SWI-Prolog [thread 2 (gc) at Mon Jan 30 21:42:21 2023]: received fatal signal 11 (segv)
[1]   Segmentation fault (core dumped) ./swipl -f none --no-packs -t halt --home=/usr...
--- home/library/clp/INDEX.pl ---
*** [home/library/clp/INDEX.pl] Error code 139

Will dig out details later.

he32 commented 1 year ago

Here's a bit more about what this is about:

--- src/CMakeFiles/library_index_library_clp.dir/all ---
[ 76%] Build home/library/clp/INDEX.pl
cd /usr/pkgsrc/lang/swi-prolog-lite/work/swipl-9.0.4/build/src && ./swipl -f none --no-packs -t halt --home=/usr/pkgsrc/lang/swi-prolog-lite/work/swipl-9.0.4/build/home -q -g "make_library_index('/usr/pkgsrc/lang/swi-prolog-lite/work/swipl-9.0.4/build/home/library/clp')" --

SWI-Prolog [thread 2 (gc) at Tue Jan 31 18:37:21 2023]: received fatal signal 11 (segv)
--- src/CMakeFiles/library_index_library_clp_always.dir/all ---
--- home/library/clp/__INDEX.pl ---
--- src/CMakeFiles/library_index_library_clp.dir/all ---
[1]   Segmentation fault (core dumped) ./swipl -f none --no-packs -t halt --home=/usr...
*** [home/library/clp/INDEX.pl] Error code 139
...
bramley: {309} gdb swipl swipl.core
GNU gdb (GDB) 11.0.50.20200914-git
...
Reading symbols from swipl...
[New process 5599]
[New process 28452]
Core was generated by `swipl'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  fetchop (PC=0xfd6c3440)
    at /usr/pkgsrc/lang/swi-prolog-lite/work/swipl-9.0.4/src/pl-comp.h:93
93        if ( unlikely(op == D_BREAK) )
[Current thread is 1 (process 5599)]
(gdb) i thread
  Id   Target Id         Frame 
* 1    process 5599      fetchop (PC=0xfd6c3440)
    at /usr/pkgsrc/lang/swi-prolog-lite/work/swipl-9.0.4/src/pl-comp.h:93
  2    process 28452     0xfda10ea8 in kill () from /usr/lib/libc.so.12
(gdb) thread 2
[Switching to thread 2 (process 28452)]
#0  0xfda10ea8 in kill () from /usr/lib/libc.so.12
(gdb) where
#0  0xfda10ea8 in kill () from /usr/lib/libc.so.12
#1  0xfde0d538 in sigCrashHandler (sig=11)
    at /usr/pkgsrc/lang/swi-prolog-lite/work/swipl-9.0.4/src/os/pl-cstack.c:1081
#2  0xfdd29b6c in alt_segv_handler (sig=11)
    at /usr/pkgsrc/lang/swi-prolog-lite/work/swipl-9.0.4/src/pl-setup.c:778
#3  <signal handler called>
#4  lookupHTable___LD (__PL_ld=0xfcec8000, ht=0x0, name=0x19385)
    at /usr/pkgsrc/lang/swi-prolog-lite/work/swipl-9.0.4/src/os/pl-table.c:517
#5  0xfdce01b8 in lookupModule___LD (__PL_ld=0xfcec8000, name=103301)
    at /usr/pkgsrc/lang/swi-prolog-lite/work/swipl-9.0.4/src/pl-modul.c:139
#6  0xfddc023c in PL_predicate (name=<optimized out>, arity=<optimized out>, 
    module=0xfde2baf5 "system")
    at /usr/pkgsrc/lang/swi-prolog-lite/work/swipl-9.0.4/src/pl-fli.c:4178
#7  0xfddc0164 in _PL_predicate (name=<optimized out>, arity=<optimized out>, 
    module=<optimized out>, bin=0xfdea4480 <PL_global_data+3192>)
    at /usr/pkgsrc/lang/swi-prolog-lite/work/swipl-9.0.4/src/pl-fli.h:299
#8  0xfdd57fb4 in GCmain (closure=<optimized out>)
    at /usr/pkgsrc/lang/swi-prolog-lite/work/swipl-9.0.4/src/pl-thread.c:6527
#9  0xfd9ae204 in ?? () from /usr/lib/libpthread.so.1
#10 0xfda7e2b0 in __mknod50 () from /usr/lib/libc.so.12
Backtrace stopped: frame did not save the PC
(gdb) up 4
#4  lookupHTable___LD (__PL_ld=0xfcec8000, ht=0x0, name=0x19385)
    at /usr/pkgsrc/lang/swi-prolog-lite/work/swipl-9.0.4/src/os/pl-table.c:517
517       acquire_kvs(ht, kvs);
(gdb) p/x ht
$1 = 0x0
(gdb) up
#5  0xfdce01b8 in lookupModule___LD (__PL_ld=0xfcec8000, name=103301)
    at /usr/pkgsrc/lang/swi-prolog-lite/work/swipl-9.0.4/src/pl-modul.c:139
139       if ( (m = lookupHTable(GD->tables.modules, (void*)name)) )
(gdb) p PL_global_data.tables
$2 = {modules = 0x0}
(gdb) 

Looks like a "simple" null pointer de-reference, but ... why would the "modules" not be filled with actual data?

JanWielemaker commented 1 year ago

Rather odd. Thread 2 is the global object (atom, clause) garbage collector thread. This looks like the startup of this thread, trying to find the predicate to call. The module table is shared and must already have been created long before by the main thread. So, this makes no sense unless something writes to the wrong address. The way forward is probably to run under gdb, set a breakpoint at the place where the table is created (pl-modul.c:292 in the dev source) and then set a hardware watchpoint on PL_global_data.tables.modules.

Looks like there is something fundamentally wrong with this platform. I vaguely recall there were issues in the POSIX thread implementation of NetBSD that caused problems with SWI-Prolog, but that is rather long ago.
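One cheap way to rule the platform in or out is a standalone check of the guarantee being relied on here: data written before `pthread_create()` must be visible to the new thread. The following is a minimal sketch with hypothetical names, not SWI-Prolog code. It would not catch a stray write, only a broken threads library.

```c
#include <pthread.h>
#include <stddef.h>

/* Stand-in for a shared table pointer such as GD->tables.modules;
   these names are hypothetical and not taken from SWI-Prolog. */
static int   table_storage;
static void *shared_table = NULL;

/* Thread body: report through *arg whether the table was visible. */
static void *reader(void *arg)
{ int *seen = arg;

  *seen = (shared_table != NULL);
  return NULL;
}

/* Set the pointer in the calling thread, then start a reader thread.
   Returns 1 if the reader saw a non-NULL pointer, 0 if it saw NULL,
   and -1 if a pthread call failed. */
int table_visible_in_new_thread(void)
{ pthread_t tid;
  int seen = 0;

  shared_table = &table_storage;
  if ( pthread_create(&tid, NULL, reader, &seen) != 0 )
    return -1;
  if ( pthread_join(tid, NULL) != 0 )
    return -1;
  return seen;
}
```

Build it with the compiler's pthread option (for example `-pthread` with gcc or clang) and call `table_visible_in_new_thread()` in a loop. A result of 0 would point at the platform rather than at SWI-Prolog.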

What is the issue in thread 1? Is PC invalid? Or is it decode(), which does a lookup in a global array? The == test can't crash :smile: That too is pretty weird for running a simple Prolog program. I don't think I've ever seen fetchop() crash.

he32 commented 1 year ago

I believe I have news on this. Since I do this build on a dual-CPU macppc, I also ran the build with "make -j 2". However, it now looks like the build setup of swi-prolog has not made all inter-dependencies between the different build steps explicit, so a parallel build is not safe.

I have now done a build with gcc 12, which initially crashed, and setting pkgsrc's MAKE_JOBS=1 in /etc/mk.conf made the build succeed. I've also reverted back to the in-tree gcc 10, and reducing the build job parallelism down to 1 makes that build succeed as well.

JanWielemaker commented 1 year ago

Thanks for all the work. Concurrent dependency issues are not that likely to cause this. They don't match the crashes, and concurrent builds are used on almost all platforms these days. What is more likely is that we are dealing with internal thread issues that become apparent because you run two make jobs, which typically leads to more than 2 threads. Now, this has also been tested in many extreme scenarios. I'm not claiming there is no bug related to thread synchronization, but this is rather extreme. Especially the lost module table is really weird.

What happens on ctest, in particular depending on the number of jobs? That should work pretty much OK, except for a socket allocation issue that causes http:proxy tests to fail occasionally: they can run concurrently with another test on sockets, and then the tests sometimes use each other's sockets :cry: