codereport / jsource

J Language Source Code. Livestream links ⬇️
https://www.youtube.com/playlist?list=PLVFrD1dmDdvfVhYLU_iKkV67X9XqCJLWe
Other
38 stars 20 forks source link

Segfaults on test g420stch.ijs #166

Open herwinw opened 3 years ago

herwinw commented 3 years ago

It might be something weird on my machine, since I haven't seen these show up in the CI runs. My build consistently fail on this one test, regardless whether I build with GCC or Clang, regardless whether it's a debug or release build. It has been like this since my first checkout, so recent refactors are not likely to be the culprit.

$ build/jsrc/Debug/jconsole build/test/Debug/g420stch.ijs
Hello YouTube Viewers, all 22 of you!
see: tsu_notes, tsu_usage, tsu_pacman, and tsu_jd

   RUN  ddall  NB. report scripts that fail
   RECHO ddall NB. echo script names as run and final count of failures

Segmentation fault

Removing this single line from the test stops the segfault

'length error' -: ,.&.>/ etx 1 2;3 4 5

Replacing the expected error message with something else causes the test to fail, but it does prevent the segfault.

I've managed to reproduce in the console with the following code:

join=: ,.&.>/
join 1 2;3 4 5

That join is copied from jlibrary/system/util/pm.ijs, using the operation directly does not result in a segfault.

GDB shows the location of the error

0x00007ffff73c3cab in jtgaf (jt=0x446000, blockx=6) at ../jsrc/m.c:1066
1066                    jt->mfree[-PMINL + 1 + blockx].pool = AFCHAIN(z);  // remove & use the head of the free chain

(gdb) info locals
pushp = 0x6c8db0
z = 0x40
mfreeb = 259200
n = 128

Looking at jt->mfree, there is one very suspicious looking entry, that contains the 0x40 value seen in the z variable:

(gdb) print jt->mfree
$4 = {{ballo = 614592, pool = 0x921cc0}, {ballo = 259200, pool = 0x40}, {ballo = 985344, pool = 0x6c9440}, {ballo = 922112, pool = 0x5f6980}, {ballo = 977920, pool = 0x6795c0}}

PMINL is defined as 6 (jsrc/m.h), the argument blockx equals 6, so the calculation would indeed get the second entry. We've gotten this block a few lines earlier in the m.c file:

z = jt->mfree[-PMINL + 1 + blockx].pool;  // tentatively use head of free list as result - normal case, and even if
                                          // blockx is out of bounds will not segfault

Yes, that comment made me grin.

The segfault itself is caused by the AFCHAIN macro, also in m.c:

#define AFCHAIN(a) ((a)->kchain.chain)            // the chain field, when the block is not allocated

So I guess something returns a nullptr, the first 8 bytes of that something are interpreted as some data, the remainder is memory that is supposed to be usable.

Since this is happening to me with multiple compilers, and nobody else appears to be having this problem, it might be a problem with my libc or something like that.

I'll try to look into it deeper in the next few days

juntuu commented 3 years ago

I also investigated this a little, it's been known issue on osx for some time: https://github.com/codereport/jsource/pull/23#issue-561300714

I bisected one day on the failing case and (at least on osx) the issue was likely introduced in 46c45895c82f4f8b3cbcf256c5d3e028fd768244

herwinw commented 3 years ago

I tried compiling the parent of that commit, but that didn't solve the issue, guess it's not completely the same issue.

herwinw commented 3 years ago

Somewhere between commit ea03e122 and commit 88eb2d83 it has been changed from a segfault into an illegal exception. That first commit is the last one of the original source.

juntuu commented 3 years ago

Might be undefined behavior somewhere, there are at least some "sequence point" warnings in the build. Also got some uninitialized and garbage value warnings with clang analyzer.

herwinw commented 3 years ago

Some more experiments:

So at least we've got one thing that looks kind of suspicious.

herwinw commented 3 years ago

As a side note: I'm pretty much using this ticket as a whiteboard, subscribing to changes of it might get a bit spammy.

herwinw commented 3 years ago

Added a few extra lines to m.c

if (z) {                                               // allocate from a chain of free blocks
    jt->mfree[-PMINL + 1 + blockx].pool = AFCHAIN(z);  // remove & use the head of the free chain
    if (jt->mfree[-PMINL + 1 + blockx].pool == (A)0x40) {
        (void)1;
    }

This opens op the possibility for a breakpoint on the (void)1; line. This is indeed the location where the assignment happens (but of course not where the problem starts. A few lines from the debugger:

(gdb) p *z
$1 = {kchain = {k = 64, chain = 0x40, globalst = 0x40}, flag = 0, mback = {m = 7114136, back = 0x6c8d98, zaploc = 0x6c8d98}, tproxy = {t = 2, proxychain = 0x2}, c = 1, n = 1, r = 1, h = 1222, fill = 0, s = {1}}
(gdb) p jt->mfree
$2 = {{ballo = 614592, pool = 0x921cc0}, {ballo = 259072, pool = 0x40}, {ballo = 985344, pool = 0x6c9440}, {ballo = 922112, pool = 0x5f6980}, {ballo = 977920, pool = 0x6795c0}}
(gdb) p jt->mfree[0].pool
$3 = (A) 0x921cc0
(gdb) p *jt->mfree[0].pool
$4 = {kchain = {k = 6664000, chain = 0x65af40, globalst = 0x65af40}, flag = 131104, mback = {m = 4521984, back = 0x450000, zaploc = 0x450000}, tproxy = {t = 32, proxychain = 0x20}, c = 1, n = 36, r = 1, h = 1545, fill = 0, s = {36}}
(gdb) p *jt->mfree[1].pool
Cannot access memory at address 0x40
(gdb) p *jt->mfree[2].pool
$5 = {kchain = {k = 7117376, chain = 0x6c9a40, globalst = 0x6c9a40}, flag = 0, mback = {m = 4515848, back = 0x44e808, zaploc = 0x44e808}, tproxy = {t = 4, proxychain = 0x4}, c = 1, n = 10, r = 2, h = 148, fill = 0, s = {5}}
(gdb) p *jt->mfree[3].pool
$6 = {kchain = {k = 6254464, chain = 0x5f6f80, globalst = 0x5f6f80}, flag = 0, mback = {m = 4515856, back = 0x44e810, zaploc = 0x44e810}, tproxy = {t = 1, proxychain = 0x1}, c = 1, n = 416, r = 2, h = 136, fill = 0, s = {104}}
(gdb) p *jt->mfree[4].pool
$7 = {kchain = {k = 6787520, chain = 0x6791c0, globalst = 0x6791c0}, flag = 0, mback = {m = 4515912, back = 0x44e848, zaploc = 0x44e848}, tproxy = {t = 16, proxychain = 0x10}, c = 0, n = 45, r = 2, h = 80, fill = 0, s = {5}}
(gdb) info locals
pushp = 0x6c8da8
z = 0x721c80
mfreeb = 259072
n = 128

The definition of A (which is a pointer to struct AD) shows a few unions, it looks like we're using a single data type for both memory allocation information and values. The memory has a linked list, and somehow we've ended up with a memory node that is actually the array [1]. We're trying to read the value of chain, but we actually read the value of k that is unioned in the same location. And 1 dimensional arrays have k = 64 = 0x40 (source).

juntuu commented 3 years ago

Played around a bit with different sanitizer flags: