Closed KungFuJesus closed 3 years ago
Thanks @KungFuJesus. I'm not quite sure how to debug this, without access to a ppc64 machine. Does gentoo have any common ones for this purpose?
I'm not quite sure if they do. Adding to this complexity is the fact that the PPC64 machine I'm using is explicitly big endian, whereas a lot of the modern variants of PPC64 are bi-endian and tend to prefer little endian.
Given that the code generator is just vanilla C and not emitting anything like inline assembly, I'd say the issue is likely endianness. If you need to throw me a patch to test, I should able to. It might even be possible that the bug is endian independent but little endian happens to mask it (this issue resembles an issue with substrings where an index is cut short somewhere). I've seen unioned types with mismatch sizes create similar issues like this, where the compiler lobs off the wrong bytes when the union'd type gets aliased.
when i read the gentoo bug report a couple days ago, it appeared the issue might also happen on x86_64, fwiw. anyone tested that already ?
I have an x64 machine and have not been able to reproduce it on there, so it's almost certainly endianness related. I don't know if it's a compiler bug or not, but I've observed when you have mismatched union types, the code that gets generated when casted to the larger type seems to drop the wrong bytes. One such example I can point to directly:
https://github.com/sm64pc/sm64ex/issues/426
I could see this interpreting a 32 bit integer as 0 by taking the wrong half of the word.
in that case the best approach might be if you try to git bisect the ragel commits between 6.10 and master on your machine, looking at the specific commit that introduced the issue would probably allow an educated guess where the culprit is. what makes this a lot harder than it should be is that ragel depends on colm which is in a separate repo.
an alternative might be to try to get an adelie linux iso for ppc64be and boot it in qemu. (adelie is the only distro i know that has binary builds for ppc64be).
there's also the GCC buildfarm developers can get access to which has a huge variety of build hosts.
Not sure it helps or not but in Gentoo the colm ebuild is separated from ragel, and is fixed in both builds to version 0.13.0.7. I should be able to keep that variable constant.
Ok, so I did a bisect (for some of the commits the tree was in a very strange state, seemingly just wiki documentation), but I finally arrived at what I think is the troublesome commit:
9a66576860ef04570cfba6099c8a1a10b8238114
I should also add that's the first revision where the result for that line became uncompilable. I noticed some form of string corruption in form of the spacing around the equals disappear in a revision before that one (which maybe is the real source of the bug). Maybe after work I could bisect to find when that started occurring.
Hi @KungFuJesus thank you for the investigation! A lot of the time people don't follow up on these issues, so I was happy to see some activity here.
And yup, the ragel codebase has gone through some upheaval between 6 and 7. It's a bit of a long story :)
So that commit is the one that defaults to the colm-based frontend. I'm suspecting the bug is purely in colm. Would you be able to run the colm test suite on master? Depending on the result of that, I should also be able to craft some colm programs that exercise the bug. But want to see the test results first. I can also imagine some hacks to ragel that might reveal the bug, but need to leave that for tomorrow. (I'm PST).
cd test/colm.d && make && ../runtests
Hmm, colm.d directory appears to be missing under test.
Oh it's in the colm repo, I'll check that out. I'm using the distro provided ebuild for that, but I'm sure I can grab the exact version from a tag and try out the test.
Except...the one version that the distro provides (0.13.0.7) seems to have no tag...hmm. I'll try the version just before it and hope they are close enough.
Ok, so there wasn't a colm.d directory but I did do runtests, attached is the output.
Here's the output from the master branch, with surprisingly one segfault: colmout-master.log
Here are the resulting diffs for those: diffs.tar.gz
Looks as though the segfault is due to a linked list corruption that happens prior to a "print_kid" call:
(gdb) r < tcontext1-0.in > tcontext1.out
Starting program: /home/adam/colm/test/colm.d/working/tcontext1 < tcontext1-0.in > tcontext1.out
Program received signal SIGSEGV, Segmentation fault.
0x0000000010030730 in print_kid (prg=0x100632a0, sp=0x10073450, print_args=0x3fffffffc618, kid=0x3fffffffc520) at print.c:325
325 ignore = ignore->next;
(gdb) bt
#0 0x0000000010030730 in print_kid (prg=0x100632a0, sp=0x10073450, print_args=0x3fffffffc618, kid=0x3fffffffc520) at print.c:325
#1 0x0000000010030e78 in colm_print_tree_args (prg=0x100632a0, sp=0x10073448, print_args=0x3fffffffc618, tree=0x3ffff7cd5178) at print.c:437
#2 0x00000000100326dc in colm_print_tree_collect_xml (prg=0x100632a0, sp=0x10073448, collect=0x3fffffffc710, tree=0x3ffff7cd5178, trim=0) at print.c:762
#3 0x0000000010015a50 in tree_to_str_xml (prg=0x100632a0, sp=0x10073448, tree=0x3ffff7cd5178, trim=0, attrs=0) at bytecode.c:161
#4 0x000000001001d040 in colm_execute_code (prg=0x100632a0, exec=0x3fffffffe938, sp=0x10073448, instr=0x1006080e <parser_rootCode+46> "\243\001\276U\265\021") at bytecode.c:1670
#5 0x0000000010017874 in colm_execute (prg=0x100632a0, exec=0x3fffffffe938, code=0x100607e0 <parser_rootCode> "\377\b\002") at bytecode.c:582
#6 0x0000000010002f9c in colm_run_program2 (prg=0x100632a0, argc=1, argv=0x3fffffffef28, argl=0x0) at program.c:222
#7 0x0000000010003038 in colm_run_program (prg=0x100632a0, argc=1, argv=0x3fffffffef28) at program.c:231
#8 0x000000001000265c in main (argc=1, argv=0x3fffffffef28) at working/tcontext1.c:793
(gdb) p ignore
$1 = (kid_t *) 0x300000000
(gdb) p *ignore
Cannot access memory at address 0x300000000
(gdb) p ignore->next
Cannot access memory at address 0x300000008
(gdb) up
#1 0x0000000010030e78 in colm_print_tree_args (prg=0x100632a0, sp=0x10073448, print_args=0x3fffffffc618, tree=0x3ffff7cd5178) at print.c:437
437 print_kid( prg, sp, print_args, &kid );
(gdb) l
432 term.next = 0;
433
434 kid.tree = tree;
435 kid.next = &term;
436
437 print_kid( prg, sp, print_args, &kid );
438 }
439 }
440
441 void colm_print_null( program_t *prg, tree_t **sp,
(gdb) p kid
$2 = {tree = 0x3ffff7cd5178, next = 0x3fffffffc530}
(gdb) p kid->next
$3 = (struct colm_kid *) 0x3fffffffc530
(gdb) p kid->next->next
$4 = (struct colm_kid *) 0x0
Oh looks like something is badly broken somewhere. I'm suspecting there is something in the parsing that isn't friendly to big-endian. It's resulting in corrupt trees that sometimes don't print properly and sometimes segfault.
I'm wondering if there are other big-endian machines we can try this on. I used to use university machines to do that testing. Fastest way to get this fixed is for me to get on a machine where it exhibits.
Well...to get a little bit more diversity, perhaps, I can give you an account to a Solaris 11 based SPARC-T4 I have running. However, it's quite loud and has the added complexity that usually ensues when trying to build packages on Solaris, no matter how few dependencies (usually some frustration in autotools/automake/autoconf + GCC). Perhaps I can try to see if it builds on it, first, then I can get you a shell on it
I'm wondering if there are other big-endian machines we can try this on.
as mentioned there's GCC compile farm. https://gcc.gnu.org/wiki/CompileFarm
but you could also use a cheap mips device like a router, or qemu-system-mips running aboriginal linux system images, for example.
I'll register with the compile farm, thanks for the tip!
As for the MIPS route, they may not produce all endianness symptoms with big endian, as I believe a lot of embedded stuff is 32 bit. The linked issue above from sm64ex does not effect ppc32 big endian (not that I think we're seeing that issue here, just an example).
Just got access to the GCC compile farm and the problem exhibits on both SPARC and PPC64, exactly as you have shown above in the testcase output log.
@KungFuJesus @rofl0r Fixed the bug over in colm, and I confirmed that it fixes the issue generating the harfbuzz source. I will release a new version of ragel 7.0.3 that expects the version of colm with the fix. Thank you both for your work reporting and investigating this!
for the record, this is the commit fixing it: https://github.com/adrian-thurston/colm/commit/472a89059b9bec9f40f54d7ed5c312786c69198c
Ahh, so a similar class of issue to the automatic type casting and aliasing with non-uniform union members. Basically it's lobbing off the wrong half of the word.
I do sort of wonder if this is a gray area in the C specification or if this a very platform specific behavior defined by the ABI (similar to how all characters are assumed to be unsigned rather than signed on POWER).
Thank you for the work here, all. FWIW, I'm happy (on behalf of Gentoo) to try facilitate access to exotic architectures if it's needed for upstreams!
I do sort of wonder if this is a gray area in the C specification or if this a very platform specific behavior defined by the ABI (similar to how all characters are assumed to be unsigned rather than signed on POWER).
this was just plain and simple UB (undefined behaviour). actually the current typecast the fix uses smells like UB too... but then i didn't read the actual type definitions.
@rofl0r there is a problem if we try to push/pop something bigger in size than (void*). Perhaps an debug-mode assert somewhere would be nice defensive move.
So it appears that 6.10 works fine, but 7.0.0.12 seems to drop a "= 1" after a variable declaration. Please see this gentoo bug report:
https://bugs.gentoo.org/763621