adrian-thurston / ragel

Ragel State Machine Compiler
MIT License
532 stars 46 forks source link

Ragel 7.0.0.12 fails to generate valid code for harbuzz 2.7.4 on PPC64 (big endian) #61

Closed KungFuJesus closed 3 years ago

KungFuJesus commented 3 years ago

So it appears that 6.10 works fine, but 7.0.0.12 seems to drop a "= 1" after a variable declaration. Please see this gentoo bug report:

https://bugs.gentoo.org/763621

adrian-thurston commented 3 years ago

Thanks @KungFuJesus. I'm not quite sure how to debug this, without access to a ppc64 machine. Does gentoo have any common ones for this purpose?

KungFuJesus commented 3 years ago

I'm not quite sure if they do. Adding to this complexity is the fact that the PPC64 machine I'm using is explicitly big endian, whereas a lot of the modern variants of PPC64 are bi-endian and tend to prefer little endian.

Given that the code generator is just vanilla C and not emitting anything like inline assembly, I'd say the issue is likely endianness. If you need to throw me a patch to test, I should able to. It might even be possible that the bug is endian independent but little endian happens to mask it (this issue resembles an issue with substrings where an index is cut short somewhere). I've seen unioned types with mismatch sizes create similar issues like this, where the compiler lobs off the wrong bytes when the union'd type gets aliased.

rofl0r commented 3 years ago

when i read the gentoo bug report a couple days ago, it appeared the issue might also happen on x86_64, fwiw. anyone tested that already ?

KungFuJesus commented 3 years ago

I have an x64 machine and have not been able to reproduce it on there, so it's almost certainly endianness related. I don't know if it's a compiler bug or not, but I've observed when you have mismatched union types, the code that gets generated when casted to the larger type seems to drop the wrong bytes. One such example I can point to directly:

https://github.com/sm64pc/sm64ex/issues/426

I could see this interpreting a 32 bit integer as 0 by taking the wrong half of the word.

rofl0r commented 3 years ago

in that case the best approach might be if you try to git bisect the ragel commits between 6.10 and master on your machine, looking at the specific commit that introduced the issue would probably allow an educated guess where the culprit is. what makes this a lot harder than it should be is that ragel depends on colm which is in a separate repo.

an alternative might be to try to get an adelie linux iso for ppc64be and boot it in qemu. (adelie is the only distro i know that has binary builds for ppc64be).

there's also the GCC buildfarm developers can get access to which has a huge variety of build hosts.

KungFuJesus commented 3 years ago

Not sure it helps or not but in Gentoo the colm ebuild is separated from ragel, and is fixed in both builds to version 0.13.0.7. I should be able to keep that variable constant.

KungFuJesus commented 3 years ago

Ok, so I did a bisect (for some of the commits the tree was in a very strange state, seemingly just wiki documentation), but I finally arrived at what I think is the troublesome commit:

9a66576860ef04570cfba6099c8a1a10b8238114

KungFuJesus commented 3 years ago

I should also add that's the first revision where the result for that line became uncompilable. I noticed some form of string corruption in form of the spacing around the equals disappear in a revision before that one (which maybe is the real source of the bug). Maybe after work I could bisect to find when that started occurring.

adrian-thurston commented 3 years ago

Hi @KungFuJesus thank you for the investigation! A lot of the time people don't follow up on these issues, so I was happy to see some activity here.

And yup, the ragel codebase has gone through some upheaval between 6 and 7. It's a bit of a long story :)

So that commit is the one that defaults to the colm-based frontend. I'm suspecting the bug is purely in colm. Would you be able to run the colm test suite on master? Depending on the result of that, I should also be able to craft some colm programs that exercise the bug. But want to see the test results first. I can also imagine some hacks to ragel that might reveal the bug, but need to leave that for tomorrow. (I'm PST).

adrian-thurston commented 3 years ago

cd test/colm.d && make && ../runtests

KungFuJesus commented 3 years ago

Hmm, colm.d directory appears to be missing under test.

KungFuJesus commented 3 years ago

Oh it's in the colm repo, I'll check that out. I'm using the distro provided ebuild for that, but I'm sure I can grab the exact version from a tag and try out the test.

Except...the one version that the distro provides (0.13.0.7) seems to have no tag...hmm. I'll try the version just before it and hope they are close enough.

KungFuJesus commented 3 years ago

Ok, so there wasn't a colm.d directory but I did do runtests, attached is the output.

colmout.log

KungFuJesus commented 3 years ago

Here's the output from the master branch, with surprisingly one segfault: colmout-master.log

Here are the resulting diffs for those: diffs.tar.gz

KungFuJesus commented 3 years ago

Looks as though the segfault is due to a linked list corruption that happens prior to a "print_kid" call:

(gdb) r < tcontext1-0.in > tcontext1.out
Starting program: /home/adam/colm/test/colm.d/working/tcontext1 < tcontext1-0.in > tcontext1.out

Program received signal SIGSEGV, Segmentation fault.
0x0000000010030730 in print_kid (prg=0x100632a0, sp=0x10073450, print_args=0x3fffffffc618, kid=0x3fffffffc520) at print.c:325
325                     ignore = ignore->next;
(gdb) bt
#0  0x0000000010030730 in print_kid (prg=0x100632a0, sp=0x10073450, print_args=0x3fffffffc618, kid=0x3fffffffc520) at print.c:325
#1  0x0000000010030e78 in colm_print_tree_args (prg=0x100632a0, sp=0x10073448, print_args=0x3fffffffc618, tree=0x3ffff7cd5178) at print.c:437
#2  0x00000000100326dc in colm_print_tree_collect_xml (prg=0x100632a0, sp=0x10073448, collect=0x3fffffffc710, tree=0x3ffff7cd5178, trim=0) at print.c:762
#3  0x0000000010015a50 in tree_to_str_xml (prg=0x100632a0, sp=0x10073448, tree=0x3ffff7cd5178, trim=0, attrs=0) at bytecode.c:161
#4  0x000000001001d040 in colm_execute_code (prg=0x100632a0, exec=0x3fffffffe938, sp=0x10073448, instr=0x1006080e <parser_rootCode+46> "\243\001\276U\265\021") at bytecode.c:1670
#5  0x0000000010017874 in colm_execute (prg=0x100632a0, exec=0x3fffffffe938, code=0x100607e0 <parser_rootCode> "\377\b\002") at bytecode.c:582
#6  0x0000000010002f9c in colm_run_program2 (prg=0x100632a0, argc=1, argv=0x3fffffffef28, argl=0x0) at program.c:222
#7  0x0000000010003038 in colm_run_program (prg=0x100632a0, argc=1, argv=0x3fffffffef28) at program.c:231
#8  0x000000001000265c in main (argc=1, argv=0x3fffffffef28) at working/tcontext1.c:793
(gdb) p ignore
$1 = (kid_t *) 0x300000000
(gdb) p *ignore
Cannot access memory at address 0x300000000
(gdb) p ignore->next
Cannot access memory at address 0x300000008
(gdb) up
#1  0x0000000010030e78 in colm_print_tree_args (prg=0x100632a0, sp=0x10073448, print_args=0x3fffffffc618, tree=0x3ffff7cd5178) at print.c:437
437         print_kid( prg, sp, print_args, &kid );
(gdb) l
432         term.next = 0;
433 
434         kid.tree = tree;
435         kid.next = &term;
436 
437         print_kid( prg, sp, print_args, &kid );
438     }
439 }
440 
441 void colm_print_null( program_t *prg, tree_t **sp,
(gdb) p kid
$2 = {tree = 0x3ffff7cd5178, next = 0x3fffffffc530}
(gdb) p kid->next
$3 = (struct colm_kid *) 0x3fffffffc530
(gdb) p kid->next->next
$4 = (struct colm_kid *) 0x0
adrian-thurston commented 3 years ago

Oh looks like something is badly broken somewhere. I'm suspecting there is something in the parsing that isn't friendly to big-endian. It's resulting in corrupt trees that sometimes don't print properly and sometimes segfault.

I'm wondering if there are other big-endian machines we can try this on. I used to use university machines to do that testing. Fastest way to get this fixed is for me to get on a machine where it exhibits.

KungFuJesus commented 3 years ago

Well...to get a little bit more diversity, perhaps, I can give you an account to a Solaris 11 based SPARC-T4 I have running. However, it's quite loud and has the added complexity that usually ensues when trying to build packages on Solaris, no matter how few dependencies (usually some frustration in autotools/automake/autoconf + GCC). Perhaps I can try to see if it builds on it, first, then I can get you a shell on it

rofl0r commented 3 years ago

I'm wondering if there are other big-endian machines we can try this on.

as mentioned there's GCC compile farm. https://gcc.gnu.org/wiki/CompileFarm

but you could also use a cheap mips device like a router, or qemu-system-mips running aboriginal linux system images, for example.

adrian-thurston commented 3 years ago

I'll register with the compile farm, thanks for the tip!

KungFuJesus commented 3 years ago

As for the MIPS route, they may not produce all endianness symptoms with big endian, as I believe a lot of embedded stuff is 32 bit. The linked issue above from sm64ex does not effect ppc32 big endian (not that I think we're seeing that issue here, just an example).

adrian-thurston commented 3 years ago

Just got access to the GCC compile farm and the problem exhibits on both SPARC and PPC64, exactly as you have shown above in the testcase output log.

adrian-thurston commented 3 years ago

@KungFuJesus @rofl0r Fixed the bug over in colm, and I confirmed that it fixes the issue generating the harfbuzz source. I will release a new version of ragel 7.0.3 that expects the version of colm with the fix. Thank you both for your work reporting and investigating this!

rofl0r commented 3 years ago

for the record, this is the commit fixing it: https://github.com/adrian-thurston/colm/commit/472a89059b9bec9f40f54d7ed5c312786c69198c

KungFuJesus commented 3 years ago

Ahh, so a similar class of issue to the automatic type casting and aliasing with non-uniform union members. Basically it's lobbing off the wrong half of the word.

I do sort of wonder if this is a gray area in the C specification or if this a very platform specific behavior defined by the ABI (similar to how all characters are assumed to be unsigned rather than signed on POWER).

thesamesam commented 3 years ago

Thank you for the work here, all. FWIW, I'm happy (on behalf of Gentoo) to try facilitate access to exotic architectures if it's needed for upstreams!

rofl0r commented 3 years ago

I do sort of wonder if this is a gray area in the C specification or if this a very platform specific behavior defined by the ABI (similar to how all characters are assumed to be unsigned rather than signed on POWER).

this was just plain and simple UB (undefined behaviour). actually the current typecast the fix uses smells like UB too... but then i didn't read the actual type definitions.

adrian-thurston commented 3 years ago

@rofl0r there is a problem if we try to push/pop something bigger in size than (void*). Perhaps an debug-mode assert somewhere would be nice defensive move.