IBM / ibmi-oss-issues

Important resources for anyone interested in open source on IBM i
Creative Commons Zero v1.0 Universal
13 stars 0 forks source link

Possible miscompilation with gcc version 10.3.0 (20210408, IBM 10.3.0-12) (IBM i) on PASE for IBM i 7.5 #54

Open johnsonjh opened 5 months ago

johnsonjh commented 5 months ago

@kadler Here's the bug report as requested. Unfortunately, I've not been able to narrow down a smaller reproducer, but here goes anyway. If I can find a smaller reproducer, I'll update this issue.

I have a program (the test program for a library) that is crashing under PASE for i 7.5 - and only under PASE (not AIX), and only when it's built without optimization, and even then, only when it's built with IBM's GCC 10.3.0 under PASE for i.

This particular code has been tested in the same way without any problems on 19 other operating systems/runtimes - 1) IBM AIX 7.2, 2) IBM AIX 7.3, 3) Linux/glibc, 4) Linux/musl, 5) Linux/uClibc-ng, 6) Android, 7) macOS, 8) Windows, 9) Cygwin, 10) FreeBSD, 11) NetBSD, 12) OpenBSD, 13) DragonFly BSD, 14) GNU/Hurd, 15) Haiku, 16) illumos, 17) Solaris, 18) SerenityOS, and 19) WASM under Node.js - all without any problems whatsoever.

I'm also quite confident this is not a bug in our code. We have a clean bill of health from Valgrind, PurifyPlus, MSVC's Static Analyzer, SonarCloud's SonarLint, Oracle Lint, GCC's Static Analyzer, Cppcheck, Coverity, and PVS-Studio which have all tested this code path extensively.

I should also note, for this case case (-O0 + DEBUG + SELFLOG):

So, after all this testing, I believe this must be a compiler bug in the IBM's GCC 10.3.0-12 on IBM i.

Steps to reproduce (under PASE for i 7.5):

  1. git clone https://github.com/aremmell/libsir.git
  2. cd libsir
  3. git checkout 11bc621245039e4b5ba20b9bd8a964e36f5f5162 (for consistency, keeping to a known commit)
  4. env CC=gcc-10 gmake SIR_DEBUG=1 SIR_SELFLOG=1 DBGFLAGS="-O0 -g2"
  5. ./build/bin/sirtests

    Result:

    libsir 2.2.5-dev (prerelease) running 36 tests...
    
    (1/36) 'thread-race'...
    
    _sir_setcolormode (src/sirtextstyle.c:212): skipped superfluous update of color mode: 0
    _sir_syslog_reset (src/sirinternal.c:1145): state reset; mask was 00000000
    _sir_init (src/sirinternal.c:200): initialized successfully
          waiting for thread 1/4...
    _sir_fcache_add (src/sirfilecache.c:425): adding file (path: './logs/multi-thread-race-0.log', id: 95bee59); count = 1
          hi, i'm thread (id: 258), logging to: './logs/multi-thread-race-0.log'...
    _sir_logv (src/sirinternal.c:556): updating hostname...
    _sir_fcache_pred_path (src/sirfilecache.c:494): comparing './logs/multi-thread-race-1.log' == './logs/multi-thread-race-0.log'
    _sir_logv (src/sirinternal.c:561): hostname: '[redacted]'
    Segmentation fault (core dumped)

    GDB backtrace:

    Program terminated with signal SIGSEGV, Segmentation fault.
    #0  0x00000001000159f0 in _sir_msec_since (when=0x18004f130, out=0xffffffffffff8930) at src/sirinternal.c:1226
    1226        out->sec = 0;
    (gdb) bt
    #0  0x00000001000159f0 in _sir_msec_since (when=0x18004f130, out=0xffffffffffff8930) at src/sirinternal.c:1226
    #1  0x0000000100013844 in _sir_once (level=128, format=0x100028be0 <_GLOBAL__FD_sirtests+14160> "log message #%zu", 
      args=0x18004e718 "") at src/sirinternal.c:582
    #2  0x0000000100021a34 in _sir_makergb (format=0x100028be0 <_GLOBAL__FD_sirtests+14160> "log message #%zu") at src/sir.c:71
    #3  0x000000010000c178 in threadrace_thread (arg=0x1800195b0) at tests/tests.c:2319
    #4  0x090000000032805c in _pthread_body () from /usr/lib/libpthreads.a(shr_xpg5_64.o)
    #5  0x0000000000000000 in ?? ()

Other observations:

kadler commented 4 months ago

So the cause of the problem is that out=0xffffffffffff8930 is not a valid address. While doing some debugging, it seems this was getting passed in to the _sir_msec_since so the problem was in _sir_once.

It definitely looks like a compiler bug. I made a simple example based on your code:

static _sir_thread_local sir_time _sir_last_thrd_chk = {0};

void _sir_msec_since(sir_time* in, sir_time* out);

int main() {
    sir_time thrd_chk;
    _sir_msec_since(&_sir_last_thrd_chk, &thrd_chk);

When compiling on AIX with GCC 8, I get:

        mr 3,10
        mr 4,9
        bla __tls_get_addr
        mr 9,3
        mr 10,9
        addi 9,31,112 # r31 is saved r1
        mr 4,9
        mr 3,10
        bl ._sir_msec_since

When compiling on PASE with GCC 10, I get:

        mr 3,10
        mr 4,9
        bla __tls_get_addr
        mr 9,3
        mr 3,9
        bl ._sir_msec_since

So there's a few things wrong with the PASE output. First and most importantly, it doesn't set up r4 which would be the out parameter. Secondly, the mr 9, 3 mr 3, 9 looks messed up as it essentially does nothing.

Looking at AIX GCC specs this patch might be relevant: https://public.dhe.ibm.com/aix/freeSoftware/aixtoolbox/PATCHES/gcc-8.4.0-gcc-WORKAROUND-TLS-TOC-constants.patch

And looks like it was fixed upstream in https://github.com/gcc-mirror/gcc/commit/a21b399708175f6fc0ac723a0cebc127da421c60

kadler commented 4 months ago

Looking at the assembly output more closely, it seems that in the PASE generated code, the r4 value gets generated much earlier:

        mr 31,1
        addi 9,31,112
        mr 4,9
        ld 10,LCM..0(2)
        ld 9,LC..0(2)
        mr 3,10
        mr 4,9
        bla __tls_get_addr

So the problem is that it's getting stomped by the call to __tls_get_addr and the patches make changes to avoid that.

kadler commented 4 months ago

FYI GCC 10 with a patch for this issue and updated to 10.5 has been uploaded to our repo.

johnsonjh commented 4 months ago

Great. Once I can test the update I'll close the issue. Thanks for quickly investigating and getting it resolved.

johnsonjh commented 4 months ago

I see the source packages now (https://public.dhe.ibm.com/software/ibmi/products/pase/rpms/repo-base-7.3/src/gcc10-10.5.0-1.src.rpm) but not the binary builds yet - I assume they're coming soon.

I'll see if I can build myself in the meantime, but I may have to wait for the built RPMs to surface before I can test.

johnsonjh commented 4 months ago

@kadler

I was able to test your updated GCC 10.5.0 package.

Since I didn't see the binary RPM packages available yet, I built it myself from your SRPM sources, and tested all previously failing libsir cases. Everything is now working successfully!

I did find one minor annoyance possibly worth mentioning.

The -Wpedantic flag causes a warning regarding the use of the #include_next GNU extension for most files compiled, due to the usage of this extension in the GCC generated fixed-includes. The warning, while technically correct, is annoying and precludes the use of -Wpedantic. This warning doesn't seem be triggered with 10.3.0-12, so it might be considered a regression.

This might also be my fault, as I built from your SRPM's source+patches (rather than via rpmbuild using the spec and might have missed something), but if it isn't, it might be a good idea to add a patch which suppresses this warning, even when -pedantic or -Wpedantic is enabled. This could possibly trip someone up if they have build recipes enabling pedantic in combination with -Werror.

Other than this, everything seems great. Thanks!

kadler commented 4 months ago

Binary packages are available, eg. gcc10

You maybe needed to refresh your yum cache, eg,. yum clean metadata then updating.