Open johnsonjh opened 5 months ago
So the cause of the problem is that `out=0xffffffffffff8930` is not a valid address. While doing some debugging, it seems this was getting passed in to `_sir_msec_since`, so the problem was in `_sir_once`.
It definitely looks like a compiler bug. I made a simple example based on your code:
static _sir_thread_local sir_time _sir_last_thrd_chk = {0};
void _sir_msec_since(sir_time* in, sir_time* out);

int main() {
    sir_time thrd_chk;
    _sir_msec_since(&_sir_last_thrd_chk, &thrd_chk);
    return 0;
}
When compiling on AIX with GCC 8, I get:
mr 3,10
mr 4,9
bla __tls_get_addr
mr 9,3
mr 10,9
addi 9,31,112 # r31 is saved r1
mr 4,9
mr 3,10
bl ._sir_msec_since
When compiling on PASE with GCC 10, I get:
mr 3,10
mr 4,9
bla __tls_get_addr
mr 9,3
mr 3,9
bl ._sir_msec_since
So there are a few things wrong with the PASE output. First and most importantly, it doesn't set up r4, which would be the `out` parameter. Secondly, the `mr 9,3` / `mr 3,9` pair looks messed up, as it essentially does nothing.
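A check for this kind of miscompile that reports rather than crashes could have the callee record the pointers it receives and compare them afterwards. The sketch below is a stand-in, not libsir code: `demo_time`, `record_args`, and `tls_call_args_intact` are hypothetical names, C11 `_Thread_local` is assumed in place of the `_sir_thread_local` macro, and `__attribute__((noinline))` is GCC/Clang-specific.

```c
/* Hypothetical stand-in for libsir's sir_time; the real layout is not assumed. */
typedef struct { long sec; long msec; } demo_time;

/* Thread-local "in" argument, mirroring _sir_last_thrd_chk in the example. */
static _Thread_local demo_time last_chk = {0, 0};

/* Pointers as the callee actually received them. */
static demo_time *seen_in;
static demo_time *seen_out;

/* noinline keeps this a real call, so both arguments travel in registers.
   The pointers are only recorded, never dereferenced, so a bad `out`
   is reported instead of causing a crash. */
__attribute__((noinline))
static void record_args(demo_time *in, demo_time *out) {
    seen_in = in;
    seen_out = out;
}

/* Returns 1 if both pointers arrived intact, 0 otherwise. */
int tls_call_args_intact(void) {
    demo_time local = {0, 0};
    record_args(&last_chk, &local);
    return seen_in == &last_chk && seen_out == &local;
}
```

Built with a correct compiler, `tls_call_args_intact()` returns 1. With the behavior described above, where r4 is never set up after the TLS lookup, it would be expected to return 0 at `-O0`, though that has not been verified against the affected toolchain.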
Looking at the AIX GCC specs, this patch might be relevant: https://public.dhe.ibm.com/aix/freeSoftware/aixtoolbox/PATCHES/gcc-8.4.0-gcc-WORKAROUND-TLS-TOC-constants.patch
It looks like this was fixed upstream in https://github.com/gcc-mirror/gcc/commit/a21b399708175f6fc0ac723a0cebc127da421c60
Looking at the assembly output more closely, it seems that in the PASE generated code, the r4 value gets generated much earlier:
mr 31,1
addi 9,31,112
mr 4,9
ld 10,LCM..0(2)
ld 9,LC..0(2)
mr 3,10
mr 4,9
bla __tls_get_addr
So the problem is that it's getting stomped by the call to `__tls_get_addr`, and the patches make changes to avoid that.
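To make the register flow easier to follow, here is a toy replay of the two instruction sequences in C. It is only a model under stated assumptions: `demo_regs`, `demo_tls_get_addr`, `demo_pase_sequence`, and `demo_aix_sequence` are hypothetical names, the loads preceding the AIX listing are assumed to match the PASE ones, and the stand-in for `__tls_get_addr` leaves r4 untouched, whereas the real routine is free to clobber it.

```c
#include <stdint.h>

/* Toy register file: only the registers that appear in the listings. */
typedef struct { uint64_t r3, r4, r9, r10; } demo_regs;

/* Stand-in for __tls_get_addr: returns the TLS address in r3. */
static void demo_tls_get_addr(demo_regs *r, uint64_t tls_addr) {
    r->r3 = tls_addr;
}

/* Replays the PASE GCC 10 sequence; returns r4 as the callee would see it. */
uint64_t demo_pase_sequence(uint64_t local_addr, uint64_t module,
                            uint64_t offset, uint64_t tls_addr) {
    demo_regs r = {0};
    r.r9 = local_addr;               /* addi 9,31,112 */
    r.r4 = r.r9;                     /* mr 4,9 : out parameter set up early */
    r.r10 = module;                  /* ld 10,LCM..0(2) */
    r.r9 = offset;                   /* ld 9,LC..0(2) */
    r.r3 = r.r10;                    /* mr 3,10 */
    r.r4 = r.r9;                     /* mr 4,9 : overwrites the out parameter */
    demo_tls_get_addr(&r, tls_addr); /* bla __tls_get_addr */
    r.r9 = r.r3;                     /* mr 9,3 */
    r.r3 = r.r9;                     /* mr 3,9 */
    return r.r4;                     /* bl ._sir_msec_since */
}

/* Replays the AIX GCC 8 sequence; returns r4 as the callee would see it. */
uint64_t demo_aix_sequence(uint64_t local_addr, uint64_t module,
                           uint64_t offset, uint64_t tls_addr) {
    demo_regs r = {0};
    r.r10 = module;                  /* assumed: same loads as on PASE */
    r.r9 = offset;
    r.r3 = r.r10;                    /* mr 3,10 */
    r.r4 = r.r9;                     /* mr 4,9 */
    demo_tls_get_addr(&r, tls_addr); /* bla __tls_get_addr */
    r.r9 = r.r3;                     /* mr 9,3 */
    r.r10 = r.r9;                    /* mr 10,9 */
    r.r9 = local_addr;               /* addi 9,31,112 */
    r.r4 = r.r9;                     /* mr 4,9 : out set up after the TLS call */
    r.r3 = r.r10;                    /* mr 3,10 */
    return r.r4;                     /* bl ._sir_msec_since */
}
```

In this model the PASE sequence hands the callee the TLS offset in r4 instead of the local's address, while the AIX sequence hands it the local's address because r4 is loaded after the TLS call.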
FYI GCC 10 with a patch for this issue and updated to 10.5 has been uploaded to our repo.
Great. Once I can test the update I'll close the issue. Thanks for quickly investigating and getting it resolved.
I see the source packages now (https://public.dhe.ibm.com/software/ibmi/products/pase/rpms/repo-base-7.3/src/gcc10-10.5.0-1.src.rpm) but not the binary builds yet - I assume they're coming soon.
I'll see if I can build myself in the meantime, but I may have to wait for the built RPMs to surface before I can test.
@kadler
I was able to test your updated GCC 10.5.0 package.
Since I didn't see the binary RPM packages available yet, I built it myself from your SRPM sources, and tested all previously failing libsir cases. Everything is now working successfully!
I did find one minor annoyance possibly worth mentioning.
The `-Wpedantic` flag causes a warning regarding the use of the `#include_next` GNU extension for most files compiled, due to the usage of this extension in the GCC-generated fixed includes. The warning, while technically correct, is annoying and precludes the use of `-Wpedantic`. This warning doesn't seem to be triggered with 10.3.0-12, so it might be considered a regression.
This might also be my fault, as I built from your SRPM's source and patches (rather than via rpmbuild using the spec) and might have missed something. If it isn't, it might be a good idea to add a patch that suppresses this warning even when `-pedantic` or `-Wpedantic` is enabled. This could trip someone up if they have build recipes enabling `pedantic` in combination with `-Werror`.
Other than this, everything seems great. Thanks!
@kadler Here's the bug report as requested. Unfortunately, I've not been able to narrow down a smaller reproducer, but here goes anyway. If I can find a smaller reproducer, I'll update this issue.
I have a program (the test program for a library) that is crashing under PASE for i 7.5 - and only under PASE (not AIX), and only when it's built without optimization, and even then, only when it's built with IBM's GCC 10.3.0 under PASE for i.
This particular code has been tested in the same way without any problems on 19 other operating systems/runtimes - 1) IBM AIX 7.2, 2) IBM AIX 7.3, 3) Linux/glibc, 4) Linux/musl, 5) Linux/uClibc-ng, 6) Android, 7) macOS, 8) Windows, 9) Cygwin, 10) FreeBSD, 11) NetBSD, 12) OpenBSD, 13) DragonFly BSD, 14) GNU/Hurd, 15) Haiku, 16) illumos, 17) Solaris, 18) SerenityOS, and 19) WASM under Node.js - all without any problems whatsoever.
I'm also quite confident this is not a bug in our code. We have a clean bill of health from Valgrind, PurifyPlus, MSVC's Static Analyzer, SonarCloud's SonarLint, Oracle Lint, GCC's Static Analyzer, Cppcheck, Coverity, and PVS-Studio which have all tested this code path extensively.
I should also note, for this case (`-O0` + `DEBUG` + `SELFLOG`):
- Using my own Linux-based GCC cross-compilation toolchain to target AIX (based on GCC 12.2.1 (20221104) and AIX 7.2 TL5) results in binaries that do not crash on AIX.
- Using my own Linux-based GCC cross-compilation toolchain to target PASE for i (based on GCC 12.2.1 (20221104) and AIX 7.2 TL5, defining `__PASE__` in the preprocessor) results in binaries that do not crash on PASE for i 7.5.
- Using IBM's GCC 10.3.0 (10.3.0-6) on IBM AIX 7.2 (7200-05-07-2346) results in binaries that do not crash on AIX.
- Using other compilers including GCC 8, GCC 11, IBM XL C/C++ V16, and Open XL C/C++ V17 on IBM AIX 7.2 and IBM AIX 7.3 results in binaries that do not crash on AIX.
- Using IBM's GCC 10.3.0 (10.3.0-6) on IBM AIX 7.2 (7200-05-07-2346), defining `__PASE__`, results in binaries that do not crash when copied to PASE for i 7.5.
- Using IBM's GCC 8.3.0 (8.3.0-6) on IBM AIX 7.2 (7200-05-07-2346), defining `__PASE__`, results in binaries that do not crash when copied to PASE for i 7.5.
- Using IBM XL C V16.1 (16.1.0.15) on IBM AIX 7.2 (7200-05-07-2346), defining `__PASE__`, results in binaries that do not crash when copied to PASE for i 7.5.
- Using IBM Open XL C V17.1 (17.1.2.1, clang 17.0.5, build 1c995dd) on IBM AIX 7.2 (7200-05-07-2346), defining `__PASE__`, results in binaries that do not crash when copied to PASE for i 7.5.
- Binaries built under IBM PASE for i 7.5 using IBM's GCC 10.3.0 (20210408, IBM 10.3.0-12, IBM i) and copied back to the IBM AIX 7.2 TL5 SP7 system do crash, the same as they do under PASE.
So, after all this testing, I believe this must be a compiler bug in IBM's GCC 10.3.0-12 on IBM i.
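Several of the builds above define `__PASE__` by hand so that AIX-built binaries take the PASE code paths. As a rough illustration of what that definition changes at compile time, a source file might branch on it as sketched below. `demo_target_name` is a hypothetical helper, not part of libsir; `_AIX` is the macro AIX compilers predefine, and whether libsir tests these macros in exactly this order is an assumption.

```c
/* Hypothetical helper: reports which target the translation unit was
   compiled for, based only on preprocessor macros. */
const char *demo_target_name(void) {
#if defined(__PASE__)
    /* Defined by the PASE toolchain, or passed by hand (-D__PASE__)
       when cross-building on AIX or Linux. */
    return "PASE for i";
#elif defined(_AIX)
    return "AIX";
#else
    return "other";
#endif
}
```

Because the choice is made purely in the preprocessor, a binary built on AIX with `-D__PASE__` carries the PASE code paths even though it was produced by an AIX compiler, which is what makes the cross-checks above meaningful.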
Steps to reproduce (under PASE for i 7.5):
git clone https://github.com/aremmell/libsir.git
cd libsir
git checkout 11bc621245039e4b5ba20b9bd8a964e36f5f5162 (for consistency, keeping to a known commit)
env CC=gcc-10 gmake SIR_DEBUG=1 SIR_SELFLOG=1 DBGFLAGS="-O0 -g2"
./build/bin/sirtests
Result:
GDB backtrace:
Other observations:
- Building this configuration under PASE for i 7.5 with IBM's GCC 10.3.0-12, but with any optimizations enabled, e.g., `env CC=gcc-10 gmake SIR_DEBUG=1 SIR_SELFLOG=1 DBGFLAGS="-Og -g2"`, that is, with `-Og` or `-O1` (instead of `-O0`), results in code that does not crash.
- The crash only happens with `-O0` when building with both `SIR_DEBUG=1` and `SIR_SELFLOG=1` enabled. Building other flavors like `DEBUG` alone or `SELFLOG` alone does not trigger the crash.
- I've used PUB400, which is running IBM i 7.5 and has GCC 10.3.0 installed, and reproduced this there as well.