advancetoolchain / advance-toolchain

Advance Toolchain for Linux on Power build system.
https://www.ibm.com/support/pages/advance-toolchain-linux-power
Apache License 2.0

AT fails to build on RHEL 8 #1298

Closed ThinkOpenly closed 3 years ago

ThinkOpenly commented 4 years ago

I've now tried to build on two RHEL 8.1 systems, and they both fail identically:

...
Starting the build of gcc_2...
make: *** [Makefile:1342: /home/pc/advance-toolchain/at14.0-0-alpha.redhat-8_ppc64le_ppc64le/receipts/gcc_2.b.rcpt] Error 1

at14.0-0-alpha.redhat-8_ppc64le_ppc64le/logs/_gcc_2-3_standard_buildf-06_make.log:

configure: error: in `/home/pc/advance-toolchain/at14.0-0-alpha.redhat-8_ppc64le_ppc64le/builds/gcc_2/libiberty':
configure: error: cannot run C compiled programs.
If you meant to cross compile, use `--host'.
See `config.log' for more details
checking for string.h... /opt/at-next-14.0-0-alpha/bin/gcc -E
yes
checking for sys/stat.h... make[3]: *** [Makefile:10923: configure-stage1-libiberty] Error 1

at14.0-0-alpha.redhat-8_ppc64le_ppc64le/builds/gcc_2/libiberty/config.log:

configure:3495: checking whether we are cross compiling
configure:3503: /opt/at-next-14.0-0-alpha/bin/gcc -o conftest -g  -Wl,--dynamic-linker=/opt/at-next-14.0-0-alpha/lib64/ld64.so.2  conftest.c  >&5
configure:3507: $? = 0
configure:3514: ./conftest
/home/pc/advance-toolchain/at14.0-0-alpha.redhat-8_ppc64le_ppc64le/sources/gcc/libiberty/configure: line 3516: 103800 Segmentation fault      ./conftest$ac_cv_exeext
configure:3518: $? = 139
configure:3525: error: in `/home/pc/advance-toolchain/at14.0-0-alpha.redhat-8_ppc64le_ppc64le/builds/gcc_2/libiberty':
configure:3527: error: cannot run C compiled programs.

And indeed, the newly built linker does not produce working executables:

$ echo 'int main(){}' > conftest.c
$ /opt/at-next-14.0-0-alpha/bin/gcc -o conftest -g  -Wl,--dynamic-linker=/opt/at-next-14.0-0-alpha/lib64/ld64.so.2  conftest.c
$ ./conftest
Segmentation fault (core dumped)

It seems to be caused by using the newly built loader, the newly built executable, and the system libc:

$ ldd ./conftest
        linux-vdso64.so.1 (0x00007fff85e00000)
        libc.so.6 => /lib64/power9/libc.so.6 (0x00007fff85bd0000)
        /opt/at-next-14.0-0-alpha/lib64/ld64.so.2 => /lib64/ld64.so.2 (0x00007fff85e20000)

Manually adding an rpath which finds the newly built libc instead produces a working executable:

$ /opt/at-next-14.0-0-alpha/bin/gcc -o conftest -g  -Wl,--dynamic-linker=/opt/at-next-14.0-0-alpha/lib64/ld64.so.2  conftest.c -Wl,-rpath=/opt/at-next-14.0-0-alpha/lib64
$ ldd ./conftest
        linux-vdso64.so.1 (0x00007fffbdf90000)
        libc.so.6 => /opt/at-next-14.0-0-alpha/lib64/libc.so.6 (0x00007fffbdd70000)
        /opt/at-next-14.0-0-alpha/lib64/ld64.so.2 => /lib64/ld64.so.2 (0x00007fffbdfb0000)
$ ./conftest
$ 
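A quick way to spot this kind of mix is to check whether the libc that ldd resolves lives under the toolchain prefix. The following is a minimal sketch, not part of the AT build system; the function name libc_outside_prefix is hypothetical, and it only parses ldd output given on stdin.

```shell
# Hypothetical helper: reads `ldd` output on stdin and prints the resolved
# libc.so.6 path if it is NOT located under the given prefix.
# Exit status is 0 when such a mismatch is found, 1 otherwise.
libc_outside_prefix() {
    awk -v prefix="$1" '
        $1 == "libc.so.6" && $2 == "=>" && index($3, prefix "/") != 1 {
            print $3
            found = 1
        }
        END { exit !found }
    '
}
```

Usage would look like `ldd ./conftest | libc_outside_prefix /opt/at-next-14.0-0-alpha`, which for the failing executable above should print the system libc path.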
ThinkOpenly commented 4 years ago

@rff found the same basic problem, and a different workaround: tell the loader to ignore the cache.

$ /opt/at-next-14.0-0-alpha/lib64/ld64.so.2 --inhibit-cache ./conftest
$ 

The search order is obviously the factor here.

$ LD_DEBUG=libs ./conftest.ko
     91811:     find library=libc.so.6 [0]; searching
     91811:      search cache=/opt/at-next-14.0-0-alpha/etc/ld.so.cache
     91811:       trying file=/lib64/power9/libc.so.6
     91811:
Segmentation fault (core dumped)
$ LD_DEBUG=libs ./conftest.ok
     91840:     find library=libc.so.6 [0]; searching
     91840:      search path=/opt/at-next-14.0-0-alpha/lib64/tls/power9/altivec/dfp:/opt/at-next-14.0-0-alpha/lib64/tls/power9/altivec:/opt/at-next-14.0-0-alpha/lib64/tls/power9/dfp:/opt/at-next-14.0-0-alpha/lib64/tls/power9:/opt/at-next-14.0-0-alpha/lib64/tls/altivec/dfp:/opt/at-next-14.0-0-alpha/lib64/tls/altivec:/opt/at-next-14.0-0-alpha/lib64/tls/dfp:/opt/at-next-14.0-0-alpha/lib64/tls:/opt/at-next-14.0-0-alpha/lib64/power9/altivec/dfp:/opt/at-next-14.0-0-alpha/lib64/power9/altivec:/opt/at-next-14.0-0-alpha/lib64/power9/dfp:/opt/at-next-14.0-0-alpha/lib64/power9:/opt/at-next-14.0-0-alpha/lib64/altivec/dfp:/opt/at-next-14.0-0-alpha/lib64/altivec:/opt/at-next-14.0-0-alpha/lib64/dfp:/opt/at-next-14.0-0-alpha/lib64              (system search path)
     91840:       trying file=/opt/at-next-14.0-0-alpha/lib64/tls/power9/altivec/dfp/libc.so.6
     91840:       trying file=/opt/at-next-14.0-0-alpha/lib64/tls/power9/altivec/libc.so.6
     91840:       trying file=/opt/at-next-14.0-0-alpha/lib64/tls/power9/dfp/libc.so.6
     91840:       trying file=/opt/at-next-14.0-0-alpha/lib64/tls/power9/libc.so.6
     91840:       trying file=/opt/at-next-14.0-0-alpha/lib64/tls/altivec/dfp/libc.so.6
     91840:       trying file=/opt/at-next-14.0-0-alpha/lib64/tls/altivec/libc.so.6
     91840:       trying file=/opt/at-next-14.0-0-alpha/lib64/tls/dfp/libc.so.6
     91840:       trying file=/opt/at-next-14.0-0-alpha/lib64/tls/libc.so.6
     91840:       trying file=/opt/at-next-14.0-0-alpha/lib64/power9/altivec/dfp/libc.so.6
     91840:       trying file=/opt/at-next-14.0-0-alpha/lib64/power9/altivec/libc.so.6
     91840:       trying file=/opt/at-next-14.0-0-alpha/lib64/power9/dfp/libc.so.6
     91840:       trying file=/opt/at-next-14.0-0-alpha/lib64/power9/libc.so.6
     91840:       trying file=/opt/at-next-14.0-0-alpha/lib64/altivec/dfp/libc.so.6
     91840:       trying file=/opt/at-next-14.0-0-alpha/lib64/altivec/libc.so.6
     91840:       trying file=/opt/at-next-14.0-0-alpha/lib64/dfp/libc.so.6
     91840:       trying file=/opt/at-next-14.0-0-alpha/lib64/libc.so.6
     91840:
     91840:
     91840:     calling init: /opt/at-next-14.0-0-alpha/lib64/libc.so.6
     91840:
     91840:
     91840:     initialize program: ./conftest.ok
     91840:
     91840:
     91840:     transferring control: ./conftest.ok
     91840:
     91840:
     91840:     calling fini: ./conftest.ok [0]
     91840:
$ 
ThinkOpenly commented 4 years ago

Using -rpath adds an RPATH field in the executable (edited for brevity):

$ diff <(readelf -d ./conftest) <(readelf -d ./conftest.ok)
>  0x000000000000000f (RPATH)              Library rpath: [/opt/at-next-14.0-0-alpha/lib64]

The dependency with or without -rpath is the same:

 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]

man ld.so shows the dependencies are searched:

  1. DT_RPATH, unless DT_RUNPATH present (I'm not 100% sure to which the above RPATH entry maps)
  2. LD_LIBRARY_PATH (not in use here)
  3. DT_RUNPATH
  4. cache ("Shared objects installed in hardware capability directories [...] are preferred to other shared objects.")
  5. default search path

...so the only intrinsic attribute that will override the cache is -rpath. But, why is this not a problem on other operating systems? More to come...
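On the question of which tag is in play: readelf -d prints DT_RPATH as (RPATH) and DT_RUNPATH as (RUNPATH), so the entry shown above is DT_RPATH. A small sketch for classifying an executable, assuming readelf -d output on stdin (the function name rpath_kind is made up for illustration):

```shell
# Hypothetical helper: reads `readelf -d` output on stdin and reports which
# search-path tag the loader would honor. Per ld.so(8), DT_RPATH is ignored
# when DT_RUNPATH is present, so RUNPATH is reported first.
rpath_kind() {
    awk '
        /\(RUNPATH\)/ { runpath = 1 }
        /\(RPATH\)/   { rpath = 1 }
        END {
            if (runpath)    print "DT_RUNPATH"
            else if (rpath) print "DT_RPATH"
            else            print "none"
        }
    '
}
```

For example, `readelf -d ./conftest.ok | rpath_kind`. The distinction matters here because only DT_RPATH is searched before LD_LIBRARY_PATH; both are searched before the cache.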

ThinkOpenly commented 4 years ago

On RHEL 8.1:

$ /opt/at-next-14.0-0-alpha/sbin/ldconfig -p | grep libc.so
        libc.so.6 (libc6,64bit, hwcap: 0x0000400000000000, OS ABI: Linux 3.10.0) => /lib64/power9/libc.so.6
        libc.so.6 (libc6,64bit, OS ABI: Linux 4.18.0) => /opt/at-next-14.0-0-alpha/lib64/libc.so.6
        libc.so.6 (libc6,64bit, OS ABI: Linux 3.10.0) => /lib64/libc.so.6

On Ubuntu 18.04.3:

$ /opt/at-next-14.0-0-alpha/sbin/ldconfig -p | grep libc.so
        libc.so.6 (libc6,64bit, OS ABI: Linux 4.15.0) => /opt/at-next-14.0-0-alpha/lib64/libc.so.6
        libc.so.6 (libc6,64bit, OS ABI: Linux 3.10.0) => /lib/powerpc64le-linux-gnu/libc.so.6

In the search order above, (4) searches the cache, and notes that "shared objects installed in hardware capability directories [...] are preferred to other shared objects." So, due to the presence of /lib64/power9/libc.so.6 on the RHEL 8.1 system, it is chosen first.
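To check another system for the same condition, the cache listing can be filtered for entries tagged with a hwcap value. This is only a sketch that parses ldconfig -p output from stdin; the name hwcap_entries is hypothetical.

```shell
# Hypothetical helper: reads `ldconfig -p` output on stdin and prints the
# paths of cache entries for the named library that carry a hwcap tag,
# i.e. the ones the loader prefers over untagged entries.
hwcap_entries() {
    awk -v lib="$1" '$1 == lib && /hwcap:/ { print $NF }'
}
```

Running `/opt/at-next-14.0-0-alpha/sbin/ldconfig -p | hwcap_entries libc.so.6` on the RHEL 8.1 system above should print /lib64/power9/libc.so.6, and nothing on the Ubuntu system.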

ThinkOpenly commented 4 years ago

AT13 also fails for a similar reason, but differently. From at13.0-1-rc2.redhat-8_ppc64le_ppc64le/logs/_gcc_2-3_standard_buildf-06_make.log:

build/genautomata: /lib64/power9/libm.so.6: version `GLIBC_2.29' not found (required by build/genautomata)

Trivial executables do seem to work, allowing the configure steps to succeed:

$ echo 'int main(){}' > conftest.c
$ /opt/at13.0-1-rc2/bin/gcc -o conftest -g  -Wl,--dynamic-linker=/opt/at13.0-1-rc2/lib64/ld64.so.2 conftest.c
$ ./conftest
$ ldd ./conftest
        linux-vdso64.so.1 (0x00007fff8b330000)
        libc.so.6 => /lib64/power9/libc.so.6 (0x00007fff8b110000)
        /opt/at13.0-1-rc2/lib64/ld64.so.2 => /lib64/ld64.so.2 (0x00007fff8b350000)
$ LD_DEBUG=libs ./conftest
     36832:     find library=libc.so.6 [0]; searching
     36832:      search cache=/opt/at13.0-1-rc2/etc/ld.so.cache
     36832:       trying file=/lib64/power9/libc.so.6
     36832:
     36832:
     36832:     calling init: /lib64/power9/libc.so.6
     36832:
     36832:
     36832:     initialize program: ./conftest
     36832:
     36832:
     36832:     transferring control: ./conftest
     36832:
     36832:
     36832:     calling fini: ./conftest [0]
     36832:
tuliom commented 4 years ago

Using -rpath adds an RPATH field in the executable (edited for brevity):

GCC stage 2 is the first build that uses --with-advance-toolchain.

This is supposed to add the rpath to the built files: https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/config.gcc;h=ae5a845fccea2a2c0d8ba2275972f664a2a9a26e;hb=HEAD#l4975

Why isn't it working?

ThinkOpenly commented 4 years ago

Looks like that flag is acted upon in ...gcc/gcc/config.gcc, but it also looks like stage2 never gets there. Trying (in vain so far) to understand the GCC bootstrap procedure...

ThinkOpenly commented 4 years ago

I've found it challenging to fully understand the process. As Tulio said, --with-advance-toolchain is used for the GCC stage 2 build. However, it is only applied in gcc/config.gcc (by atcfg_pre_hacks and atcfg_configure in configs/13.0/packages/gcc/stage_2). This may be too late, as other modules are built before the gcc module. Here, intl fails:

make[3]: Entering directory '/home/pc/at13/at14.0-0-alpha.redhat-8_ppc64le_ppc64le/builds/gcc_2'
Configuring stage 1 in ./intl
configure: creating cache ./config.cache
checking for powerpc64le-linux-gnu-gcc... /opt/at-next-14.0-0-alpha/bin/gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... configure: error: in `/home/pc/at13/at14.0-0-alpha.redhat-8_ppc64le_ppc64le/builds/gcc_2/intl':
configure: error: cannot run C compiled programs.
tuliom commented 4 years ago

This started to happen after adding --with-stage1-ldflags="-Wl,--dynamic-linker=${ldso}", which forces the test program to use the AT loader at GCC stage 1. This parameter is not wrong; it was added in order to avoid mixing AT headers with the system libraries, more specifically glibc headers and libraries.

What is happening? Early in gcc_2, ${AT_DEST}/bin/gcc is executed with mixed glibc parts: the loader comes from AT, but libc comes from the system. That happens because the glibc loader favors CPU-optimized over general builds, e.g. /lib64/power9/libc.so.6 instead of ${AT_DEST}/lib64/libc.so.6. So, this error appears only when running a distro that provides CPU-optimized glibc for the processor that you're running on, e.g. it happens on RHEL 8 on POWER9, but not on RHEL 8 on POWER8.

I believe there are at least 4 possible solutions:

  1. Work around GCC stage 1 in gcc_2 using -rpath. This is the solution from PR #1299. Nothing new here.

  2. Work around glibc_1. If glibc_1 provides symlinks under ${AT_DEST}/lib64/power9, ldconfig_1 will prefer the symlinks from AT, creating a cache that will match what is expected in gcc_2.

  3. Container-based build. Run gcc_2 inside a container with only the files that are strictly needed. This is what most distros do, but it is much more complex to implement, requiring us to choose all the required files/packages in order not to disable an important feature by mistake.

  4. Work around GCC stage 1 in gcc_2 by using the system's headers. Let GCC stage 1 in gcc_2 use the system's headers. It may be necessary to copy mpc, mpfr and gmp to another place in order to avoid picking up the headers from other AT-provided libraries. Libraries built with the system's headers may disable features.

Notice that we may need to adopt different solutions for AT next and AT <= 13.0 in order to guarantee their stability.

mscastanho commented 4 years ago
ThinkOpenly commented 4 years ago

I echo @mscastanho: (2) is not bad, and arguably works around a bug in the loader (if pathA is before pathB in the loader path, should the loader really accept pathB/opt/libc.so before pathA/libc.so?). (3) would be a nice longer-term solution, but is likely a fair bit of work.

ThinkOpenly commented 3 years ago

I looked at whether pull #1299 solves this issue. While it does help with older versions of AT, it does not help with the current version (15). I do note that GCC stage 2 was removed (commit b1f57fa69b49064c30dc39e0db71398962267aea), which is the stage in which that pull request made changes, but making similar changes in stage 3 did not help. I'm still investigating, but GCC stage 3 fails in the configure step running: powerpc64le-linux-gnu-gcc -o conftest conftest.c

Note that there are no flags used. Results:

$ ./conftest
./conftest: /lib64/power9/libc.so.6: version `GLIBC_2.34' not found (required by ./conftest)

Note that it seeks to run an AT-linked executable with the system libc.
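One way to confirm which glibc an executable expects is to look at the symbol versions it requires, which readelf -V lists. The sketch below picks the highest GLIBC_x.y tag out of such output; the function name max_glibc_version is hypothetical, and it relies on GNU sort -V for version ordering.

```shell
# Hypothetical helper: reads text containing GLIBC_x.y version tags on stdin
# (for example `readelf -V ./conftest` output) and prints the highest one.
# Tags such as GLIBC_PRIVATE do not match the numeric pattern and are skipped.
max_glibc_version() {
    grep -Eo 'GLIBC_[0-9]+(\.[0-9]+)*' | sort -V | tail -n 1
}
```

If `readelf -V ./conftest | max_glibc_version` reports a version newer than what the resolved libc provides, the loader fails with a "version not found" error like the one above.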

ThinkOpenly commented 3 years ago

OK, with a few symlinks added at the end of GCC stage 1, the build completes successfully! Here's the (minimal) patch:

diff --git a/configs/11.0/packages/gcc/stage_1 b/configs/11.0/packages/gcc/stage_1
index 2df98ea19e2c..c36c68fe23ab 100644
--- a/configs/11.0/packages/gcc/stage_1
+++ b/configs/11.0/packages/gcc/stage_1
@@ -207,0 +208,5 @@ atcfg_post_install() {
+
+               mkdir -p "${at_dest}/lib64/power9"
+               ln -s "../libc.so.6" "${at_dest}/lib64/power9/libc.so.6"
+               ln -s "../libm.so.6" "${at_dest}/lib64/power9/libm.so.6"
+

The next question is: what is the best fix? The above is minimal and only works iff there is exactly one "optimized" subdirectory on the system. Shall I go find all of the files in the system library directories which contain a library which matches any library already built for AT / GCC stage 1, and create symlinks for all of them? (Unknown to me at the moment: do I need to go back at some point and clean all of that up? With the patch above, it did not seem that removing the symlinks was needed.)

tuliom commented 3 years ago

Shall I go find all of the files in the system library directories which contain a library which matches any library already built for AT / GCC stage 1, and create symlinks for all of them?

@ThinkOpenly Thinking in the long term and considering future processors, I do think it's ideal if all libraries from glibc have their own symlink for each entry in BUILD_ACTIVE_MULTILIBS. I wonder if it works if we just symlink the processor directory, e.g. ln -s "${at_dest}/lib64" "${at_dest}/lib64/power9"

do I need to go back at some point and clean all of that up?

Yes, you do. Otherwise, the processor-optimized build from AT will overwrite the default build, e.g. a P9 libc.so.6 will be placed in ${at_dest}/lib64/ causing issues when running on P8.

Notice that we don't have to modify the contents in ${at_dest}/lib64 for this, it might be easier to revert your work if you benefit from tmp/ in the build directory and just add an extra file to ${at_dest}/ld.so.conf.d/ pointing to the directory you created. In the end, you can just remove this file later. This might also help to avoid conflicting writes to processor-optimized directories.

ThinkOpenly commented 3 years ago

Shall I go find all of the files in the system library directories which contain a library which matches any library already built for AT / GCC stage 1, and create symlinks for all of them?

@ThinkOpenly Thinking in the long term and considering future processors, I do think it's ideal if all libraries from glibc have their own symlink for each entry in BUILD_ACTIVE_MULTILIBS.

Is BUILD_ACTIVE_MULTILIBS guaranteed to be a superset of whatever the AT loader looks for?

I wonder if it works if we just symlink the processor directory, e.g. ln -s "${at_dest}/lib64" "${at_dest}/lib64/power9"

I will try that.

do I need to go back at some point and clean all of that up?

Yes, you do. Otherwise, the processor-optimized build from AT will overwrite the default build, e.g. a P9 libc.so.6 will be placed in ${at_dest}/lib64/ causing issues when running on P8.

Indeed. It's a bit ugly to have cross-stage dependencies like that... hmm. I wonder when the clean up step should be inserted?

Notice that we don't have to modify the contents in ${at_dest}/lib64 for this, it might be easier to revert your work if you benefit from tmp/ in the build directory and just add an extra file to ${at_dest}/ld.so.conf.d/ pointing to the directory you created. In the end, you can just remove this file later. This might also help to avoid conflicting writes to processor-optimized directories.

Modifications there require a subsequent ldconfig step, correct?

Might this also prevent the need for a clean-up step?

tuliom commented 3 years ago

Is BUILD_ACTIVE_MULTILIBS guaranteed to be a superset of whatever the AT loader looks for?

@ThinkOpenly No, but usually AT is ahead of the distros, e.g. we started building optimized libraries for P10 1 year ago and distros haven't adopted this yet.

I wonder when the clean up step should be inserted?

We don't have to hurry, but it has to happen before the last ldconfig execution (ldconfig_2).

Modifications there require a subsequent ldconfig step, correct?

Correct. You need one after creation and another after removal. Luckily, ldconfig_1 executes after glibc_1 and ldconfig_2 executes after glibc_2. So, I think we could use the post-install hacks in both glibc steps to take care of this.

Might this also prevent the need for a clean-up step?

I'm not sure I understand your point. You may not need to remove the files from tmp/, but it's still very important to remove the file from ${at_dest}/ld.so.conf.d/.
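The create-then-remove sequence described here could be sketched as two small functions. This is an illustration only, not the AT implementation: the file name build-hwcap-override.conf and the function names are hypothetical, and the ldconfig command is passed in as an argument so the caller decides which ldconfig binary rebuilds the cache.

```shell
# Hypothetical file name; any unique name in the conf directory would do.
OVERRIDE_CONF="build-hwcap-override.conf"

# Write a conf file pointing at the temporary library directory, then
# rebuild the cache with the given ldconfig command.
add_override() {
    conf_dir=$1
    lib_dir=$2
    ldconfig_cmd=$3
    mkdir -p "$conf_dir"
    printf '%s\n' "$lib_dir" > "${conf_dir}/${OVERRIDE_CONF}"
    "$ldconfig_cmd"
}

# Remove the conf file and rebuild the cache again so the temporary
# directory is no longer searched.
remove_override() {
    conf_dir=$1
    ldconfig_cmd=$2
    rm -f "${conf_dir}/${OVERRIDE_CONF}"
    "$ldconfig_cmd"
}
```

In this sketch the two ldconfig runs correspond to the "one after creation and another after removal" requirement; where exactly the calls belong in the glibc_1 and glibc_2 hooks is left open.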

ThinkOpenly commented 3 years ago

Shall I go find all of the files in the system library directories which contain a library which matches any library already built for AT / GCC stage 1, and create symlinks for all of them?

@ThinkOpenly Thinking in the long term and considering future processors, I do think it's ideal if all libraries from glibc have their own symlink for each entry in BUILD_ACTIVE_MULTILIBS. I wonder if it works if we just symlink the processor directory, e.g. ln -s "${at_dest}/lib64" "${at_dest}/lib64/power9"

This latter suggestion, unfortunately, does not work. ldconfig checks the inode number for all directories added to the search path, and will not add the directory if the inode number matches. (I understand why, but this seems pretty close to being a bug given the use-case here.) I will try actual subdirectories containing symlinks for all of the files in the parent.
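A real subdirectory filled with per-file symlinks could be created along these lines. This is a rough sketch under the assumption that only files matching *.so* in the parent need links; the function name populate_hwcap_dir is hypothetical.

```shell
# Hypothetical helper: create <parent>/<sub> as a real directory and fill it
# with relative symlinks to every shared object found directly in <parent>.
populate_hwcap_dir() {
    parent=$1
    sub=$2
    mkdir -p "${parent}/${sub}"
    for f in "${parent}"/*.so*; do
        [ -e "$f" ] || continue   # glob matched nothing
        [ -d "$f" ] && continue   # skip directories
        name=$(basename "$f")
        ln -sf "../${name}" "${parent}/${sub}/${name}"
    done
}
```

For instance, `populate_hwcap_dir "${at_dest}/lib64" power9` generalizes the two ln -s lines from the minimal patch. Because the subdirectory is a real directory with its own inode, it avoids the ldconfig duplicate-directory check described above.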

ThinkOpenly commented 3 years ago

I am making progress with the suggested approach of adding a new directory to ld.so.conf which supersedes the system libraries by having arch-specific directories. I populated these directories by copying files from the AT lib64 directory. The build completes successfully, but FVTR fails (a manually created summary follows):

investigating...

ThinkOpenly commented 3 years ago

The AT-built readelf command is failing:

/home/pc/opt8/at15.0-0-alpha/bin/readelf: symbol lookup error: /lib64/libk5crypto.so.3: undefined symbol: EVP_KDF_ctrl, version OPENSSL_1_1_1b

The dependency list from ldd is:

file=libdebuginfod.so.1 [0];  needed by /home/pc/opt8/at15.0-0-alpha/bin/readelf [0]
file=libcurl.so.4 [0];  needed by /lib64/libdebuginfod.so.1 [0]
file=libssl.so.1.1 [0];  needed by /lib64/libcurl.so.4 [0]
  trying file=/home/pc/opt8/at15.0-0-alpha/lib64/power9/libssl.so.1.1
file=libk5crypto.so.3 [0];  needed by /lib64/libcurl.so.4 [0]

AT contains OpenSSL 1.1.1k, and 'b' is before 'k', last I checked.

So, does this imply that AT might need to include an updated Kerberos libraries package?

ThinkOpenly commented 3 years ago

Possibly instructive that openssl and krb5-libs are tightly bound: https://unix.stackexchange.com/questions/594618/git-push-error-undefined-symbol-evp-kdf-ctrl-version-openssl-1-1-1b

ThinkOpenly commented 3 years ago

Even more interesting, in that the "EVP_KDF" support is apparently a Red Hat downstream add: https://github.com/openssl/openssl/issues/11471

tuliom commented 3 years ago

@ThinkOpenly That's issue #1969 .

ThinkOpenly commented 3 years ago

Ugh. So, are the choices:

  1. Support the (unstable) EVP_KDF API
  2. Import krb5-libs into AT ?

While there are certainly problems with (1), a problem with not doing (1) is that the libssl in AT does not provide the same complete API as the one in RHEL 8. :-/

tuliom commented 3 years ago

Why is readelf depending on libk5crypto.so.3? If this isn't an important feature, we could remove it. But we would continue having the same issue as reported in #1969 .

ThinkOpenly commented 3 years ago

Why is readelf depending on libk5crypto.so.3?

The dependency chain is above, repeated here: readelf -> libdebuginfod -> libcurl -> {libssl and libk5crypto}

Why libdebuginfod depends on libcurl is an interesting question.

If this isn't an important feature, we could remove it. But we would continue having the same issue as reported in #1969 .

So, it's not something we can remove, because both dependencies are in libcurl, not part of AT.

ThinkOpenly commented 3 years ago

Any idea if I need to fix something to address the issues reported by the ck_requires test?

Other than that, the other FVTR failures are apparently limited to the issue reported in #1969.

The AT RPMs are built now. Shall I submit a pull request for the changes I have in hand, which seem to address this issue, so that we can then pivot to #1969 separately?