EESSI / compatibility-layer

Compatibility layer of the EESSI project
https://eessi.github.io/docs/compatibility_layer
GNU General Public License v2.0

problem when installing `x11-base/xorg-proto` package in EESSI 2023.04 and 2023.06 #187

Closed boegel closed 1 year ago

boegel commented 1 year ago

During the installation of x11-base/xorg-proto, which is required for a package included in the EESSI package set, we ran into some strange problems:

TASK [compatibility_layer : Install package set ['eessi-2023.06-linux-x86_64']] ***
failed: [localhost] (item=eessi-2023.06-linux-x86_64) => {"ansible_loop_var": "item", "changed": false, "cmd": ["/cvmfs/pilot.nessi.no/versions/2023.06/compat/linux/x86_64/usr/bin/emerge", "--noreplace", "--ask=n", "@eessi-2023.06-linux-x86_64"], "item": "eessi-2023.06-linux-x86_64", "msg": "Packages not installed.", "rc": 1, "stderr": "ERROR:root:code for hash blake2b was not found.
Traceback (most recent call last):
  File \"/cvmfs/pilot.nessi.no/versions/2023.06/compat/linux/x86_64/usr/lib/python3.11/hashlib.py\", line 307, in <module>
    globals()[__func_name] = __get_hash(__func_name)
                             ^^^^^^^^^^^^^^^^^^^^^^^
  File \"/cvmfs/pilot.nessi.no/versions/2023.06/compat/linux/x86_64/usr/lib/python3.11/hashlib.py\", line 129, in __get_openssl_constructor
    return __get_builtin_constructor(name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/cvmfs/pilot.nessi.no/versions/2023.06/compat/linux/x86_64/usr/lib/python3.11/hashlib.py\", line 123, in __get_builtin_constructor
    raise ValueError('unsupported hash type ' + name)
ValueError: unsupported hash type blake2b
ERROR:root:code for hash blake2s was not found.

That could be a red herring though, because the actual error that makes the installation of x11-base/xorg-proto fail seems to be this:

Could not detect Ninja v1.8.2 or newer

Then again, maybe the check for Ninja is broken because of the blake2b issue, since Ninja is actually there in the compat layer (installed as dev-util/ninja).

This problem occurred when building the 2023.04 version of the compat layer, where we fell back to a manual build-and-deploy because the problem didn't present itself when manually resuming the Ansible playbook used to build the compat layer.

It occurred again when building the 2023.06 version of the compat layer (where we initially only wanted to use OpenSSL 1.1.1 rather than OpenSSL 3.x), even when using an updated Gentoo Prefix bootstrap script and a recent gentoo commit.

boegel commented 1 year ago

It was possible to trigger the underlying problem by bind-mounting the cvmfs directory from the tarball created by the bot into a container image at /cvmfs, and then running python -c 'from hashlib import blake2b' in the Gentoo Prefix environment (started with startprefix):

$ python3.11 -c 'from hashlib import blake2b'
ERROR:root:code for hash blake2b was not found.
Traceback (most recent call last):
  File "/cvmfs/pilot.eessi-hpc.org/versions/2023.06/compat/linux/aarch64/usr/lib/python3.11/hashlib.py", line 307, in <module>
    globals()[__func_name] = __get_hash(__func_name)
                             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/cvmfs/pilot.eessi-hpc.org/versions/2023.06/compat/linux/aarch64/usr/lib/python3.11/hashlib.py", line 129, in __get_openssl_constructor
    return __get_builtin_constructor(name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cvmfs/pilot.eessi-hpc.org/versions/2023.06/compat/linux/aarch64/usr/lib/python3.11/hashlib.py", line 123, in __get_builtin_constructor
    raise ValueError('unsupported hash type ' + name)
ValueError: unsupported hash type blake2b

That didn't reproduce the problem consistently though: sometimes it triggered the error, sometimes it didn't, for no obvious reason...
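Since the failure was intermittent, it can help to run the import in a fresh interpreter on every attempt, so that nothing cached by an earlier attempt hides the error. A minimal sketch, assuming a hypothetical helper named probe (not part of EESSI or Gentoo tooling) that runs one Python statement in a subprocess:

```python
import subprocess
import sys


def probe(statement, python=sys.executable):
    """Run one Python statement in a fresh interpreter.

    Returns (ok, detail): ok is True when the interpreter exited with
    status 0; otherwise detail is the last line written to stderr,
    which is normally the exception message.
    """
    result = subprocess.run(
        [python, "-c", statement],
        capture_output=True,
        text=True,
    )
    if result.returncode == 0:
        return True, ""
    lines = result.stderr.strip().splitlines()
    return False, lines[-1] if lines else ""
```

Inside the Gentoo Prefix environment, calling probe("from hashlib import blake2b") several times in a row would show whether the error comes and goes between runs.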

boegel commented 1 year ago

A more direct import reveals the underlying problem:

$ python3.11 -c 'import _blake2'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: libgomp.so.1: cannot open shared object file: No such file or directory
$ ldd /cvmfs/pilot.nessi.no/versions/2023.06/compat/linux/x86_64/usr/lib/python3.11/lib-dynload/_blake2.cpython-311-x86_64-linux-gnu.so
        linux-vdso.so.1 (0x00007ffe5391b000)
        libb2.so.1 => /cvmfs/pilot.nessi.no/versions/2023.06/compat/linux/x86_64/usr/lib64/libb2.so.1 (0x00007f4cbe9ad000)
        libc.so.6 => /cvmfs/pilot.nessi.no/versions/2023.06/compat/linux/x86_64/lib64/libc.so.6 (0x00007f4cbe7db000)
        /cvmfs/pilot.nessi.no/versions/2023.06/compat/linux/x86_64/lib64/ld-linux-x86-64.so.2 (0x00007f4cbe9c7000)
        libgomp.so.1 => not found
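
When many extension modules need checking, the "not found" entries can be picked out of the ldd output automatically. A small sketch, assuming the ldd output has already been captured as text; the function name missing_libraries is made up for illustration:

```python
def missing_libraries(ldd_output):
    """Return the library names that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Resolved and unresolved entries both look like "name => target".
        name, sep, target = line.strip().partition(" => ")
        if sep and target.strip() == "not found":
            missing.append(name)
    return missing


# Three lines taken from the ldd output above.
sample = """\
        linux-vdso.so.1 (0x00007ffe5391b000)
        libc.so.6 => /cvmfs/pilot.nessi.no/versions/2023.06/compat/linux/x86_64/lib64/libc.so.6 (0x00007f4cbe7db000)
        libgomp.so.1 => not found
"""
print(missing_libraries(sample))  # ['libgomp.so.1']
```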

After some painful debugging, @trz42 figured out that the culprit was an incorrectly populated linker cache: /cvmfs/pilot.nessi.no/versions/2023.06/compat/linux/aarch64/etc/ld.so.conf.d/05gcc-aarch64-unknown-linux-gnu.conf was sometimes empty, which shouldn't happen. It should list /cvmfs/pilot.nessi.no/versions/2023.06/compat/linux/aarch64/usr/lib/gcc/aarch64-unknown-linux-gnu/9.5.0, the directory that contains libgomp.so.
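
A check like this could be automated as a sanity test after building the compat layer. The sketch below assumes the conf file holds one directory per line (with optional # comments); the function name conf_lists_directory is hypothetical:

```python
from pathlib import Path


def conf_lists_directory(conf_file, expected_dir):
    """Return True if conf_file contains expected_dir as one of its entries.

    A missing or empty file yields False, which is the broken state
    described above.
    """
    path = Path(conf_file)
    if not path.is_file():
        return False
    wanted = str(expected_dir).rstrip("/")
    for raw in path.read_text().splitlines():
        # Drop trailing comments and surrounding whitespace.
        entry = raw.split("#", 1)[0].strip()
        if entry and entry.rstrip("/") == wanted:
            return True
    return False
```

Called with the 05gcc-aarch64-unknown-linux-gnu.conf path and the GCC 9.5.0 library directory mentioned above, it would return False whenever the file is empty.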

boegel commented 1 year ago

Eventually we found https://bugs.gentoo.org/865835, which describes the exact problem we're seeing.

It suggests that updating to GCC 10.x in the compat layer rather than sticking to GCC 9.5.0 may fix the problem, since the fix was only backported as far back as GCC 10.x. We're purposely sticking to an older version of GCC (by masking >=sys-devel/gcc-10) so we can still install old versions of GCC in the software layer (for example GCC 9.3.0, which serves as a base for foss/2020a), but it seems like we're heading into uncharted territory there, since GCC 9.x is no longer actively maintained in Gentoo.

It also points to https://bugs.gentoo.org/show_bug.cgi?id=459038, which suggests that using gcc-config may be problematic - we do this to make sure we're indeed using the intended GCC version as "system compiler" in the compat layer. Because of the masking we do already, there's actually no need for this, since there will be only one GCC version installed.

Moreover, @amadio pointed out that we're actually using gcc-config incorrectly: gcc-config 9.5.0 doesn't select GCC 9.5.0 as the version to use at all - it expects a proper compiler "name" like x86_64-pc-linux-gnu-9.5.0; cf. https://github.com/gentoo/gcc-config

The linker cache is updated after running an emerge command, which explains why the problem seemed difficult to reproduce. Running env-update should also be enough to work around the problem.
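
To confirm that the cache has been repopulated, the output of ldconfig -p (which prints the current linker cache) can be searched for the library. A rough sketch, assuming that output has been captured as text; the function name cached_paths and the sample paths are made up for illustration:

```python
def cached_paths(ldconfig_output, soname):
    """Return the paths that the linker cache lists for soname."""
    paths = []
    for line in ldconfig_output.splitlines():
        # Cache entries look like "libfoo.so.1 (libc6,x86-64) => /path/libfoo.so.1".
        left, sep, path = line.strip().partition(" => ")
        if sep and left.split(" ", 1)[0] == soname:
            paths.append(path.strip())
    return paths


# Illustrative sample only; the paths are not taken from a real compat layer.
sample_cache = """\
2 libs found in cache `/etc/ld.so.cache'
\tlibgomp.so.1 (libc6,x86-64) => /example/prefix/lib/libgomp.so.1
\tlibc.so.6 (libc6,x86-64) => /example/prefix/lib64/libc.so.6
"""
print(cached_paths(sample_cache, "libgomp.so.1"))  # ['/example/prefix/lib/libgomp.so.1']
```

An empty result for libgomp.so.1 would correspond to the broken state; a non-empty one after env-update would suggest the workaround took effect.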

amadio commented 1 year ago

In principle, it should have worked with a version, but it no longer does since we moved to using only the major version in the profiles: a single number is now interpreted as an index into the list of available items. I pointed this out on IRC, and I think a fix will be put in place to let you say v10, v11, etc. to specify the version, to disambiguate from 1, 2, etc., which choose an item from the list. Cheers,

boegel commented 1 year ago

Problem is fixed by removing the use of gcc-config in our Ansible playbook used for building the compat layer in https://github.com/EESSI/compatibility-layer/pull/188, so closing this...