Closed boegel closed 1 year ago
It was possible to trigger the underlying problem by bind-mounting the cvmfs directory from the tarball created by the bot into a container image at /cvmfs, and then running python -c 'from hashlib import blake2b' in the Gentoo Prefix environment (started with startprefix):
$ python3.11 -c 'from hashlib import blake2b'
ERROR:root:code for hash blake2b was not found.
Traceback (most recent call last):
  File "/cvmfs/pilot.eessi-hpc.org/versions/2023.06/compat/linux/aarch64/usr/lib/python3.11/hashlib.py", line 307, in <module>
    globals()[__func_name] = __get_hash(__func_name)
                             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/cvmfs/pilot.eessi-hpc.org/versions/2023.06/compat/linux/aarch64/usr/lib/python3.11/hashlib.py", line 129, in __get_openssl_constructor
    return __get_builtin_constructor(name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cvmfs/pilot.eessi-hpc.org/versions/2023.06/compat/linux/aarch64/usr/lib/python3.11/hashlib.py", line 123, in __get_builtin_constructor
    raise ValueError('unsupported hash type ' + name)
ValueError: unsupported hash type blake2b
That didn't reproduce consistently though: sometimes this would trigger the problem, and sometimes it wouldn't, for no obvious reason...
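Since the failure was intermittent, a small probe that can be run repeatedly may be handy. The following is a minimal sketch and was not part of the original debugging session; probe_hashes is a hypothetical helper name. It tries to construct each hash via hashlib.new and records the ones that raise ValueError, the same error that shows up in the traceback above.

```python
import hashlib


def probe_hashes(names):
    """Return {name: None or error message} for each requested hash name."""
    results = {}
    for name in names:
        try:
            hashlib.new(name)
            results[name] = None
        except ValueError as exc:
            # hashlib raises ValueError for algorithms it cannot provide
            results[name] = str(exc)
    return results


if __name__ == "__main__":
    # The hash names below are only an example selection.
    for name, error in probe_hashes(["blake2b", "blake2s", "sha256"]).items():
        print(f"{name}: {'ok' if error is None else error}")
```

Running this in the Gentoo Prefix environment right after a build, and again after an emerge command, should show whether blake2b is affected at that point.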
A more direct import reveals the underlying problem:
$ python3.11 -c 'import _blake2'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: libgomp.so.1: cannot open shared object file: No such file or directory
$ ldd /cvmfs/pilot.nessi.no/versions/2023.06/compat/linux/x86_64/usr/lib/python3.11/lib-dynload/_blake2.cpython-311-x86_64-linux-gnu.so
    linux-vdso.so.1 (0x00007ffe5391b000)
    libb2.so.1 => /cvmfs/pilot.nessi.no/versions/2023.06/compat/linux/x86_64/usr/lib64/libb2.so.1 (0x00007f4cbe9ad000)
    libc.so.6 => /cvmfs/pilot.nessi.no/versions/2023.06/compat/linux/x86_64/lib64/libc.so.6 (0x00007f4cbe7db000)
    /cvmfs/pilot.nessi.no/versions/2023.06/compat/linux/x86_64/lib64/ld-linux-x86-64.so.2 (0x00007f4cbe9c7000)
    libgomp.so.1 => not found
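The same check can be automated for other extension modules. This is a rough sketch, assuming ldd is available in the environment; missing_libraries and check_shared_object are hypothetical helper names, and the parsing only handles the "name => not found" form shown in the output above.

```python
import subprocess


def missing_libraries(ldd_output):
    """Return the library names that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        name, sep, target = line.strip().partition(" => ")
        if sep and target.strip() == "not found":
            missing.append(name)
    return missing


def check_shared_object(path):
    """Run ldd on a shared object and return its unresolved dependencies."""
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=False
    )
    return missing_libraries(result.stdout)
```

Calling check_shared_object on the _blake2 shared object from the ldd command above would be expected to report libgomp.so.1 while the linker cache is incomplete.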
Some painful debugging later, @trz42 figured out that the culprit was that the linker cache wasn't correctly populated: /cvmfs/pilot.nessi.no/versions/2023.06/compat/linux/aarch64/etc/ld.so.conf.d/05gcc-aarch64-unknown-linux-gnu.conf was sometimes empty, which shouldn't happen. /cvmfs/pilot.nessi.no/versions/2023.06/compat/linux/aarch64/usr/lib/gcc/aarch64-unknown-linux-gnu/9.5.0 (the location of libgomp.so) should be listed there.
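A quick way to spot this condition is to scan the ld.so.conf.d directory for files that list no directories. The following is a minimal sketch; empty_conf_files is a hypothetical helper name, and treating lines starting with # as comments is an assumption about the file format.

```python
from pathlib import Path


def empty_conf_files(conf_dir):
    """Return names of *.conf files in conf_dir that list no directories."""
    empty = []
    for conf in sorted(Path(conf_dir).glob("*.conf")):
        entries = [
            line.strip()
            for line in conf.read_text().splitlines()
            # skip blank lines and (assumed) comment lines
            if line.strip() and not line.strip().startswith("#")
        ]
        if not entries:
            empty.append(conf.name)
    return empty
```

For the case described here, one would pass the etc/ld.so.conf.d directory of the compat layer and expect 05gcc-aarch64-unknown-linux-gnu.conf to show up in the result whenever the problem is present.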
Eventually we found https://bugs.gentoo.org/865835, which describes the exact problem we're seeing.
It suggests that updating to GCC 10.x in the compat layer, rather than sticking to GCC 9.5.0, may fix the problem, since a fix was backported only as far back as GCC 10.x.
Although we're purposely sticking to an older version of GCC (by masking >=sys-devel/gcc-10) so we can still install old versions of GCC in the software layer (for example GCC 9.3.0, which serves as a base for foss/2020a), it seems like we're heading into uncharted territory there, since GCC 9.x is no longer actively maintained in Gentoo.
It also points to https://bugs.gentoo.org/show_bug.cgi?id=459038, which suggests that using gcc-config may be problematic. We do this to make sure we're indeed using the intended GCC version as "system compiler" in the compat layer.
Because of the masking we already do, there's actually no need for this, since only one GCC version will be installed.
Moreover, @amadio pointed out that we're actually using gcc-config incorrectly: gcc-config 9.5.0 doesn't select GCC 9.5.0 as the version to use at all. It expects a proper compiler "name" like x86_64-pc-linux-gnu-9.5.0; cf. https://github.com/gentoo/gcc-config
The linker cache is updated after running an emerge command, which explains why the problem seemed difficult to reproduce. Running env-update should also be enough to work around the problem.
In principle, it should have worked with a version, but it no longer does since we moved to using only the major version in the profiles: a single number is now interpreted as an index into the list of available items. I pointed this out on IRC, and I think a fix will be put in place to let you say v10, v11, etc. to specify the version, to disambiguate it from 1, 2, etc., which choose an item from the list. Cheers,
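The ambiguity described in this comment can be illustrated with a toy model. This is not gcc-config's actual implementation; resolve_selector and the profile names in the usage note are hypothetical. It only assumes what the comment states: a purely numeric argument is treated as a position in the list of available profiles, and anything else has to be a full profile name.

```python
def resolve_selector(selector, profiles):
    """Toy model of profile selection (not gcc-config's real logic).

    A purely numeric selector is a 1-based index into profiles;
    anything else must match a profile name exactly.
    Returns the chosen profile name, or None if nothing matches.
    """
    if selector.isdigit():
        index = int(selector)
        if 1 <= index <= len(profiles):
            return profiles[index - 1]
        return None
    return selector if selector in profiles else None
```

With hypothetical profiles ["x86_64-pc-linux-gnu-12", "x86_64-pc-linux-gnu-13"], the selector "13" is read as an index and matches nothing in this model, while "2" picks the second entry. A bare version such as "9.5.0" is not a profile name either, which is why the full name is needed.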
The problem is fixed by removing the use of gcc-config in our Ansible playbook used for building the compat layer, in https://github.com/EESSI/compatibility-layer/pull/188, so closing this...
During the installation of x11-base/xorg-proto, which is required for a package included in the EESSI package set, we ran into some strange problems:
That could be a red herring though, because the actual error that makes the installation of x11-base/xorg-proto fail seems to be this:
Then again, maybe the check for Ninja is broken because of the blake2b issue, since Ninja is actually there in the compat layer (installed as dev-util/ninja).
This problem occurred when building the 2023.04 version of the compat layer, where we fell back to a manual build-and-deploy because the problem didn't present itself when manually resuming the Ansible playbook used to build the compat layer.
It occurred again when building the 2023.06 version of the compat layer, where we initially only wanted to use OpenSSL 1.1.1 rather than OpenSSL 3.x, even when using an updated Gentoo Prefix bootstrap script and a recent gentoo commit.