Open casparvl opened 4 months ago
Instance eessi-bot-mc-aws is configured to build:
x86_64/generic for repo eessi-hpc.org-2023.06-compat
x86_64/generic for repo eessi-hpc.org-2023.06-software
x86_64/generic for repo eessi.io-2023.06-compat
x86_64/generic for repo eessi.io-2023.06-software
x86_64/intel/haswell for repo eessi-hpc.org-2023.06-compat
x86_64/intel/haswell for repo eessi-hpc.org-2023.06-software
x86_64/intel/haswell for repo eessi.io-2023.06-compat
x86_64/intel/haswell for repo eessi.io-2023.06-software
x86_64/intel/skylake_avx512 for repo eessi-hpc.org-2023.06-compat
x86_64/intel/skylake_avx512 for repo eessi-hpc.org-2023.06-software
x86_64/intel/skylake_avx512 for repo eessi.io-2023.06-compat
x86_64/intel/skylake_avx512 for repo eessi.io-2023.06-software
x86_64/amd/zen2 for repo eessi-hpc.org-2023.06-compat
x86_64/amd/zen2 for repo eessi-hpc.org-2023.06-software
x86_64/amd/zen2 for repo eessi.io-2023.06-compat
x86_64/amd/zen2 for repo eessi.io-2023.06-software
x86_64/amd/zen3 for repo eessi-hpc.org-2023.06-compat
x86_64/amd/zen3 for repo eessi-hpc.org-2023.06-software
x86_64/amd/zen3 for repo eessi.io-2023.06-compat
x86_64/amd/zen3 for repo eessi.io-2023.06-software
aarch64/generic for repo eessi-hpc.org-2023.06-compat
aarch64/generic for repo eessi-hpc.org-2023.06-software
aarch64/generic for repo eessi.io-2023.06-compat
aarch64/generic for repo eessi.io-2023.06-software
aarch64/neoverse_n1 for repo eessi-hpc.org-2023.06-compat
aarch64/neoverse_n1 for repo eessi-hpc.org-2023.06-software
aarch64/neoverse_n1 for repo eessi.io-2023.06-compat
aarch64/neoverse_n1 for repo eessi.io-2023.06-software
aarch64/neoverse_v1 for repo eessi-hpc.org-2023.06-compat
aarch64/neoverse_v1 for repo eessi-hpc.org-2023.06-software
aarch64/neoverse_v1 for repo eessi.io-2023.06-compat
aarch64/neoverse_v1 for repo eessi.io-2023.06-software
Instance eessi-bot-mc-azure is configured to build:
x86_64/amd/zen4 for repo eessi-hpc.org-2023.06-compat
x86_64/amd/zen4 for repo eessi-hpc.org-2023.06-software
x86_64/amd/zen4 for repo eessi.io-2023.06-compat
x86_64/amd/zen4 for repo eessi.io-2023.06-software
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3
New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen3 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_585/11283

date | job status | comment
---|---|---
May 23 09:23:40 UTC 2024 | submitted | job id 11283 awaits release by job manager
May 23 09:24:02 UTC 2024 | released | job awaits launch by Slurm scheduler
May 23 09:28:04 UTC 2024 | running | job 11283 is running
May 23 09:33:17 UTC 2024 | finished | :cry: FAILURE (click triangle for details)
May 23 09:33:17 UTC 2024 | test result | :grin: SUCCESS (click triangle for details)
== No easyconfigs left to be built.
ERROR: Missing dependencies: SentencePiece/0.2.0-foss-2023a, SoX/14.4.2-foss-2023a (no easyconfig file or existing module found)
== Build succeeded for 0 out of 0
>> download succeeded: https://github.com/easybuilders/easybuild-easyconfigs/archive/7124863ed588066e5a988b4073d91381497a7f64.tar.gz
>> running command:
[started at: 2024-05-23 09:28:34]
[working dir: /tmp/eb-dlj1ws2x/eb-9tn8fu3_/tmpp3me5uio/easybuilders]
[output logged in /tmp/eb-dlj1ws2x/eb-9tn8fu3_/easybuild-run_cmd-t6inmlw4.log]
tar xzf /tmp/eb-dlj1ws2x/eb-9tn8fu3_/tmpp3me5uio/easybuilders/7124863ed588066e5a988b4073d91381497a7f64.tar.gz
>> command completed: exit 0, ran in 00h00m01s
== found valid index for /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/EasyBuild/4.9.1/easybuild/easyconfigs, so using it...
== Running parse hook for PyTorch-bundle-2.1.2-foss-2023a.eb...
== Running parse hook for foss-2023a.eb...
== resolving dependencies ...
== Running parse hook for parameterized-0.9.0-GCCcore-12.3.0.eb...
== Running parse hook for GCCcore-12.3.0.eb...
== Running parse hook for GCCcore-12.3.0.eb...
== Running parse hook for scikit-image-0.22.0-foss-2023a.eb...
== Running parse hook for librosa-0.10.1-foss-2023a.eb...
== Running parse hook for imageio-2.33.1-gfbf-2023a.eb...
== Running parse hook for gfbf-2023a.eb...
== Running parse hook for gfbf-2023a.eb...
== Running parse hook for GCC-12.3.0.eb...
== Running parse hook for FlexiBLAS-3.3.1-GCC-12.3.0.eb...
== Running parse hook for GCC-12.3.0.eb...
== Running parse hook for FFTW-3.3.10-GCC-12.3.0.eb...
== Running parse hook for NLTK-3.8.1-foss-2023a.eb...
== Running parse hook for numba-0.58.1-foss-2023a.eb...
== Running parse hook for Scalene-1.5.26-GCCcore-12.3.0.eb...
== Running parse hook for tqdm-4.66.1-GCCcore-12.3.0.eb...
== Running parse hook for LLVM-14.0.6-GCCcore-12.3.0-llvmlite.eb...
== Running parse hook for tensorboard-2.15.1-gfbf-2023a.eb...
I guess that with --from-pr we got SentencePiece and SoX correctly since they were already in develop, but with --from-commit we don't? Should I combine multiple --from-commit options, one for each of those (i.e. look up the commit that provided the required SentencePiece, etc)?
I (and @trz42 and @ocaisa) also saw issues with using --from-commit, see for instance https://github.com/EESSI/software-layer/pull/558#issuecomment-2090836084.
Could you try using the merge commit (see bottom of the PR: 04ccd901a613631b00ccbe504d6d66d6a6c2febb) and check if that does work?
I tried manually
eb -D PyTorch-bundle-2.1.2-foss-2023a-CUDA-12.1.1.eb --from-commit 04ccd901a613631b00ccbe504d6d66d6a6c2febb
But that still shows missing EasyConfigs.
Guess we need to stick to --from-pr then until we find a solution for this...
I was being stupid. I made a mistake in what I ran manually: that's with CUDA. That's not included in that PR/commit for sure... :P However,
eb -D PyTorch-bundle-2.1.2-foss-2023a.eb --from-commit 04ccd901a613631b00ccbe504d6d66d6a6c2febb
shows the same missing easyconfigs. I've switched to --from-pr for now. I'll try to create an upstream issue on EasyBuild later (if there isn't any yet).
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3
New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen3 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_585/11288

date | job status | comment
---|---|---
May 23 11:50:20 UTC 2024 | submitted | job id 11288 awaits release by job manager
May 23 11:50:42 UTC 2024 | released | job awaits launch by Slurm scheduler
May 23 11:55:44 UTC 2024 | running | job 11288 is running
May 23 12:23:21 UTC 2024 | finished | :cry: FAILURE (click triangle for details)
May 23 12:23:21 UTC 2024 | test result | :grin: SUCCESS (click triangle for details)
This is the actual failure:
== 2024-05-23 12:17:16,011 build_log.py:171 ERROR EasyBuild crashed with an error (at easybuild/tools/build_log.py:111 in caller_info): Sanity check failed: extensions sanity check failed for 1 extensions: soundfile
failing sanity check for 'soundfile' extension: command "python -c "import soundfile"" failed; output:
Traceback (most recent call last):
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/librosa/0.10.1-foss-2023a/lib/python3.11/site-packages/soundfile.py", line 161, in <module>
import _soundfile_data # ImportError if this doesn't exist
^^^^^^^^^^^^^^^^^^^^^^
ModuleNotFoundError: No module named '_soundfile_data'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/librosa/0.10.1-foss-2023a/lib/python3.11/site-packages/soundfile.py", line 171, in <module>
_snd = _ffi.dlopen(_libname)
^^^^^^^^^^^^^^^^^^^^^
OSError: cannot load library 'libsndfile.so.1': libsndfile.so.1: cannot open shared object file: No such file or directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/librosa/0.10.1-foss-2023a/lib/python3.11/site-packages/soundfile.py", line 192, in <module>
_snd = _ffi.dlopen(_explicit_libname)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: cannot load library 'libsndfile.so': libsndfile.so: cannot open shared object file: No such file or directory, (at easybuild/framework/easyblock.py:3669 in _sanity_check_step)
I guess this should be provided by the module libsndfile/1.2.2-GCCcore-12.3.0, but I'm not sure which paths get searched by this dlopen call. I think it searches LD_LIBRARY_PATH, which we don't set in EESSI.
I guess this is a pretty fundamental question: how do we make dlopen calls successfully find libs from the EESSI software prefix?
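To make the question above concrete: a bare name such as libsndfile.so.1 leaves the lookup to the dynamic loader, whereas an absolute path bypasses the search entirely. The sketch below resolves a file name against a list of directories and returns an absolute path that could then be handed to dlopen. The helper names (find_in_dirs, find_in_env) and the choice of LIBRARY_PATH as the variable to read are assumptions made for illustration; this is not something EESSI is known to do.

```python
import os


def find_in_dirs(filename, search_path):
    """Return the absolute path of `filename` in the first directory of the
    os.pathsep-separated `search_path` that contains it, or None."""
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a doubled separator
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return os.path.abspath(candidate)
    return None


def find_in_env(filename, env_var="LIBRARY_PATH"):
    """Hypothetical wrapper: search the directories listed in an environment
    variable (assumed here to be set by the loaded modules)."""
    return find_in_dirs(filename, os.environ.get(env_var, ""))
```

A path returned this way could be passed to ctypes.CDLL, so the loader would not need LD_LIBRARY_PATH to locate that particular file.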
See https://github.com/EESSI/software-layer/issues/192 , the Alliance have a solution for this
Spot on, it is indeed the issue of ctypes.util's find_library only returning the filename, not the full path. Or at least: I see that it is using find_library here to get the _libname, which is then used as the dlopen argument. I.e. I expect that if find_library correctly returned the full path, the dlopen call would have succeeded.
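A small diagnostic can help confirm this kind of hypothesis on a given system. The sketch below repeats the two-step pattern described above (find_library, then dlopen via ctypes.CDLL) and reports whether find_library returned nothing, a bare name, or a full path, and whether loading then worked. The function name diagnose and its status strings are made up for this example.

```python
import ctypes
import ctypes.util
import os


def diagnose(name):
    """Mimic the find_library-then-dlopen pattern and report what happened.

    Returns a (status, value) tuple where status is "not-found",
    "bare-name:loaded", "bare-name:dlopen-failed", "full-path:loaded" or
    "full-path:dlopen-failed", and value is what find_library returned.
    """
    found = ctypes.util.find_library(name)
    if found is None:
        return ("not-found", None)
    kind = "full-path" if os.path.isabs(found) else "bare-name"
    try:
        ctypes.CDLL(found)  # for a bare name the dynamic loader does the search
    except OSError:
        return (kind + ":dlopen-failed", found)
    return (kind + ":loaded", found)
```

A "bare-name:dlopen-failed" result for sndfile would match the situation described in this thread; what it returns on any particular system depends on that system's loader configuration.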
The downside is that the Alliance's solution looks quite involved... The upside is we can probably use their shadowing lib from https://github.com/ComputeCanada/custom_ctypes/tree/main/lib . What I don't fully understand is the sitecustomize and ebpythonprefixes stuff they do. Also, they seem to make a separate module out of it, I'm not entirely sure why (do they only load it when they need to?).
I guess my main consideration would be whether we shouldn't just always have this patched find_library function in place. In that case, a simple patch to the installation that normally contains ctypes (I guess that's in the standard Python installation?) would then be enough...
I was also thinking that maybe a patch on ctypes is enough, I don't fully understand all the other stuff going on with them
The changes they apply to ctypes are quite small. See below for Python/3.11.3. Maybe we could apply these changes "in-place" in a build container to test if they solve the issue?
diff -u /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/ctypes/util.py custom_ctypes/lib/python3.11/site-packages/ctypes/util.py
--- /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/ctypes/util.py 2024-04-30 16:38:09.000000000 +0200
+++ custom_ctypes/lib/python3.11/site-packages/ctypes/util.py 2024-05-30 16:17:44.000000000 +0200
@@ -326,7 +326,10 @@
def find_library(name):
# See issue #9998
+ lib = _findLib_gcc(name)
+ # return absolute path
return _findSoname_ldconfig(name) or \
+ os.path.join(os.path.dirname(lib), _get_soname(lib)) or \
_get_soname(_findLib_gcc(name)) or _get_soname(_findLib_ld(name))
################################################################
I tried to replace the util.py globally (for all installations in https://github.com/NorESSI/software-layer/pull/387), but that already leads to a failure when building/installing scikit-image (third package). See below for details. When I don't use the modified util.py it fails with the same error @casparvl has hit when building librosa.
File "/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/amd/zen2/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/ctypes/util.py", line 332, in find_library
os.path.join(os.path.dirname(lib), _get_soname(lib)) or \
^^^^^^^^^^^^^^^^^^^^
File "<frozen posixpath>", line 152, in dirname
TypeError: expected str, bytes or os.PathLike object, not NoneType
error: subprocess-exited-with-error
Will try to use that modified file only when building/using librosa.
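The TypeError in the traceback above is consistent with the gcc-based lookup returning None for a library it cannot find, after which os.path.dirname(None) raises. A guarded variant of the "join directory and soname" step might look like the sketch below. The helper name full_path_for is hypothetical, and get_soname is an injected stand-in for the private soname lookup in ctypes.util, so that the logic can be exercised without touching ctypes internals.

```python
import os


def full_path_for(lib, get_soname):
    """Join the directory of `lib` with its soname, as the patched
    find_library does, but return None when `lib` was not found."""
    if not lib:
        return None  # nothing found: let the caller try the next fallback
    soname = get_soname(lib)
    if not soname:
        return None  # no soname could be read from the file
    return os.path.join(os.path.dirname(lib), soname)
```

Because the helper returns None instead of raising, it can sit inside an `a or b or c` fallback chain like the one in the diff without breaking lookups for libraries that the gcc-based search does not find.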
I've worked out a fix for the import soundfile issue. See https://github.com/NorESSI/software-layer/pull/391
If it works out there, I'll test it with PyTorch-bundle. We can discuss how we should employ this fix (maybe it's better to ship the custom ctypes with EESSI, but for lack of a better idea where to put it the above PR puts it under host_injections).
I updated https://github.com/NorESSI/software-layer/pull/387 with the fixes in https://github.com/NorESSI/software-layer/pull/391 to work around the failing sanity check (python -c 'import soundfile'). PyTorch (with CUDA) builds for x86_64/{generic,intel/skylake_avx512,amd/zen2}. It fails for aarch64/generic and x86_64/intel/broadwell with a different issue. It could be worth applying the fixes here as well to see which builds work (and which don't).
@trz42 I remember you said in a meeting that simply patching ctypes caused issues in other packages. I think the idea was then to pick up a 'patched' ctypes only for a specific phase of the build (the test step? I don't fully remember...). However, it was also brought up in that meeting that this fix would make the build pass, but users would still run into it at runtime, right?
I was thinking: what if we patch ctypes to add a different API call? I.e. a find_library with an extra argument full_path (which defaults to False, i.e. the current behaviour). And then, we patch librosa to call find_library(..., full_path=True). That way, you only get the full path back if you intentionally patch an application that depends on this find_library call. That should have no unintended fallout (because the default function call retains its prior behaviour of only returning the library name, not the full library path), while giving us an easy way to fix future similar issues (simply patch the function calls to find_library to add the full_path=True argument). It would also mean it is solved for these packages at runtime as well (we simply patched the package).
Now, this would be super annoying if there are packages that do a lot of find_library calls, since it means a lot of patching. But I assume that should be pretty limited (I mean... how many external libraries can a single package use, right...? Or did I now jinx it :P)
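As a rough illustration of the opt-in idea above, the sketch below builds a find_library with a full_path keyword that leaves the default result untouched. Everything here apart from ctypes.util.find_library is hypothetical: the factory make_find_library, and in particular the locate callable, which stands in for whatever mechanism a real patch would use to map a library file name to its directory.

```python
import ctypes.util
import os


def make_find_library(base, locate):
    """Build a find_library with an opt-in `full_path` flag.

    `base` is the underlying lookup (e.g. ctypes.util.find_library);
    `locate` maps a library file name to its directory, or None.
    """
    def find_library(name, full_path=False):
        found = base(name)
        if found is None or not full_path or os.path.isabs(found):
            return found  # default behaviour is unchanged
        directory = locate(found)
        if directory is None:
            return found  # fall back to the bare name rather than failing
        return os.path.join(directory, found)
    return find_library


# Wiring against the real lookup; the locate stub below is a placeholder.
find_library = make_find_library(ctypes.util.find_library, lambda filename: None)
```

With this shape, existing callers that do not pass full_path see exactly what they saw before, and only call sites that were deliberately patched receive a full path.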
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3
New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen3 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_585/15837

date | job status | comment
---|---|---
Aug 07 11:30:23 UTC 2024 | submitted | job id 15837 awaits release by job manager
Aug 07 11:30:57 UTC 2024 | released | job awaits launch by Slurm scheduler
Aug 07 11:36:00 UTC 2024 | running | job 15837 is running
Aug 07 12:38:08 UTC 2024 | finished | :cry: FAILURE (click triangle for details)
Aug 07 12:38:08 UTC 2024 | test result | :grin: SUCCESS (click triangle for details)
=========================== short test summary info ============================
FAILED test/test_image.py::test_decode_jpeg[None-ImageReadMode.UNCHANGED-grace_hopper_517x606.jpg]
FAILED test/test_image.py::test_decode_jpeg[None-ImageReadMode.UNCHANGED-cmyk_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[None-ImageReadMode.UNCHANGED-gray_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[None-ImageReadMode.UNCHANGED-rgb_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[L-ImageReadMode.GRAY-grace_hopper_517x606.jpg]
FAILED test/test_image.py::test_decode_jpeg[L-ImageReadMode.GRAY-cmyk_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[L-ImageReadMode.GRAY-gray_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[L-ImageReadMode.GRAY-rgb_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[RGB-ImageReadMode.RGB-grace_hopper_517x606.jpg]
FAILED test/test_image.py::test_decode_jpeg[RGB-ImageReadMode.RGB-cmyk_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[RGB-ImageReadMode.RGB-gray_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[RGB-ImageReadMode.RGB-rgb_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg_errors - AssertionError: Regex pa...
FAILED test/test_image.py::test_decode_bad_huffman_images - RuntimeError: dec...
FAILED test/test_image.py::test_damaged_corrupt_images[corrupt.jpg] - Asserti...
FAILED test/test_image.py::test_damaged_corrupt_images[corrupt34_2.jpg] - Ass...
FAILED test/test_image.py::test_damaged_corrupt_images[corrupt34_3.jpg] - Ass...
FAILED test/test_image.py::test_damaged_corrupt_images[corrupt34_4.jpg] - Ass...
FAILED test/test_image.py::test_encode_jpeg_errors - AssertionError: Regex pa...
FAILED test/test_image.py::test_encode_jpeg[grace_hopper_517x606.jpg] - Runti...
FAILED test/test_image.py::test_write_jpeg[grace_hopper_517x606.jpg] - Runtim...
= 21 failed, 48811 passed, 50354 skipped, 2503 deselected, 2220 warnings in 965.82s (0:16:05) =
All of the failures look something like this:
=================================== FAILURES ===================================
___ test_decode_jpeg[None-ImageReadMode.UNCHANGED-grace_hopper_517x606.jpg] ____
test/test_image.py:94: in test_decode_jpeg
img_ljpeg = decode_image(data, mode=mode)
/tmp/eb-fwlstir4/eb-ghhapv8m/tmpxrxoma_b/lib/python3.11/site-packages/torchvision/io/image.py:236: in decode_image
output = torch.ops.image.decode_image(input, mode.value)
/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/PyTorch/2.1.2-foss-2023a/lib/python3.11/site-packages/torch/_ops.py:692: in __call__
return self._op(*args, **kwargs or {})
E RuntimeError: decode_jpeg: torchvision not compiled with libjpeg support
@casparvl the torchvision issue should be fixed with an updated easyblock. Either use that or use EB v4.9.2 which should come with the fix. See #603
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3
New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen3 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_585/15896

date | job status | comment
---|---|---
Aug 08 13:33:12 UTC 2024 | submitted | job id 15896 awaits release by job manager
Aug 08 13:33:51 UTC 2024 | released | job awaits launch by Slurm scheduler
Aug 08 13:39:54 UTC 2024 | running | job 15896 is running
Aug 08 15:09:58 UTC 2024 | finished | :grin: SUCCESS (click triangle for details)
Aug 08 15:09:58 UTC 2024 | test result | :grin: SUCCESS (click triangle for details)
bot: build arch:aarch64/generic repo:eessi.io-2023.06-software
error: patch failed: easystacks/software.eessi.io/2023.06/eessi-2023.06-eb-4.9.2-2023a.yml:41
error: easystacks/software.eessi.io/2023.06/eessi-2023.06-eb-4.9.2-2023a.yml: patch does not apply
Unable to download or merge changes between the source branch and the destination branch.
Tip: This can usually be resolved by syncing your branch and resolving any merge conflicts.
bot: build arch:aarch64/generic repo:eessi.io-2023.06-software
New job on instance eessi-bot-mc-aws for architecture aarch64-generic for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_585/16164

date | job status | comment
---|---|---
Aug 13 09:58:07 UTC 2024 | submitted | job id 16164 awaits release by job manager
Aug 13 09:58:41 UTC 2024 | released | job awaits launch by Slurm scheduler
Aug 13 10:03:44 UTC 2024 | running | job 16164 is running
Aug 13 10:56:58 UTC 2024 | finished | :cry: FAILURE (click triangle for details)
Aug 13 10:56:58 UTC 2024 | test result | :grin: SUCCESS (click triangle for details)
Yep, reproduced @trz42 's issue:
== 2024-08-13 10:17:49,272 easyblock.py:3632 WARNING Sanity check: sanity check command python -c 'import sentencepiece' exited with code 1 (output: Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/SentencePiece/0.2.0-GCC-12.3.0/lib/python3.11/site-packages/sentencepiece/__init__.py", line 10, in <module>
from . import _sentencepiece
ImportError: /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so.4: cannot allocate memory in static TLS block
)
Related: https://github.com/gluster/glusterfs/issues/2913 and probably why @trz42 disabled it in his patch :)
But, maybe we should dig into a solution without disabling it, as now we get segmentation faults in the tests, which might be related to disabling the use of tcmalloc...
I think this might put us on track for a solution: maybe we should patch sentencepiece to load libtcmalloc_minimal.so.4 earlier on. I've asked ChatGPT for help:
======= I get an error from libtcmalloc: 'cannot allocate memory in static TLS block'. Can you help me understand this error? What is static TLS?
The error message 'cannot allocate memory in static TLS block' typically occurs when there's an issue with memory allocation related to Thread-Local Storage (TLS) in a program. Let's break down the concepts involved to understand this error better.
Thread-Local Storage (TLS):
Static TLS vs. Dynamic TLS:
__thread or thread_local specifier in languages like C and C++. The size of the static TLS block is determined at compile time and allocated in the program’s binary.
The error 'cannot allocate memory in static TLS block' usually occurs in the following scenarios:
Exceeding Static TLS Limit:
Library Compatibility Issues:
libtcmalloc (a high-performance memory allocator) is not fully compatible with how another library uses TLS, this can cause the error.
Platform Limitations:
Reduce TLS Usage:
Switch to Dynamic TLS:
Upgrade or Reconfigure Libraries:
Increase Static TLS Size:
Use an Alternative Memory Allocator:
libtcmalloc, consider switching to a different memory allocator that is compatible with your setup.
Check Compiler and Linker Options:
By understanding the underlying mechanisms of TLS and its limitations, you can diagnose and address the 'cannot allocate memory in static TLS block' error more effectively.
=======
All in all, not a bad answer.
Another nice source: this mentions the root cause. With more information here
sentencepiece or (probably better) gperftools on aarch64 architectures set an LD_PRELOAD for this. It's not a very elegant solution, but would likely work.
__init__ of sentencepiece to dlopen the tcmalloc library before doing anything else. An early load of tcmalloc should hopefully ensure that enough static TLS is available. Basically: this solution
This seems to discuss an even more fundamental 'fix' at the glibc level: https://bugzilla.redhat.com/show_bug.cgi?id=1871396 . I'm wondering if any of that ever made it in, and if e.g. a newer compat layer (with newer glibc) would not even have this issue anymore...
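The idea of loading tcmalloc early from a package's __init__ could be sketched roughly as follows. This is only an illustration: the helper name preload is made up, and whether an early load actually avoids the static TLS error on aarch64 is the hypothesis discussed in this thread, not something the snippet verifies.

```python
import ctypes


def preload(candidates, mode=ctypes.RTLD_GLOBAL):
    """Try to dlopen each candidate in order; return the first handle that
    loads, or None when none of them can be loaded."""
    for name in candidates:
        try:
            return ctypes.CDLL(name, mode=mode)
        except OSError:
            continue  # not available here: try the next candidate
    return None


# Hypothetical use at the very top of a package's __init__.py, before any
# extension module is imported:
# _tcmalloc = preload(["libtcmalloc_minimal.so.4"])
```

Returning None rather than raising keeps the import working on systems where the library is absent, which seems preferable for a workaround that is only needed on some architectures.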