EESSI / software-layer

Software layer of the EESSI project
https://eessi.github.io/docs/software_layer
GNU General Public License v2.0

{2023.06}[2023a] PyTorch-Bundle v2.1.2 #585

Open casparvl opened 4 months ago

casparvl commented 4 months ago
15 out of 137 required modules missing:

* parameterized/0.9.0-GCCcore-12.3.0 (parameterized-0.9.0-GCCcore-12.3.0.eb)
* tqdm/4.66.1-GCCcore-12.3.0 (tqdm-4.66.1-GCCcore-12.3.0.eb)
* LLVM/14.0.6-GCCcore-12.3.0-llvmlite (LLVM-14.0.6-GCCcore-12.3.0-llvmlite.eb)
* Scalene/1.5.26-GCCcore-12.3.0 (Scalene-1.5.26-GCCcore-12.3.0.eb)
* gperftools/2.12-GCCcore-12.3.0 (gperftools-2.12-GCCcore-12.3.0.eb)
* SentencePiece/0.2.0-GCC-12.3.0 (SentencePiece-0.2.0-GCC-12.3.0.eb)
* tensorboard/2.15.1-gfbf-2023a (tensorboard-2.15.1-gfbf-2023a.eb)
* imageio/2.33.1-gfbf-2023a (imageio-2.33.1-gfbf-2023a.eb)
* libmad/0.15.1b-GCCcore-12.3.0 (libmad-0.15.1b-GCCcore-12.3.0.eb)
* SoX/14.4.2-GCCcore-12.3.0 (SoX-14.4.2-GCCcore-12.3.0.eb)
* NLTK/3.8.1-foss-2023a (NLTK-3.8.1-foss-2023a.eb)
* numba/0.58.1-foss-2023a (numba-0.58.1-foss-2023a.eb)
* scikit-image/0.22.0-foss-2023a (scikit-image-0.22.0-foss-2023a.eb)
* librosa/0.10.1-foss-2023a (librosa-0.10.1-foss-2023a.eb)
* PyTorch-bundle/2.1.2-foss-2023a (PyTorch-bundle-2.1.2-foss-2023a.eb)
eessi-bot[bot] commented 4 months ago

Instance eessi-bot-mc-aws is configured to build:

eessi-bot[bot] commented 4 months ago

Instance eessi-bot-mc-azure is configured to build:

casparvl commented 4 months ago

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3

eessi-bot[bot] commented 4 months ago
Updates by the bot instance eessi-bot-mc-aws
- received bot command `build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3` from `casparvl`
- expanded format: `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3`
- handling command `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3` resulted in:
- submitted job `11283`, for details & status see https://github.com/EESSI/software-layer/pull/585#issuecomment-2126640524
eessi-bot[bot] commented 4 months ago
Updates by the bot instance eessi-bot-mc-azure
- account `casparvl` has NO permission to send commands to the bot
eessi-bot[bot] commented 4 months ago
New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen3 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_585/11283 date job status comment
May 23 09:23:40 UTC 2024 submitted job id 11283 awaits release by job manager
May 23 09:24:02 UTC 2024 released job awaits launch by Slurm scheduler
May 23 09:28:04 UTC 2024 running job 11283 is running
May 23 09:33:17 UTC 2024 finished
:cry: FAILURE (click triangle for details)
Details
:white_check_mark: job output file slurm-11283.out
:x: found message matching ERROR:
:white_check_mark: no message matching FAILED:
:x: found message matching required modules missing:
:x: no message matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
No artefacts were created or found.
May 23 09:33:17 UTC 2024 test result
:grin: SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-11283.out
:x: found message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case
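The check list above reads like a set of pattern checks on the job's Slurm output file. A minimal Python sketch of that idea, assuming plain substring matching on the five messages shown in the comment; the name `check_build_output` and the rule table are hypothetical and not taken from the bot's source:

```python
# Hypothetical sketch: (pattern, should_be_present) pairs mirroring the
# messages listed in the bot comment. Not the bot's actual implementation.
RULES = [
    ("ERROR:", False),
    ("FAILED:", False),
    ("required modules missing:", False),
    ("No missing installations", True),
    (".tar.gz created!", True),
]


def check_build_output(text):
    """Return (ok, results); results maps each pattern to pass (True) or fail."""
    results = {}
    for pattern, should_be_present in RULES:
        # A rule passes when the pattern's presence matches the expectation.
        results[pattern] = (pattern in text) == should_be_present
    return all(results.values()), results


sample = (
    "ERROR: Missing dependencies: SentencePiece/0.2.0-foss-2023a\n"
    "15 out of 137 required modules missing:\n"
    "eessi-2023.06-software.tar.gz created!\n"
)
ok, results = check_build_output(sample)
print("SUCCESS" if ok else "FAILURE")
```

Under these assumptions a job is only reported as successful when every rule passes, which would explain why a run that builds nothing but still creates a tarball is flagged as a failure.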
casparvl commented 4 months ago
== No easyconfigs left to be built.
ERROR: Missing dependencies: SentencePiece/0.2.0-foss-2023a, SoX/14.4.2-foss-2023a (no easyconfig file or existing module found)
== Build succeeded for 0 out of 0
  >> download succeeded: https://github.com/easybuilders/easybuild-easyconfigs/archive/7124863ed588066e5a988b4073d91381497a7f64.tar.gz
  >> running command:
        [started at: 2024-05-23 09:28:34]
        [working dir: /tmp/eb-dlj1ws2x/eb-9tn8fu3_/tmpp3me5uio/easybuilders]
        [output logged in /tmp/eb-dlj1ws2x/eb-9tn8fu3_/easybuild-run_cmd-t6inmlw4.log]
        tar xzf /tmp/eb-dlj1ws2x/eb-9tn8fu3_/tmpp3me5uio/easybuilders/7124863ed588066e5a988b4073d91381497a7f64.tar.gz
  >> command completed: exit 0, ran in 00h00m01s
== found valid index for /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/EasyBuild/4.9.1/easybuild/easyconfigs, so using it...
== Running parse hook for PyTorch-bundle-2.1.2-foss-2023a.eb...
== Running parse hook for foss-2023a.eb...
== resolving dependencies ...
== Running parse hook for parameterized-0.9.0-GCCcore-12.3.0.eb...
== Running parse hook for GCCcore-12.3.0.eb...
== Running parse hook for GCCcore-12.3.0.eb...
== Running parse hook for scikit-image-0.22.0-foss-2023a.eb...
== Running parse hook for librosa-0.10.1-foss-2023a.eb...
== Running parse hook for imageio-2.33.1-gfbf-2023a.eb...
== Running parse hook for gfbf-2023a.eb...
== Running parse hook for gfbf-2023a.eb...
== Running parse hook for GCC-12.3.0.eb...
== Running parse hook for FlexiBLAS-3.3.1-GCC-12.3.0.eb...
== Running parse hook for GCC-12.3.0.eb...
== Running parse hook for FFTW-3.3.10-GCC-12.3.0.eb...
== Running parse hook for NLTK-3.8.1-foss-2023a.eb...
== Running parse hook for numba-0.58.1-foss-2023a.eb...
== Running parse hook for Scalene-1.5.26-GCCcore-12.3.0.eb...
== Running parse hook for tqdm-4.66.1-GCCcore-12.3.0.eb...
== Running parse hook for LLVM-14.0.6-GCCcore-12.3.0-llvmlite.eb...
== Running parse hook for tensorboard-2.15.1-gfbf-2023a.eb...

I guess that with --from-pr we got SentencePiece and SoX correctly since they were already in develop, but with --from-commit we don't? Should I combine multiple --from-commit options, one for each of those (i.e. look up the commit that provided the required SentencePiece, etc.)?

bedroge commented 4 months ago

> I guess that with --from-pr we got SentencePiece and SoX correctly since they were already in develop, but with --from-commit we don't? Should I combine multiple --from-commit options, one for each of those (i.e. look up the commit that provided the required SentencePiece, etc.)?

I (and @trz42 and @ocaisa ) also saw issues with using --from-commit, see for instance https://github.com/EESSI/software-layer/pull/558#issuecomment-2090836084.

bedroge commented 4 months ago

Could you try using the merge commit (see bottom of the PR: 04ccd901a613631b00ccbe504d6d66d6a6c2febb) and check if that does work?

casparvl commented 4 months ago

I tried manually

eb -D PyTorch-bundle-2.1.2-foss-2023a-CUDA-12.1.1.eb --from-commit 04ccd901a613631b00ccbe504d6d66d6a6c2febb

But that still shows missing EasyConfigs.

bedroge commented 4 months ago

> I tried manually
>
> eb -D PyTorch-bundle-2.1.2-foss-2023a-CUDA-12.1.1.eb --from-commit 04ccd901a613631b00ccbe504d6d66d6a6c2febb
>
> But that still shows missing EasyConfigs.

Guess we need to stick to --from-pr then until we find a solution for this...

casparvl commented 4 months ago

I was being stupid. I made a mistake in what I ran manually: that's with CUDA. That's not included in that PR/commit for sure... :P However,

eb -D PyTorch-bundle-2.1.2-foss-2023a.eb --from-commit 04ccd901a613631b00ccbe504d6d66d6a6c2febb

shows the same missing easyconfigs. I've switched to --from-pr for now. I'll try to create an upstream issue on EasyBuild later (if there isn't any yet).

casparvl commented 4 months ago

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3

eessi-bot[bot] commented 4 months ago
Updates by the bot instance eessi-bot-mc-aws
- received bot command `build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3` from `casparvl`
- expanded format: `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3`
- handling command `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3` resulted in:
- submitted job `11288`, for details & status see https://github.com/EESSI/software-layer/pull/585#issuecomment-2126916521
eessi-bot[bot] commented 4 months ago
Updates by the bot instance eessi-bot-mc-azure
- account `casparvl` has NO permission to send commands to the bot
eessi-bot[bot] commented 4 months ago
New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen3 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_585/11288 date job status comment
May 23 11:50:20 UTC 2024 submitted job id 11288 awaits release by job manager
May 23 11:50:42 UTC 2024 released job awaits launch by Slurm scheduler
May 23 11:55:44 UTC 2024 running job 11288 is running
May 23 12:23:21 UTC 2024 finished
:cry: FAILURE (click triangle for details)
Details
:white_check_mark: job output file slurm-11288.out
:x: found message matching ERROR:
:x: found message matching FAILED:
:x: found message matching required modules missing:
:x: no message matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen3-1716466678.tar.gz
size: 162 MiB (170601270 bytes)
entries: 6321
modules under 2023.06/software/linux/x86_64/amd/zen3/modules/all
imageio/2.33.1-gfbf-2023a.lua
LLVM/14.0.6-GCCcore-12.3.0-llvmlite.lua
NLTK/3.8.1-foss-2023a.lua
numba/0.58.1-foss-2023a.lua
parameterized/0.9.0-GCCcore-12.3.0.lua
Scalene/1.5.26-GCCcore-12.3.0.lua
scikit-image/0.22.0-foss-2023a.lua
tqdm/4.66.1-GCCcore-12.3.0.lua
software under 2023.06/software/linux/x86_64/amd/zen3/software
imageio/2.33.1-gfbf-2023a
LLVM/14.0.6-GCCcore-12.3.0-llvmlite
NLTK/3.8.1-foss-2023a
numba/0.58.1-foss-2023a
parameterized/0.9.0-GCCcore-12.3.0
Scalene/1.5.26-GCCcore-12.3.0
scikit-image/0.22.0-foss-2023a
tqdm/4.66.1-GCCcore-12.3.0
other under 2023.06/software/linux/x86_64/amd/zen3
no other files in tarball
May 23 12:23:21 UTC 2024 test result
:grin: SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-11288.out
:x: found message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case
casparvl commented 4 months ago

This is the actual failure:

== 2024-05-23 12:17:16,011 build_log.py:171 ERROR EasyBuild crashed with an error (at easybuild/tools/build_log.py:111 in caller_info): Sanity check failed: extensions sanity check failed for 1 extensions: soundfile
failing sanity check for 'soundfile' extension: command "python -c "import soundfile"" failed; output:
Traceback (most recent call last):
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/librosa/0.10.1-foss-2023a/lib/python3.11/site-packages/soundfile.py", line 161, in <module>
    import _soundfile_data  # ImportError if this doesn't exist
    ^^^^^^^^^^^^^^^^^^^^^^
ModuleNotFoundError: No module named '_soundfile_data'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/librosa/0.10.1-foss-2023a/lib/python3.11/site-packages/soundfile.py", line 171, in <module>
    _snd = _ffi.dlopen(_libname)
           ^^^^^^^^^^^^^^^^^^^^^
OSError: cannot load library 'libsndfile.so.1': libsndfile.so.1: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/librosa/0.10.1-foss-2023a/lib/python3.11/site-packages/soundfile.py", line 192, in <module>
    _snd = _ffi.dlopen(_explicit_libname)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: cannot load library 'libsndfile.so': libsndfile.so: cannot open shared object file: No such file or directory,  (at easybuild/framework/easyblock.py:3669 in _sanity_check_step)

I guess this should be provided by the module libsndfile/1.2.2-GCCcore-12.3.0, but I'm not sure which paths get searched by this dlopen call. I think it searches LD_LIBRARY_PATH, which we don't set in EESSI.

I guess this is a pretty fundamental question: how do we make dlopen calls successfully find libs from the EESSI software prefix?
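To see which case applies on a given system, it can help to check what `ctypes.util.find_library` actually returns. A small Python sketch; the helper `classify_lookup` is a hypothetical name, and the comments describe how `dlopen` on Linux treats names with and without a slash:

```python
import ctypes.util


def classify_lookup(result):
    """Classify a ctypes.util.find_library() return value."""
    if result is None:
        return "not found"
    if "/" in result:
        # dlopen() treats a name containing a slash as a pathname and opens
        # exactly that file; no search path is involved.
        return "path"
    # A bare name such as 'libsndfile.so.1' makes dlopen() fall back to the
    # dynamic loader's search (RPATH/RUNPATH of the caller, LD_LIBRARY_PATH,
    # the ld.so cache, default directories).
    return "bare name"


name = "sndfile"
result = ctypes.util.find_library(name)
print(f"find_library({name!r}) -> {result!r} ({classify_lookup(result)})")
```

If this prints a bare name while the library only lives under the software prefix, the later `dlopen` call has no way to locate it unless one of the loader's search locations points there.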

ocaisa commented 4 months ago

See https://github.com/EESSI/software-layer/issues/192 , the Alliance have a solution for this

casparvl commented 4 months ago

Spot on, it is indeed the issue of ctypes.util's find_library only returning the filename, not the full path. Or at least: I see that it is using find_library here to get the _libname, which is then used as the dlopen argument. I.e. I expect that if find_library correctly returned the full path, the dlopen call would have succeeded.

The downside is that the Alliance's solution looks quite involved... The upside is we can probably use their shadowing lib from https://github.com/ComputeCanada/custom_ctypes/tree/main/lib . What I don't fully understand is the sitecustomize and ebpythonprefixes stuff they do. Also, they seem to make a separate module out of it, and I'm not entirely sure why (do they only load it when they need to?).

I guess my main consideration would be if we shouldn't just always have this patched find_library function in place. In that case, a simple patch to the installation that normally contains ctypes (I guess that's in the standard Python installation?) would then be enough...

ocaisa commented 4 months ago

I was also thinking that maybe a patch on ctypes is enough, I don't fully understand all the other stuff going on with them

trz42 commented 4 months ago

The changes they apply to ctypes are quite small. See below for Python/3.11.3. Maybe we could apply these changes "in-place" in a build container to test if they solve the issue?

diff -u /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/ctypes/util.py custom_ctypes/lib/python3.11/site-packages/ctypes/util.py
--- /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/ctypes/util.py      2024-04-30 16:38:09.000000000 +0200
+++ custom_ctypes/lib/python3.11/site-packages/ctypes/util.py   2024-05-30 16:17:44.000000000 +0200
@@ -326,7 +326,10 @@

         def find_library(name):
             # See issue #9998
+            lib = _findLib_gcc(name)
+            # return absolute path
             return _findSoname_ldconfig(name) or \
+                    os.path.join(os.path.dirname(lib), _get_soname(lib)) or \
                    _get_soname(_findLib_gcc(name)) or _get_soname(_findLib_ld(name))

 ################################################################
trz42 commented 4 months ago

I tried to replace util.py globally (for all installations in https://github.com/NorESSI/software-layer/pull/387), but that already leads to a failure when building/installing scikit-image (the third package). See below for details. When I don't use the modified util.py, it fails with the same error @casparvl hit when building librosa.

    File "/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/amd/zen2/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/ctypes/util.py", line 332, in find_library
      os.path.join(os.path.dirname(lib), _get_soname(lib)) or \
                   ^^^^^^^^^^^^^^^^^^^^
    File "<frozen posixpath>", line 152, in dirname
  TypeError: expected str, bytes or os.PathLike object, not NoneType
  error: subprocess-exited-with-error

Will try to use that modified file only when building/using librosa.
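The traceback points at `os.path.dirname(lib)` being evaluated while `lib` is `None`, which happens whenever the gcc-based lookup finds nothing. A guarded variant could look like the following Python sketch; `join_dir_and_soname` is a hypothetical helper, not part of `ctypes` or of the Alliance's patch, and it takes the lookup results as plain arguments rather than calling the private `ctypes.util` functions:

```python
import os


def join_dir_and_soname(lib, soname):
    """Return '<dirname(lib)>/<soname>', or None if either input is missing.

    lib:    full path found by a lookup step, or None when nothing was found
    soname: the library's soname, or None when it could not be determined
    """
    if not lib or not soname:
        # Returning None lets an 'a or b or c' chain fall through to the
        # next lookup strategy instead of raising TypeError.
        return None
    return os.path.join(os.path.dirname(lib), soname)
```

Used in place of the bare `os.path.join(os.path.dirname(lib), ...)` expression, a helper like this would keep the `or` chain in `find_library` working for libraries that the gcc-based lookup cannot find.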

trz42 commented 3 months ago

I've worked out a fix for the import soundfile issue. See https://github.com/NorESSI/software-layer/pull/391

If it works out there, I'll test it with PyTorch-bundle. We can discuss how we should employ this fix (maybe it's better to ship the custom ctypes with EESSI, but for lack of a better idea where to put it, the above PR puts it under host_injections).

trz42 commented 3 months ago

I updated https://github.com/NorESSI/software-layer/pull/387 with the fixes in https://github.com/NorESSI/software-layer/pull/391 to work around the failing sanity check (python -c 'import soundfile'). PyTorch (with CUDA) builds for x86_64/{generic,intel/skylake_avx512,amd/zen2}. It fails for aarch64/generic and x86_64/intel/broadwell with a different issue. It could be worth applying the fixes also here and see which builds work (and which don't).

casparvl commented 3 months ago

@trz42 I remember you said in a meeting that simply patching ctypes caused issues in other packages. I think the idea was then to pick up a 'patched' ctypes only for a specific phase of the build (the test step? I don't fully remember...). However, it was also brought up in that meeting that this fix would make the build pass, but users would still run into it at runtime, right?

I was thinking: what if we patch ctypes to add a different API call, i.e. a find_library with an extra argument full_path (which defaults to False, i.e. the current behaviour)? And then we patch librosa to call find_library(..., full_path=True). That way, you only get the full path back if you intentionally patch an application that depends on this find_library call. That should have no unintended fallout (because the default function call retains its prior behaviour of only returning the library name, not the full library path), while giving us an easy way to fix future similar issues (simply patch the calls to find_library to add the full_path=True argument). It would also mean it is solved for these packages at runtime as well (we simply patched the package).

Now, this would be super annoying if there are packages that do a lot of find_library calls, since it means a lot of patching. But I assume that should be pretty limited (I mean... how many external libraries can a single package use, right...? Or did I now jinx it :P)
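The opt-in idea can be sketched as a thin wrapper in Python. Everything here is illustrative: the `full_path` and `search_dirs` parameters, the use of `LIBRARY_PATH` as the default list of directories, and the `finder` hook are assumptions made for this sketch, not an existing `ctypes` API or an agreed EESSI design:

```python
import ctypes.util
import os


def find_library(name, full_path=False, search_dirs=None,
                 finder=ctypes.util.find_library):
    """Like ctypes.util.find_library, optionally resolving to a full path.

    With full_path=False (the default) the stock result is returned as is.
    """
    result = finder(name)
    if not full_path or result is None or os.path.isabs(result):
        return result
    if search_dirs is None:
        # Assumption: loaded modules expose their lib directories here.
        search_dirs = os.environ.get("LIBRARY_PATH", "").split(os.pathsep)
    for directory in search_dirs:
        candidate = os.path.join(directory, result)
        if directory and os.path.isfile(candidate):
            return os.path.abspath(candidate)
    # Nothing matched: fall back to the stock behaviour.
    return result
```

A patched package would then call `find_library("sndfile", full_path=True)`, while every unpatched caller keeps getting the bare library name.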

ocaisa commented 1 month ago

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3

eessi-bot[bot] commented 1 month ago
Updates by the bot instance eessi-bot-mc-aws
- received bot command `build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3` from `ocaisa`
- expanded format: `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3`
- handling command `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3` resulted in:
- submitted job `15837`, for details & status see https://github.com/EESSI/software-layer/pull/585#issuecomment-2273252775
eessi-bot[bot] commented 1 month ago
Updates by the bot instance eessi-bot-mc-azure
- received bot command `build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3` from `ocaisa`
- expanded format: `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3`
- handling command `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3` resulted in:
- no jobs were submitted
eessi-bot[bot] commented 1 month ago
New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen3 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_585/15837 date job status comment
Aug 07 11:30:23 UTC 2024 submitted job id 15837 awaits release by job manager
Aug 07 11:30:57 UTC 2024 released job awaits launch by Slurm scheduler
Aug 07 11:36:00 UTC 2024 running job 15837 is running
Aug 07 12:38:08 UTC 2024 finished
:cry: FAILURE (click triangle for details)
Details
:white_check_mark: job output file slurm-15837.out
:x: found message matching ERROR:
:x: found message matching FAILED:
:x: found message matching required modules missing:
:x: no message matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen3-1723033285.tar.gz
size: 144 MiB (151384274 bytes)
entries: 4814
modules under 2023.06/software/linux/x86_64/amd/zen3/modules/all
gperftools/2.12-GCCcore-12.3.0.lua
imageio/2.33.1-gfbf-2023a.lua
libmad/0.15.1b-GCCcore-12.3.0.lua
NLTK/3.8.1-foss-2023a.lua
parameterized/0.9.0-GCCcore-12.3.0.lua
Scalene/1.5.26-GCCcore-12.3.0.lua
scikit-image/0.22.0-foss-2023a.lua
SentencePiece/0.2.0-GCC-12.3.0.lua
SoX/14.4.2-GCCcore-12.3.0.lua
tensorboard/2.15.1-gfbf-2023a.lua
tqdm/4.66.1-GCCcore-12.3.0.lua
software under 2023.06/software/linux/x86_64/amd/zen3/software
gperftools/2.12-GCCcore-12.3.0
imageio/2.33.1-gfbf-2023a
libmad/0.15.1b-GCCcore-12.3.0
NLTK/3.8.1-foss-2023a
parameterized/0.9.0-GCCcore-12.3.0
Scalene/1.5.26-GCCcore-12.3.0
scikit-image/0.22.0-foss-2023a
SentencePiece/0.2.0-GCC-12.3.0
SoX/14.4.2-GCCcore-12.3.0
tensorboard/2.15.1-gfbf-2023a
tqdm/4.66.1-GCCcore-12.3.0
other under 2023.06/software/linux/x86_64/amd/zen3
no other files in tarball
Aug 07 12:38:08 UTC 2024 test result
:grin: SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 17/17 test case(s) from 17 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-15837.out
:x: found message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case
casparvl commented 1 month ago
=========================== short test summary info ============================
FAILED test/test_image.py::test_decode_jpeg[None-ImageReadMode.UNCHANGED-grace_hopper_517x606.jpg]
FAILED test/test_image.py::test_decode_jpeg[None-ImageReadMode.UNCHANGED-cmyk_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[None-ImageReadMode.UNCHANGED-gray_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[None-ImageReadMode.UNCHANGED-rgb_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[L-ImageReadMode.GRAY-grace_hopper_517x606.jpg]
FAILED test/test_image.py::test_decode_jpeg[L-ImageReadMode.GRAY-cmyk_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[L-ImageReadMode.GRAY-gray_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[L-ImageReadMode.GRAY-rgb_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[RGB-ImageReadMode.RGB-grace_hopper_517x606.jpg]
FAILED test/test_image.py::test_decode_jpeg[RGB-ImageReadMode.RGB-cmyk_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[RGB-ImageReadMode.RGB-gray_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[RGB-ImageReadMode.RGB-rgb_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg_errors - AssertionError: Regex pa...
FAILED test/test_image.py::test_decode_bad_huffman_images - RuntimeError: dec...
FAILED test/test_image.py::test_damaged_corrupt_images[corrupt.jpg] - Asserti...
FAILED test/test_image.py::test_damaged_corrupt_images[corrupt34_2.jpg] - Ass...
FAILED test/test_image.py::test_damaged_corrupt_images[corrupt34_3.jpg] - Ass...
FAILED test/test_image.py::test_damaged_corrupt_images[corrupt34_4.jpg] - Ass...
FAILED test/test_image.py::test_encode_jpeg_errors - AssertionError: Regex pa...
FAILED test/test_image.py::test_encode_jpeg[grace_hopper_517x606.jpg] - Runti...
FAILED test/test_image.py::test_write_jpeg[grace_hopper_517x606.jpg] - Runtim...
= 21 failed, 48811 passed, 50354 skipped, 2503 deselected, 2220 warnings in 965.82s (0:16:05) =

All of the failures look something like this:

=================================== FAILURES ===================================
___ test_decode_jpeg[None-ImageReadMode.UNCHANGED-grace_hopper_517x606.jpg] ____
test/test_image.py:94: in test_decode_jpeg
    img_ljpeg = decode_image(data, mode=mode)
/tmp/eb-fwlstir4/eb-ghhapv8m/tmpxrxoma_b/lib/python3.11/site-packages/torchvision/io/image.py:236: in decode_image
    output = torch.ops.image.decode_image(input, mode.value)
/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/PyTorch/2.1.2-foss-2023a/lib/python3.11/site-packages/torch/_ops.py:692: in __call__
    return self._op(*args, **kwargs or {})
E   RuntimeError: decode_jpeg: torchvision not compiled with libjpeg support
trz42 commented 1 month ago

@casparvl the torchvision issue should be fixed with an updated easyblock. Either use that or use EB v4.9.2 which should come with the fix. See #603

casparvl commented 1 month ago

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3

eessi-bot[bot] commented 1 month ago
Updates by the bot instance eessi-bot-mc-aws
- received bot command `build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3` from `casparvl`
- expanded format: `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3`
- handling command `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3` resulted in:
- submitted job `15896`, for details & status see https://github.com/EESSI/software-layer/pull/585#issuecomment-2275846639
eessi-bot[bot] commented 1 month ago
Updates by the bot instance eessi-bot-mc-azure
- received bot command `build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3` from `casparvl`
- expanded format: `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3`
- handling command `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3` resulted in:
- no jobs were submitted
eessi-bot[bot] commented 1 month ago
New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen3 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_585/15896 date job status comment
Aug 08 13:33:12 UTC 2024 submitted job id 15896 awaits release by job manager
Aug 08 13:33:51 UTC 2024 released job awaits launch by Slurm scheduler
Aug 08 13:39:54 UTC 2024 running job 15896 is running
Aug 08 15:09:58 UTC 2024 finished
:grin: SUCCESS (click triangle for details)
Details
:white_check_mark: job output file slurm-15896.out
:white_check_mark: no message matching ERROR:
:white_check_mark: no message matching FAILED:
:white_check_mark: no message matching required modules missing:
:white_check_mark: found message(s) matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen3-1723128859.tar.gz
size: 154 MiB (162331850 bytes)
entries: 6301
modules under 2023.06/software/linux/x86_64/amd/zen3/modules/all
gperftools/2.12-GCCcore-12.3.0.lua
imageio/2.33.1-gfbf-2023a.lua
libmad/0.15.1b-GCCcore-12.3.0.lua
NLTK/3.8.1-foss-2023a.lua
parameterized/0.9.0-GCCcore-12.3.0.lua
PyTorch-bundle/2.1.2-foss-2023a.lua
Scalene/1.5.26-GCCcore-12.3.0.lua
scikit-image/0.22.0-foss-2023a.lua
SentencePiece/0.2.0-GCC-12.3.0.lua
SoX/14.4.2-GCCcore-12.3.0.lua
tensorboard/2.15.1-gfbf-2023a.lua
tqdm/4.66.1-GCCcore-12.3.0.lua
software under 2023.06/software/linux/x86_64/amd/zen3/software
gperftools/2.12-GCCcore-12.3.0
imageio/2.33.1-gfbf-2023a
libmad/0.15.1b-GCCcore-12.3.0
NLTK/3.8.1-foss-2023a
parameterized/0.9.0-GCCcore-12.3.0
PyTorch-bundle/2.1.2-foss-2023a
Scalene/1.5.26-GCCcore-12.3.0
scikit-image/0.22.0-foss-2023a
SentencePiece/0.2.0-GCC-12.3.0
SoX/14.4.2-GCCcore-12.3.0
tensorboard/2.15.1-gfbf-2023a
tqdm/4.66.1-GCCcore-12.3.0
other under 2023.06/software/linux/x86_64/amd/zen3
2023.06/init/eessi_archdetect.sh
Aug 08 15:09:58 UTC 2024 test result
:grin: SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 17/17 test case(s) from 17 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-15896.out
:white_check_mark: no message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case
casparvl commented 1 month ago

bot: build arch:aarch64/generic repo:eessi.io-2023.06-software

eessi-bot[bot] commented 1 month ago
Updates by the bot instance eessi-bot-mc-aws
- received bot command `build arch:aarch64/generic repo:eessi.io-2023.06-software` from `casparvl`
- expanded format: `build architecture:aarch64/generic repository:eessi.io-2023.06-software`
eessi-build-deploy-bot-deucalion[bot] commented 1 month ago
Updates by the bot instance boegel-bot-deucalion
- account `casparvl` has NO permission to send commands to the bot
eessi-bot[bot] commented 1 month ago
Updates by the bot instance eessi-bot-mc-azure
- received bot command `build arch:aarch64/generic repo:eessi.io-2023.06-software` from `casparvl`
- expanded format: `build architecture:aarch64/generic repository:eessi.io-2023.06-software`
- handling command `build architecture:aarch64/generic repository:eessi.io-2023.06-software` resulted in:
- no jobs were submitted
eessi-bot[bot] commented 1 month ago

error: patch failed: easystacks/software.eessi.io/2023.06/eessi-2023.06-eb-4.9.2-2023a.yml:41
error: easystacks/software.eessi.io/2023.06/eessi-2023.06-eb-4.9.2-2023a.yml: patch does not apply
Unable to download or merge changes between the source branch and the destination branch.
Tip: This can usually be resolved by syncing your branch and resolving any merge conflicts.

casparvl commented 1 month ago

bot: build arch:aarch64/generic repo:eessi.io-2023.06-software

eessi-build-deploy-bot-deucalion[bot] commented 1 month ago
Updates by the bot instance boegel-bot-deucalion
- account `casparvl` has NO permission to send commands to the bot
eessi-bot[bot] commented 1 month ago
Updates by the bot instance eessi-bot-mc-aws
- received bot command `build arch:aarch64/generic repo:eessi.io-2023.06-software` from `casparvl`
- expanded format: `build architecture:aarch64/generic repository:eessi.io-2023.06-software`
- handling command `build architecture:aarch64/generic repository:eessi.io-2023.06-software` resulted in:
- submitted job `16164`, for details & status see https://github.com/EESSI/software-layer/pull/585#issuecomment-2285856053
eessi-bot[bot] commented 1 month ago
Updates by the bot instance eessi-bot-mc-azure
- received bot command `build arch:aarch64/generic repo:eessi.io-2023.06-software` from `casparvl`
- expanded format: `build architecture:aarch64/generic repository:eessi.io-2023.06-software`
- handling command `build architecture:aarch64/generic repository:eessi.io-2023.06-software` resulted in:
- no jobs were submitted
eessi-bot[bot] commented 1 month ago
New job on instance eessi-bot-mc-aws for architecture aarch64-generic for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_585/16164

| date | job status | comment |
| --- | --- | --- |
| Aug 13 09:58:07 UTC 2024 | submitted | job id 16164 awaits release by job manager |
| Aug 13 09:58:41 UTC 2024 | released | job awaits launch by Slurm scheduler |
| Aug 13 10:03:44 UTC 2024 | running | job 16164 is running |
| Aug 13 10:56:58 UTC 2024 | finished | :cry: FAILURE (click triangle for details) |

Details
:white_check_mark: job output file slurm-16164.out
:x: found message matching ERROR:
:x: found message matching FAILED:
:x: found message matching required modules missing:
:x: no message matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-aarch64-generic-1723544421.tar.gz
size: 129 MiB (135599158 bytes)
entries: 4669
modules under 2023.06/software/linux/aarch64/generic/modules/all
gperftools/2.12-GCCcore-12.3.0.lua
imageio/2.33.1-gfbf-2023a.lua
NLTK/3.8.1-foss-2023a.lua
parameterized/0.9.0-GCCcore-12.3.0.lua
Scalene/1.5.26-GCCcore-12.3.0.lua
scikit-image/0.22.0-foss-2023a.lua
tensorboard/2.15.1-gfbf-2023a.lua
tqdm/4.66.1-GCCcore-12.3.0.lua
software under 2023.06/software/linux/aarch64/generic/software
gperftools/2.12-GCCcore-12.3.0
imageio/2.33.1-gfbf-2023a
NLTK/3.8.1-foss-2023a
parameterized/0.9.0-GCCcore-12.3.0
Scalene/1.5.26-GCCcore-12.3.0
scikit-image/0.22.0-foss-2023a
tensorboard/2.15.1-gfbf-2023a
tqdm/4.66.1-GCCcore-12.3.0
other under 2023.06/software/linux/aarch64/generic
no other files in tarball
Aug 13 10:56:58 UTC 2024 test result
:grin: SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 17/17 test case(s) from 17 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-16164.out
:x: found message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case
casparvl commented 1 month ago

Yep, reproduced @trz42's issue:

== 2024-08-13 10:17:49,272 easyblock.py:3632 WARNING Sanity check: sanity check command python -c 'import sentencepiece' exited with code 1 (output: Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/SentencePiece/0.2.0-GCC-12.3.0/lib/python3.11/site-packages/sentencepiece/__init__.py", line 10, in <module>
    from . import _sentencepiece
ImportError: /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so.4: cannot allocate memory in static TLS block
)
casparvl commented 1 month ago

Related: https://github.com/gluster/glusterfs/issues/2913 and probably why @trz42 disabled tcmalloc in his patch :)

But, maybe we should dig into a solution without disabling it, as now we get segmentation faults in the tests, which might be related to disabling the use of tcmalloc...

casparvl commented 1 month ago

I think this might put us on track for a solution: maybe we should patch sentencepiece to load libtcmalloc_minimal.so.4 earlier on. I've asked ChatGPT for help:

=======

I get an error from libtcmalloc: 'cannot allocate memory in static TLS block'. Can you help me understand this error? What is static TLS?

The error message `cannot allocate memory in static TLS block` typically occurs when there's an issue with memory allocation related to Thread-Local Storage (TLS) in a program. Let's break down the concepts involved to understand this error better.

Understanding Thread-Local Storage (TLS)

  1. Thread-Local Storage (TLS):

    • TLS is a mechanism that allows threads to have their own copy of variables. Each thread maintains its own separate storage, enabling data to be thread-specific without interference from other threads.
    • This is particularly useful for variables that maintain state across function calls in a thread-safe manner without using global or static variables.
  2. Static TLS vs. Dynamic TLS:

    • Static TLS: This is allocated at program startup. It is used for variables that have been marked with the __thread or thread_local specifier in languages like C and C++. The size of the static TLS block is determined at compile time and allocated in the program’s binary.
    • Dynamic TLS: Allocated at runtime, allowing for flexibility when the exact amount of TLS needed isn't known until the program is running.

The Error Explained

The error `cannot allocate memory in static TLS block` usually occurs in the following scenarios:

  1. Exceeding Static TLS Limit:

    • Every process in a Linux system has a fixed size for its static TLS block. If you declare too many thread-local variables or if the size of these variables is too large, it might exceed the allocated static TLS block size. This results in the inability to allocate more thread-local storage statically.
  2. Library Compatibility Issues:

    • Sometimes, the error may be due to compatibility issues between different libraries and the way they handle TLS. For example, if a library like libtcmalloc (a high-performance memory allocator) is not fully compatible with how another library uses TLS, this can cause the error.
  3. Platform Limitations:

    • Some platforms have more restrictive limits on TLS size. For example, some older versions of glibc (GNU C Library) had limitations that affected how much static TLS could be allocated.

Solutions

  1. Reduce TLS Usage:

    • Review your code to ensure you’re not unnecessarily using TLS. Try to reduce the number of thread-local variables or their size.
  2. Switch to Dynamic TLS:

    • If possible, switch to dynamic TLS allocation for your variables. This can be achieved by changing compiler flags or the way you declare thread-local storage.
  3. Upgrade or Reconfigure Libraries:

    • Check if there are updates or patches available for the libraries you’re using that might fix the compatibility issues. Sometimes, newer versions of libraries have better handling of TLS.
  4. Increase Static TLS Size:

    • Depending on your system and libraries, it might be possible to increase the size of the static TLS block through compiler or linker options.
  5. Use an Alternative Memory Allocator:

    • If the issue is related to libtcmalloc, consider switching to a different memory allocator that is compatible with your setup.
  6. Check Compiler and Linker Options:

    • Ensure that your compiler and linker options are set up to handle the necessary TLS requirements for your application.

By understanding the underlying mechanisms of TLS and its limitations, you can diagnose and address the `cannot allocate memory in static TLS block` error more effectively.

=======

All in all, not a bad answer.

Another nice source: this mentions the root cause. With more information here
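
A minimal sketch of the "load it earlier" idea, assuming the failure comes from `libtcmalloc_minimal.so.4` only being pulled in when `_sentencepiece` is imported, at which point the loader has no static TLS space left for it. The helper name `preload_shared_library` is hypothetical, and this has not been tested as a fix for SentencePiece; it only illustrates loading the library through `ctypes` before the import that fails.

```python
import ctypes
from typing import Optional


def preload_shared_library(name: str) -> Optional[ctypes.CDLL]:
    """Try to load a shared library into the current process right away.

    Returns the ctypes handle on success, or None if the library cannot be
    loaded (for example because it is not on the loader search path).
    """
    try:
        # Once loaded, a later request for the same soname reuses this copy.
        return ctypes.CDLL(name)
    except OSError:
        return None


if __name__ == "__main__":
    # Assumption: this runs as early as possible, before other imports
    # that bring in shared libraries with their own TLS needs.
    handle = preload_shared_library("libtcmalloc_minimal.so.4")
    print("preloaded" if handle is not None else "could not preload")
    # import sentencepiece  # would follow here in a real session
```

Setting `LD_PRELOAD` to the library before starting Python is another way to get early loading without code changes, but note that it also makes tcmalloc the allocator for the whole process, which is a bigger change than the sketch above.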

casparvl commented 1 month ago

This seems to discuss an even more fundamental 'fix' at the glibc level: https://bugzilla.redhat.com/show_bug.cgi?id=1871396 . I'm wondering if any of that ever made it in, and if e.g. a newer compat layer (with newer glibc) would not even have this issue anymore...
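
To compare environments, it may help to print which glibc the running Python interpreter actually uses. The sketch below relies on `os.confstr("CS_GNU_LIBC_VERSION")`, which is available on glibc-based Linux systems; the helper names are hypothetical. Which glibc release, if any, changes the static TLS behaviour is not something this snippet can tell you; that would need to be checked against the glibc release notes.

```python
import os
from typing import Optional, Tuple


def parse_glibc_version(text: str) -> Optional[Tuple[int, int]]:
    """Parse a string such as 'glibc 2.37' into (2, 37); None if malformed."""
    parts = text.split()
    if len(parts) != 2 or parts[0] != "glibc":
        return None
    fields = parts[1].split(".")
    try:
        return int(fields[0]), int(fields[1])
    except (IndexError, ValueError):
        return None


def running_glibc_version() -> Optional[Tuple[int, int]]:
    """Return (major, minor) of the glibc in use, or None if not on glibc."""
    try:
        text = os.confstr("CS_GNU_LIBC_VERSION")
    except (AttributeError, ValueError, OSError):
        # confstr or this name is not available (non-glibc platform).
        return None
    return parse_glibc_version(text) if text else None


if __name__ == "__main__":
    print(running_glibc_version())
```

Recent glibc manuals also document a `glibc.rtld.optional_static_tls` tunable that controls the surplus static TLS reserved at startup; whether raising it helps in this case is untested here.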