EESSI / software-layer

Software layer of the EESSI project
https://eessi.github.io/docs/software_layer
GNU General Public License v2.0

{2023.06}[2023a] PyTorch-Bundle v2.1.2 #585

Open casparvl opened 4 months ago

casparvl commented 4 months ago
15 out of 137 required modules missing:

* parameterized/0.9.0-GCCcore-12.3.0 (parameterized-0.9.0-GCCcore-12.3.0.eb)
* tqdm/4.66.1-GCCcore-12.3.0 (tqdm-4.66.1-GCCcore-12.3.0.eb)
* LLVM/14.0.6-GCCcore-12.3.0-llvmlite (LLVM-14.0.6-GCCcore-12.3.0-llvmlite.eb)
* Scalene/1.5.26-GCCcore-12.3.0 (Scalene-1.5.26-GCCcore-12.3.0.eb)
* gperftools/2.12-GCCcore-12.3.0 (gperftools-2.12-GCCcore-12.3.0.eb)
* SentencePiece/0.2.0-GCC-12.3.0 (SentencePiece-0.2.0-GCC-12.3.0.eb)
* tensorboard/2.15.1-gfbf-2023a (tensorboard-2.15.1-gfbf-2023a.eb)
* imageio/2.33.1-gfbf-2023a (imageio-2.33.1-gfbf-2023a.eb)
* libmad/0.15.1b-GCCcore-12.3.0 (libmad-0.15.1b-GCCcore-12.3.0.eb)
* SoX/14.4.2-GCCcore-12.3.0 (SoX-14.4.2-GCCcore-12.3.0.eb)
* NLTK/3.8.1-foss-2023a (NLTK-3.8.1-foss-2023a.eb)
* numba/0.58.1-foss-2023a (numba-0.58.1-foss-2023a.eb)
* scikit-image/0.22.0-foss-2023a (scikit-image-0.22.0-foss-2023a.eb)
* librosa/0.10.1-foss-2023a (librosa-0.10.1-foss-2023a.eb)
* PyTorch-bundle/2.1.2-foss-2023a (PyTorch-bundle-2.1.2-foss-2023a.eb)
casparvl commented 1 month ago

More resources: https://www.akkadia.org/drepper/tls.pdf and https://gcc.gnu.org/onlinedocs/gcc-3.4.6/gcc/Thread_002dLocal.html#Thread_002dLocal

casparvl commented 1 month ago

bot: build arch:aarch64/generic repo:eessi.io-2023.06-software

eessi-bot[bot] commented 1 month ago
Updates by the bot instance eessi-bot-mc-aws (click for details) - received bot command `build arch:aarch64/generic repo:eessi.io-2023.06-software` from `casparvl` - expanded format: `build architecture:aarch64/generic repository:eessi.io-2023.06-software` - handling command `build architecture:aarch64/generic repository:eessi.io-2023.06-software` resulted in: - submitted job `16776`, for details & status see https://github.com/EESSI/software-layer/pull/585#issuecomment-2302912876
eessi-build-deploy-bot-deucalion[bot] commented 1 month ago
Updates by the bot instance boegel-bot-deucalion (click for details) - account `casparvl` has NO permission to send commands to the bot
eessi-bot[bot] commented 1 month ago
Updates by the bot instance eessi-bot-mc-azure (click for details) - received bot command `build arch:aarch64/generic repo:eessi.io-2023.06-software` from `casparvl` - expanded format: `build architecture:aarch64/generic repository:eessi.io-2023.06-software` - handling command `build architecture:aarch64/generic repository:eessi.io-2023.06-software` resulted in: - no jobs were submitted
eessi-bot[bot] commented 1 month ago
New job on instance eessi-bot-mc-aws for architecture aarch64-generic for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_585/16776

date job status comment
Aug 21 20:00:14 UTC 2024 submitted job id 16776 awaits release by job manager
Aug 21 20:01:13 UTC 2024 released job awaits launch by Slurm scheduler
Aug 21 20:06:16 UTC 2024 running job 16776 is running
Aug 21 21:11:16 UTC 2024 finished
:cry: FAILURE (click triangle for details)
Details
:white_check_mark: job output file slurm-16776.out
:x: found message matching ERROR:
:x: found message matching FAILED:
:x: found message matching required modules missing:
:x: no message matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-aarch64-generic-1724271747.tar.gz
size: 129 MiB (135613003 bytes)
entries: 4670
modules under 2023.06/software/linux/aarch64/generic/modules/all
gperftools/2.12-GCCcore-12.3.0.lua
imageio/2.33.1-gfbf-2023a.lua
NLTK/3.8.1-foss-2023a.lua
parameterized/0.9.0-GCCcore-12.3.0.lua
Scalene/1.5.26-GCCcore-12.3.0.lua
scikit-image/0.22.0-foss-2023a.lua
tensorboard/2.15.1-gfbf-2023a.lua
tqdm/4.66.1-GCCcore-12.3.0.lua
software under 2023.06/software/linux/aarch64/generic/software
gperftools/2.12-GCCcore-12.3.0
imageio/2.33.1-gfbf-2023a
NLTK/3.8.1-foss-2023a
parameterized/0.9.0-GCCcore-12.3.0
Scalene/1.5.26-GCCcore-12.3.0
scikit-image/0.22.0-foss-2023a
tensorboard/2.15.1-gfbf-2023a
tqdm/4.66.1-GCCcore-12.3.0
other under 2023.06/software/linux/aarch64/generic
2023.06/init/easybuild/eb_hooks.py
Aug 21 21:11:16 UTC 2024 test result
:grin: SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 18/18 test case(s) from 18 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-16776.out
:x: found message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case
casparvl commented 1 month ago

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3

eessi-build-deploy-bot-deucalion[bot] commented 1 month ago
Updates by the bot instance boegel-bot-deucalion (click for details) - account `casparvl` has NO permission to send commands to the bot
eessi-bot[bot] commented 1 month ago
Updates by the bot instance eessi-bot-mc-aws (click for details) - received bot command `build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3` from `casparvl` - expanded format: `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3` - handling command `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3` resulted in: - submitted job `16777`, for details & status see https://github.com/EESSI/software-layer/pull/585#issuecomment-2302926234
eessi-bot[bot] commented 1 month ago
Updates by the bot instance eessi-bot-mc-azure (click for details) - received bot command `build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3` from `casparvl` - expanded format: `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3` - handling command `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3` resulted in: - no jobs were submitted
eessi-bot[bot] commented 1 month ago
New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen3 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_585/16777

date job status comment
Aug 21 20:09:04 UTC 2024 submitted job id 16777 awaits release by job manager
Aug 21 20:09:21 UTC 2024 released job awaits launch by Slurm scheduler
Aug 21 20:15:34 UTC 2024 running job 16777 is running
Aug 21 21:48:14 UTC 2024 finished
:grin: SUCCESS (click triangle for details)
Details
:white_check_mark: job output file slurm-16777.out
:white_check_mark: no message matching ERROR:
:white_check_mark: no message matching FAILED:
:white_check_mark: no message matching required modules missing:
:white_check_mark: found message(s) matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen3-1724275877.tar.gz
size: 154 MiB (162369041 bytes)
entries: 6301
modules under 2023.06/software/linux/x86_64/amd/zen3/modules/all
gperftools/2.12-GCCcore-12.3.0.lua
imageio/2.33.1-gfbf-2023a.lua
libmad/0.15.1b-GCCcore-12.3.0.lua
NLTK/3.8.1-foss-2023a.lua
parameterized/0.9.0-GCCcore-12.3.0.lua
PyTorch-bundle/2.1.2-foss-2023a.lua
Scalene/1.5.26-GCCcore-12.3.0.lua
scikit-image/0.22.0-foss-2023a.lua
SentencePiece/0.2.0-GCC-12.3.0.lua
SoX/14.4.2-GCCcore-12.3.0.lua
tensorboard/2.15.1-gfbf-2023a.lua
tqdm/4.66.1-GCCcore-12.3.0.lua
software under 2023.06/software/linux/x86_64/amd/zen3/software
gperftools/2.12-GCCcore-12.3.0
imageio/2.33.1-gfbf-2023a
libmad/0.15.1b-GCCcore-12.3.0
NLTK/3.8.1-foss-2023a
parameterized/0.9.0-GCCcore-12.3.0
PyTorch-bundle/2.1.2-foss-2023a
Scalene/1.5.26-GCCcore-12.3.0
scikit-image/0.22.0-foss-2023a
SentencePiece/0.2.0-GCC-12.3.0
SoX/14.4.2-GCCcore-12.3.0
tensorboard/2.15.1-gfbf-2023a
tqdm/4.66.1-GCCcore-12.3.0
other under 2023.06/software/linux/x86_64/amd/zen3
2023.06/init/easybuild/eb_hooks.py
Aug 21 21:48:14 UTC 2024 test result
:cry: FAILURE (click triangle for details)
Reason
EESSI test suite produced failures.
ReFrame Summary
[ FAILED ] Ran 20/20 test case(s) from 20 check(s) (2 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-16777.out
:x: found message matching ERROR:
:x: found message matching [\s*FAILED\s*].*Ran .* test case
casparvl commented 1 month ago

Hm, https://github.com/EESSI/software-layer/pull/585#issuecomment-2302912876 still fails with the same static TLS issue. I realize why though: the sanity check is run before generating the module file. It will generate a temporary module file for this step, but since the hook only gets applied when generating the final module file, it doesn't get applied here!

Fix should be relatively simple: make the hook apply earlier, as a pre_sanitycheck_hook.
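A minimal sketch of that idea, assuming EasyBuild's `pre_<step>_hook` naming convention (with `sanitycheck` as the step name) and assuming the preload is passed through the `modextravars` easyconfig parameter. The library path and the stub class are hypothetical, and this is not the actual content of eb_hooks.py:

```python
# Hypothetical path; the real one depends on the installation prefix.
TCMALLOC = "/path/to/gperftools/lib64/libtcmalloc_minimal.so"


def pre_sanitycheck_hook(self, *args, **kwargs):
    """Register LD_PRELOAD before the sanity check step runs,
    so that the temporary module file can pick it up as well."""
    if self.name == "SentencePiece":
        extra = dict(self.cfg["modextravars"] or {})
        extra["LD_PRELOAD"] = TCMALLOC
        self.cfg["modextravars"] = extra


class StubEasyBlock:
    """Stand-in for an EasyBuild easyblock instance (illustration only)."""

    def __init__(self, name):
        self.name = name
        self.cfg = {"modextravars": {}}


block = StubEasyBlock("SentencePiece")
pre_sanitycheck_hook(block)
print(block.cfg["modextravars"])
```

Whether the temporary module generated for the sanity check actually honours a value set this way would need to be verified against the EasyBuild version in use.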

casparvl commented 1 month ago

bot: build arch:aarch64/generic repo:eessi.io-2023.06-software

eessi-bot[bot] commented 1 month ago
Updates by the bot instance eessi-bot-mc-aws (click for details) - received bot command `build arch:aarch64/generic repo:eessi.io-2023.06-software` from `casparvl` - expanded format: `build architecture:aarch64/generic repository:eessi.io-2023.06-software` - handling command `build architecture:aarch64/generic repository:eessi.io-2023.06-software` resulted in: - submitted job `16826`, for details & status see https://github.com/EESSI/software-layer/pull/585#issuecomment-2304777028
eessi-build-deploy-bot-deucalion[bot] commented 1 month ago
Updates by the bot instance boegel-bot-deucalion (click for details) - account `casparvl` has NO permission to send commands to the bot
eessi-bot[bot] commented 1 month ago
Updates by the bot instance eessi-bot-mc-azure (click for details) - received bot command `build arch:aarch64/generic repo:eessi.io-2023.06-software` from `casparvl` - expanded format: `build architecture:aarch64/generic repository:eessi.io-2023.06-software` - handling command `build architecture:aarch64/generic repository:eessi.io-2023.06-software` resulted in: - no jobs were submitted
eessi-bot[bot] commented 1 month ago

New job on instance eessi-bot-mc-aws for architecture aarch64-generic for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_585/16826

date job status comment
Aug 22 14:11:24 UTC 2024 submitted job id 16826 awaits release by job manager
Aug 22 14:11:37 UTC 2024 released job awaits launch by Slurm scheduler
Aug 22 14:17:39 UTC 2024 running job 16826 is running
Aug 22 15:23:13 UTC 2024 finished
:cry: FAILURE (click triangle for details)
Details
:white_check_mark: job output file slurm-16826.out
:x: found message matching ERROR:
:x: found message matching FAILED:
:x: found message matching required modules missing:
:x: no message matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-aarch64-generic-1724337267.tar.gz
size: 129 MiB (135615311 bytes)
entries: 4670
modules under 2023.06/software/linux/aarch64/generic/modules/all
gperftools/2.12-GCCcore-12.3.0.lua
imageio/2.33.1-gfbf-2023a.lua
NLTK/3.8.1-foss-2023a.lua
parameterized/0.9.0-GCCcore-12.3.0.lua
Scalene/1.5.26-GCCcore-12.3.0.lua
scikit-image/0.22.0-foss-2023a.lua
tensorboard/2.15.1-gfbf-2023a.lua
tqdm/4.66.1-GCCcore-12.3.0.lua
software under 2023.06/software/linux/aarch64/generic/software
gperftools/2.12-GCCcore-12.3.0
imageio/2.33.1-gfbf-2023a
NLTK/3.8.1-foss-2023a
parameterized/0.9.0-GCCcore-12.3.0
Scalene/1.5.26-GCCcore-12.3.0
scikit-image/0.22.0-foss-2023a
tensorboard/2.15.1-gfbf-2023a
tqdm/4.66.1-GCCcore-12.3.0
other under 2023.06/software/linux/aarch64/generic
2023.06/init/easybuild/eb_hooks.py
Aug 22 15:23:13 UTC 2024 test result
:grin: SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 18/18 test case(s) from 18 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-16826.out
:x: found message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case
casparvl commented 1 month ago

Failure of the test suite on x86_64 with:

FAILURE INFO for EESSI_PyTorch_torchvision_CPU %nn_model=resnet50 %scale=1_node %parallel_strategy=None %module_name=PyTorch-bundle/2.1.2-foss-2023a (run: 1/1)
  * Description: Benchmark that runs a selected torchvision model on synthetic data
  * System partition: BotBuildTests:default
  * Environment: default
  * Stage directory: /project/60006/SHARED/jobs/2024.08/pr_585/event_33c66470-5ff9-11ef-924c-fc9f4cfa4137/run_000/linux_x86_64_amd_zen3/eessi.io-2023.06-software/reframe_runs/stage/BotBuildTests/default/default/EESSI_PyTorch_torchvision_CPU_39d248a6
  * Node list:
  * Job type: local (id=None)
  * Dependencies (conceptual): []
  * Dependencies (actual): []
  * Maintainers: []
  * Failing phase: setup
  * Rerun with '-n /39d248a6 -p default --system BotBuildTests:default -r'
  * Reason: attribute error: EESSI-test-suite/eessi/testsuite/utils.py:163: Processor information (num_cores_per_numa_node) missing. Check that processor information is either autodetected (see https://reframe-hpc.readthedocs.io/en/stable/configure.html#proc-autodetection), or manually set in the ReFrame configuration file (see https://reframe-hpc.readthedocs.io/en/stable/config_reference.html#processor-info).
    raise AttributeError(msg)

Ok, we didn't define that in our template config file, and the check is particular to newer versions of ReFrame. I'll create one PR that adds a new version of ReFrame and another that autodetects processor features instead of hard-coding them. The challenge is that with the local spawner and a single config file, the config doesn't contain the specific partition we submitted to. But we can get that from the job environment and inject it into the config. I'll do that in https://github.com/EESSI/software-layer/pull/682, and add the new ReFrame in https://github.com/EESSI/software-layer/pull/708
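As a rough sketch of injecting the partition from the job environment, assuming the job runs under Slurm (which sets `SLURM_JOB_PARTITION` inside a job) and assuming a trimmed-down ReFrame-style Python configuration. The function name, the fallback value, and the exact set of keys are illustrative and not taken from the PRs mentioned above:

```python
import os


def make_site_configuration(partition=None):
    """Build a ReFrame-style site configuration for the partition we run in."""
    # Take the partition from the Slurm job environment unless given explicitly.
    partition = partition or os.environ.get("SLURM_JOB_PARTITION", "default")
    return {
        "systems": [
            {
                "name": "BotBuildTests",
                "hostnames": [".*"],
                "partitions": [
                    {
                        "name": partition,
                        "scheduler": "local",
                        "launcher": "mpirun",
                        "environs": ["default"],
                        # No hard-coded 'processor' entry here:
                        # the intent is to leave it to autodetection.
                    }
                ],
            }
        ],
        "environments": [{"name": "default"}],
    }


site_configuration = make_site_configuration()
print(site_configuration["systems"][0]["partitions"][0]["name"])
```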

trz42 commented 1 month ago

Copying some findings from Slack here:

To me it seems the problem is a combination of what EasyBuild uses to run commands (it uses /bin/bash) and that we currently set LD_PRELOAD too early via the modified module file. Below are a few examples illustrating what happens.

The original TLS (Thread-Local Storage) allocation error... (without LD_PRELOAD, just running the import after loading Python, gperftools and setting PATH and PYTHONPATH to the build directory for SentencePiece)

bot@aarch64-generic-node3 /tmp/bot $ python -c 'import sentencepiece'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/SentencePiece/0.2.0-GCC-12.3.0/lib/python3.11/site-packages/sentencepiece/__init__.py", line 10, in <module>
    from . import _sentencepiece
ImportError: /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so.4: cannot allocate memory in static TLS block

With LD_PRELOAD this succeeds (same env otherwise)...

bot@aarch64-generic-node3 /tmp/bot $ LD_PRELOAD=/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so python -c 'import sentencepiece'

However, that is not how EasyBuild runs the sanitycheck command. It rather runs the following (which fails)...

bot@aarch64-generic-node3 /tmp/bot $ LD_PRELOAD=/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so /bin/bash -c "python -c 'import sentencepiece'"
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libstdc++.so.6)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libstdc++.so.6)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.36' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libstdc++.so.6)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libstdc++.so.6)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.35' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libgcc_s.so.1)
/bin/bash: /lib/aarch64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/GCCcore/12.3.0/lib64/libgcc_s.so.1)

The above error is what we got in the last build job for aarch64/generic. If we run the original command in a subshell (as EasyBuild does), we get the original error (just to illustrate that we "correctly" emulate what EasyBuild does)...

bot@aarch64-generic-node3 /tmp/bot $ /bin/bash -c "python -c 'import sentencepiece'"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/SentencePiece/0.2.0-GCC-12.3.0/lib/python3.11/site-packages/sentencepiece/__init__.py", line 10, in <module>
    from . import _sentencepiece
ImportError: /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so.4: cannot allocate memory in static TLS block

If we set LD_PRELOAD just before we run python, it works...

bot@aarch64-generic-node3 /tmp/bot $ /bin/bash -c "LD_PRELOAD=/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so python -c 'import sentencepiece'"

I think setting LD_PRELOAD in the module for SentencePiece could work. However, when running EasyBuild we'll likely run into issues because it uses /bin/bash to run commands. If it used bash from the compat layer, it would work. See the example below:

bot@aarch64-generic-node3 /tmp/bot $ LD_PRELOAD=/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so.4 /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/bin/bash -c "python -c 'import sentencepiece'"

To me it seems that /bin/bash and /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so.4 depend on different symbols (which sounds logical), hence it is critical to only preload the latter library after /bin/bash's dependencies have been resolved.
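The difference between the failing and the working invocation above comes down to where `LD_PRELOAD` is placed relative to the shell. A small sketch that builds both command lines; the helper names are made up for illustration and the library path is a placeholder:

```python
import shlex


def preload_in_env(preload, cmd, shell="/bin/bash"):
    """LD_PRELOAD is set for the shell itself, so the shell must be able
    to load the library too (this is the variant that failed with /bin/bash)."""
    return f"LD_PRELOAD={shlex.quote(preload)} {shell} -c {shlex.quote(cmd)}"


def preload_in_cmd(preload, cmd, shell="/bin/bash"):
    """LD_PRELOAD is set only for the command started inside the shell,
    after the shell's own dependencies have already been resolved."""
    inner = f"LD_PRELOAD={shlex.quote(preload)} {cmd}"
    return f"{shell} -c {shlex.quote(inner)}"


lib = "/path/to/libtcmalloc_minimal.so"  # placeholder path
cmd = "python -c 'import sentencepiece'"
print(preload_in_env(lib, cmd))
print(preload_in_cmd(lib, cmd))
```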

boegel commented 1 month ago

@trz42 Doesn't this mean that EasyBuild should be using the /bin/bash from the compat layer, so prefixed with sysroot in EasyBuild lingo?

trz42 commented 1 month ago

> @trz42 Doesn't this mean that EasyBuild should be using the /bin/bash from the compat layer, so prefixed with sysroot in EasyBuild lingo?

Maybe. If sysroot implies that it can expect a sysroot/bin/bash it could work. However, it has only resulted in a problem when we use LD_PRELOAD. So, maybe we should look for another solution.

I'm trying to solve the issue with a parse hook where I just add LD_PRELOAD=... in front of the failing sanity check command and another hook to add LD_PRELOAD=... in the module file. However, the latter has to be done after the sanity check has been run.

A better fix could be what you suggest, in some cases or always, we prefix the exec_cmd = "/bin/bash" (/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/EasyBuild/4.9.2/lib/python3.11/site-packages/easybuild/tools/run.py:229) with sysroot when it is present. Then we could just add the setting of LD_PRELOAD=... in the module file and it should work both while using the module and while running the sanity check.
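A sketch of what "prefix with sysroot when it is present" could look like. The function name and the fallback behaviour are assumptions for illustration; this is not the actual change made to easybuild/tools/run.py:

```python
import os
import tempfile


def bash_for_sysroot(sysroot=None):
    """Return <sysroot>/bin/bash if it exists and is executable, else /bin/bash."""
    if sysroot:
        candidate = os.path.join(sysroot, "bin", "bash")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return "/bin/bash"


# Demonstrate with a throwaway directory standing in for the compat layer.
with tempfile.TemporaryDirectory() as fake_sysroot:
    os.makedirs(os.path.join(fake_sysroot, "bin"))
    fake_bash = os.path.join(fake_sysroot, "bin", "bash")
    with open(fake_bash, "w") as fh:
        fh.write("#!/bin/sh\n")
    os.chmod(fake_bash, 0o755)
    with_sysroot = bash_for_sysroot(fake_sysroot)

without_sysroot = bash_for_sysroot(None)
print(with_sysroot, without_sysroot)
```

Falling back to `/bin/bash` when the sysroot has no bash keeps the existing behaviour for setups that do not use a sysroot.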

casparvl commented 1 month ago

> @trz42 Doesn't this mean that EasyBuild should be using the /bin/bash from the compat layer, so prefixed with sysroot in EasyBuild lingo?

To me, this makes a lot of sense actually. If you're explicitly invoking a shell to run your command, and if a sysroot is set, it should be the shell from that sysroot prefix imho.

What is the reason that EasyBuild is running this in a subshell, actually? I mean, that is not typically how I would test the module manually, and it could potentially lead to differences with running it in the parent shell (this example being a case in point).

boegel commented 1 month ago

@casparvl All shell commands run by EasyBuild are run in a subshell...

boegel commented 1 month ago

> A better fix could be what you suggest, in some cases or always, we prefix the exec_cmd = "/bin/bash" (/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/EasyBuild/4.9.2/lib/python3.11/site-packages/easybuild/tools/run.py:229) with sysroot when it is present. Then we could just add the setting of LD_PRELOAD=... in the module file and it should work both while using the module and while running the sanity check.

I think that's the right way forward...

It's a relatively easy change to make in EasyBuild (though in some sense a breaking one, so perhaps we need to make it configurable).

trz42 commented 1 month ago

We may even test this change already by copying the bash files from the two compat layers (x86_64 and aarch64) to some directory in the PR and then modify the launch of the containers such that the right file is bind mounted to /bin/bash inside the container. Before we run eessi_container.sh we can set SINGULARITY_BIND.

casparvl commented 1 week ago

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3

eessi-bot[bot] commented 1 week ago
Updates by the bot instance eessi-bot-mc-aws (click for details) - received bot command `build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3` from `casparvl` - expanded format: `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3` - handling command `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3` resulted in: - submitted job `18888`, for details & status see https://github.com/EESSI/software-layer/pull/585#issuecomment-2356919453
eessi-build-deploy-bot-deucalion[bot] commented 1 week ago
Updates by the bot instance boegel-bot-deucalion (click for details) - account `casparvl` has NO permission to send commands to the bot
eessi-bot[bot] commented 1 week ago
Updates by the bot instance eessi-bot-mc-azure (click for details) - received bot command `build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3` from `casparvl` - expanded format: `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3` - handling command `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3` resulted in: - no jobs were submitted
eessi-bot[bot] commented 1 week ago
New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen3 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.09/pr_585/18888

date job status comment
Sep 17 20:57:04 UTC 2024 submitted job id 18888 awaits release by job manager
Sep 17 20:57:37 UTC 2024 released job awaits launch by Slurm scheduler
Sep 17 21:04:40 UTC 2024 running job 18888 is running
Sep 17 22:23:56 UTC 2024 finished
:grin: SUCCESS (click triangle for details)
Details
:white_check_mark: job output file slurm-18888.out
:white_check_mark: no message matching ERROR:
:white_check_mark: no message matching FAILED:
:white_check_mark: no message matching required modules missing:
:white_check_mark: found message(s) matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen3-1726611642.tar.gz
size: 154 MiB (162214378 bytes)
entries: 6200
modules under 2023.06/software/linux/x86_64/amd/zen3/modules/all
gperftools/2.12-GCCcore-12.3.0.lua
imageio/2.33.1-gfbf-2023a.lua
libmad/0.15.1b-GCCcore-12.3.0.lua
NLTK/3.8.1-foss-2023a.lua
parameterized/0.9.0-GCCcore-12.3.0.lua
PyTorch-bundle/2.1.2-foss-2023a.lua
Scalene/1.5.26-GCCcore-12.3.0.lua
scikit-image/0.22.0-foss-2023a.lua
SentencePiece/0.2.0-GCC-12.3.0.lua
SoX/14.4.2-GCCcore-12.3.0.lua
tensorboard/2.15.1-gfbf-2023a.lua
software under 2023.06/software/linux/x86_64/amd/zen3/software
gperftools/2.12-GCCcore-12.3.0
imageio/2.33.1-gfbf-2023a
libmad/0.15.1b-GCCcore-12.3.0
NLTK/3.8.1-foss-2023a
parameterized/0.9.0-GCCcore-12.3.0
PyTorch-bundle/2.1.2-foss-2023a
Scalene/1.5.26-GCCcore-12.3.0
scikit-image/0.22.0-foss-2023a
SentencePiece/0.2.0-GCC-12.3.0
SoX/14.4.2-GCCcore-12.3.0
tensorboard/2.15.1-gfbf-2023a
other under 2023.06/software/linux/x86_64/amd/zen3
2023.06/init/easybuild/eb_hooks.py
Sep 17 22:23:56 UTC 2024 test result
:grin: SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 9/9 test case(s) from 9 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-18888.out
:white_check_mark: no message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case
casparvl commented 1 week ago

Ok, good that the test now works on x86_64.

For this issue on ARM, I made a fix https://github.com/easybuilders/easybuild-framework/pull/4646 for EasyBuild framework, only to realize afterwards that the whole run_cmd thing is completely overhauled in EasyBuild 5.0. Looking at the 5.0.X code here, I see:

    # use bash as shell instead of the default /bin/sh used by subprocess.run
    # (which could be dash instead of bash, like on Ubuntu, see https://wiki.ubuntu.com/DashAsBinSh)
    # stick to None (default value) when not running command via a shell
    if use_bash:
        bash = shutil.which('bash')
        _log.info(f"Path to bash that will be used to run shell commands: {bash}")
        executable, shell = bash, True
    else:
        executable, shell = None, False

I tested a build of SentencePiece, including the LD_PRELOAD hook:

eb --hooks $HOME/EESSI/software-layer/eb_hooks.py SentencePiece-0.2.0-GCC-12.3.0.eb --rebuild

with EasyBuild 5.0.X (from the current branch), and that worked without encountering the previous issue.

In other words, there is not much to fix, we just need to wait for EasyBuild 5.X to be released (soon, I hope :D). Or we need to reinstall 4.9.3 with a patch based on https://github.com/easybuilders/easybuild-framework/pull/4646 so we can proceed here.
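The `shutil.which('bash')` lookup quoted above resolves bash through `PATH`, which is presumably why a compat-layer bash gets picked up when its directory comes first. A small sketch of that lookup order, using a throwaway directory with a fake `bash` as a stand-in for the compat layer (the directory and file are created only for this illustration):

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as compat_bin:
    # Create an executable file named 'bash' in the stand-in directory.
    fake_bash = os.path.join(compat_bin, "bash")
    with open(fake_bash, "w") as fh:
        fh.write("#!/bin/sh\n")
    os.chmod(fake_bash, 0o755)

    # The stand-in directory is listed before the system directories,
    # so shutil.which returns the bash found there.
    search_path = os.pathsep.join([compat_bin, "/bin", "/usr/bin"])
    found = shutil.which("bash", path=search_path)
    print(found)
```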

casparvl commented 1 week ago

Hmm, while the issue for SentencePiece is solved (this now installs successfully), I'm getting

  -- Check for working C compiler: /tmp/eb-cw54zzvr/tmprgti6_vm/rpath_wrappers/gcc_wrapper/gcc - broken
  CMake Error at /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/CMake/3.26.3-GCCcore-12.3.0/share/cmake-3.26/Modules/CMakeTestCCompiler.cmake:67 (message):
    The C compiler

      "/tmp/eb-cw54zzvr/tmprgti6_vm/rpath_wrappers/gcc_wrapper/gcc"

    is not able to compile a simple test program.

    It fails with the following output:

      Change Dir: /tmp/casparvl/easybuild/build/PyTorchbundle/2.1.2/foss-2023a/torchtext/text-0.16.2/build/temp.linux-aarch64-cpython-311/CMakeFiles/CMakeScratch/TryCompile-XrjNFV

      Run Build Command(s):/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Ninja/1.11.1-GCCcore-12.3.0/bin/ninja -v cmTC_64b77 && [1/2] /tmp/eb-cw54zzvr/tmprgti6_vm/rpath_wrappers/gcc_wrapper/gcc   -O2 -ftree-vectorize -mcpu=native -fno-math-errno -o CMakeFiles/cmTC_64b77.dir/testCCompiler.c.o -c /tmp/casparvl/easybuild/build/PyTorchbundle/2.1.2/foss-2023a/torchtext/text-0.16.2/build/temp.linux-aarch64-cpython-311/CMakeFiles/CMakeScratch/TryCompile-XrjNFV/testCCompiler.c
      FAILED: CMakeFiles/cmTC_64b77.dir/testCCompiler.c.o
      /tmp/eb-cw54zzvr/tmprgti6_vm/rpath_wrappers/gcc_wrapper/gcc   -O2 -ftree-vectorize -mcpu=native -fno-math-errno -o CMakeFiles/cmTC_64b77.dir/testCCompiler.c.o -c /tmp/casparvl/easybuild/build/PyTorchbundle/2.1.2/foss-2023a/torchtext/text-0.16.2/build/temp.linux-aarch64-cpython-311/CMakeFiles/CMakeScratch/TryCompile-XrjNFV/testCCompiler.c
      /bin/sh: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /home/casparvl/eessi/versions/2023.06/software/linux/aarch64/neoverse_n1/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so)
      /bin/sh: /lib64/libc.so.6: version `GLIBC_2.33' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/GCCcore/12.3.0/lib64/libstdc++.so.6)
      /bin/sh: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/GCCcore/12.3.0/lib64/libstdc++.so.6)
      /bin/sh: /lib64/libc.so.6: version `GLIBC_2.36' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/GCCcore/12.3.0/lib64/libstdc++.so.6)
      /bin/sh: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/GCCcore/12.3.0/lib64/libstdc++.so.6)
      /bin/sh: /lib64/libc.so.6: version `GLIBC_2.35' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/GCCcore/12.3.0/lib64/libgcc_s.so.1)
      /bin/sh: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/GCCcore/12.3.0/lib64/libgcc_s.so.1)
      ninja: build stopped: subcommand failed.

when it is installing torchtext from PyTorch-Bundle. I think the /bin/sh here comes from the fact that some python process invokes subprocess.run(), which uses /bin/sh according to https://github.com/easybuilders/easybuild-framework/blob/a2550eb8fab479f517badbf45925c3cebda2880c/easybuild/tools/run.py#L450
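For reference, a short sketch of the Python behaviour referred to here: on POSIX, `subprocess.run(..., shell=True)` uses `/bin/sh` unless an `executable` is passed. The commands below only echo the name the shell was invoked as and are not taken from the torchtext build:

```python
import shutil
import subprocess

# With shell=True and no 'executable', Python uses the default POSIX shell.
default = subprocess.run("echo $0", shell=True, capture_output=True, text=True)
print(default.stdout.strip())

# Passing 'executable' replaces that shell, e.g. with the first bash on PATH.
bash = shutil.which("bash")
if bash:
    explicit = subprocess.run(
        "echo $0", shell=True, executable=bash, capture_output=True, text=True
    )
    print(explicit.stdout.strip())
```

Any environment variable such as `LD_PRELOAD` that is set in the parent process is inherited by that shell, which matches the `/bin/sh: ... GLIBC_2.34' not found` messages above.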

The last part of the stack trace I'm getting:

    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
      dist.run_commands()
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
      self.run_command(cmd)
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/dist.py", line 1244, in run_command
      super().run_command(command)
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/wheel/bdist_wheel.py", line 343, in run
      self.run_command("build")
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
      self.distribution.run_command(command)
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/dist.py", line 1244, in run_command
      super().run_command(command)
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/_distutils/command/build.py", line 131, in run
      self.run_command(cmd_name)
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
      self.distribution.run_command(command)
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/dist.py", line 1244, in run_command
      super().run_command(command)
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/tmp/casparvl/easybuild/build/PyTorchbundle/2.1.2/foss-2023a/torchtext/text-0.16.2/tools/setup_helpers/extension.py", line 46, in run
      super().run()
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/command/build_ext.py", line 84, in run
      _build_ext.run(self)
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python-bundle-PyPI/2023.06-GCCcore-12.3.0/lib/python3.11/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
      _build_ext.build_ext.run(self)
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 345, in run
      self.build_extensions()
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python-bundle-PyPI/2023.06-GCCcore-12.3.0/lib/python3.11/site-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
      _build_ext.build_ext.build_extensions(self)
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 467, in build_extensions
      self._build_extensions_serial()
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 493, in _build_extensions_serial
      self.build_extension(ext)
    File "/tmp/casparvl/easybuild/build/PyTorchbundle/2.1.2/foss-2023a/torchtext/text-0.16.2/tools/setup_helpers/extension.py", line 108, in build_extension
      subprocess.check_call(["cmake", str(_ROOT_DIR)] + cmake_args, cwd=self.build_temp)
    File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/subprocess.py", line 413, in check_call
      raise CalledProcessError(retcode, cmd)

That's annoying, to say the least. We can fix it, but it might require a patch to Python to change which sh command subprocess.run uses by default. Alternatively, we could change the subprocess call made by /tmp/casparvl/easybuild/build/PyTorchbundle/2.1.2/foss-2023a/torchtext/text-0.16.2/tools/setup_helpers/extension.py. That has a much smaller impact, but it is also a less complete fix: any other software that uses SentencePiece and calls subprocess.run would still run into this issue.
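For the second option (changing the call site rather than Python itself), a minimal sketch could rely on the `executable` argument of `subprocess.run`, which replaces the default `/bin/sh` when `shell=True`. This is not the actual torchtext code: `run_in_shell` and its `shell` parameter are hypothetical names, and in EESSI the shell path would presumably have to point into the compat layer.

```python
import shutil
import subprocess


def run_in_shell(command, shell=None):
    """Run `command` through an explicitly chosen shell instead of the default /bin/sh.

    `shell` is a hypothetical parameter: the path of the shell binary to use.
    If it is None, fall back to whichever `sh` is found first on $PATH.
    """
    shell = shell or shutil.which("sh")
    # With shell=True, `executable` overrides the hardcoded /bin/sh.
    return subprocess.run(
        command,
        shell=True,
        executable=shell,
        check=True,
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    result = run_in_shell("echo hello")
    print(result.stdout.strip())
```

This only helps for call sites we control, which is why it is the less complete of the two fixes.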

boegel commented 1 week ago

> In other words, there is not much to fix, we just need to wait for EasyBuild 5.X to be released (soon, I hope :D). Or we need to reinstall 4.9.3 with a patch based on easybuilders/easybuild-framework#4646 so we can proceed here.

@casparvl There's an EasyBuild v4.9.4 release coming really soon (in the next couple of days), because the GCC easyblock in EasyBuild v4.9.3 has a serious bug that many people will easily run into (see here), so it's worth trying to get https://github.com/easybuilders/easybuild-framework/pull/4646 merged ASAP.

boegel commented 1 week ago

> That's annoying to say the least. We can fix it, but it might require a patch to Python to alter which sh command is used by default by subprocess.run. Alternatively, we change the subprocess call done by /tmp/casparvl/easybuild/build/PyTorchbundle/2.1.2/foss-2023a/torchtext/text-0.16.2/tools/setup_helpers/extension.py. That's much smaller impact, but also a less complete fix. It means that any other software using SentencePiece and calling subprocess.run will still run into this issue.

@casparvl A patch to Python seems like the best way forward here. We should check what Gentoo Prefix does here, since they must have run into similar issues with a hardcoded /bin/sh?

casparvl commented 4 days ago

From the sources, it seems to be equally broken in Gentoo Prefix:

$ cat /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/usr/lib/python3.11/subprocess.py | grep -A 5 "/bin/sh"
    >>> check_output(["/bin/sh", "-c",
    ...               "ls -l non_existent_file ; exit 0"],
    ...              stderr=STDOUT)
    b'ls: non_existent_file: No such file or directory\n'

    There is an additional optional argument, "input", allowing you to
--
                # On Android the default shell is at '/system/bin/sh'.
                unix_shell = ('/system/bin/sh' if
                          hasattr(sys, 'getandroidapilevel') else '/bin/sh')
                args = [unix_shell, "-c"] + args
                if executable:
                    args[0] = executable

            if executable is None:

casparvl commented 4 days ago

I confirmed that if I run subprocess.run("sleep 5", shell=True) with the Python from the compat layer, it uses /bin/sh to execute the command. So yes, it's just as broken in the Python in Gentoo Prefix.
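A quick way to repeat this check with any Python is to let the shell report its own name. This is only a sketch: `default_shell` is a hypothetical helper, and it relies on `$0` expanding to the name the shell was invoked with when `sh -c` gets no extra arguments. On an unpatched CPython on Linux it should report /bin/sh.

```python
import subprocess


def default_shell():
    """Return the shell that subprocess uses for shell=True, as reported by the shell itself."""
    # In `sh -c '...'` without extra arguments, $0 is the name the shell was invoked with.
    result = subprocess.run(
        "echo $0", shell=True, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(default_shell())
```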

The fix should be very simple: prepend the sysroot to the path on this line in the source code. I guess this could (and should) be done at the EasyBlock level. I'll look at that later...
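As a rough illustration of what such a patch would do to the source text of subprocess.py, assuming the sysroot is the compat layer prefix seen in the grep above. `patch_default_shell` and `SYSROOT` are hypothetical names, and this is not how the easyblock would necessarily implement it:

```python
import re

# Assumption: sysroot of the EESSI compat layer, taken from the path used in the grep above.
SYSROOT = "/cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64"


def patch_default_shell(source, sysroot):
    """Prefix the hardcoded '/bin/sh' fallback in subprocess.py source text with `sysroot`."""
    patched_shell = sysroot.rstrip("/") + "/bin/sh"
    # Only touch the fallback in the `unix_shell = (... else '/bin/sh')` expression,
    # not the docstring example or the Android path.
    return re.sub(r"else '/bin/sh'\)", lambda m: f"else '{patched_shell}')", source)


if __name__ == "__main__":
    snippet = (
        "unix_shell = ('/system/bin/sh' if\n"
        "          hasattr(sys, 'getandroidapilevel') else '/bin/sh')\n"
    )
    print(patch_default_shell(snippet, SYSROOT))
```

The docstring example and the Android branch are left alone on purpose, so only the default that subprocess actually falls back to changes.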