Open trz42 opened 4 months ago
Instance eessi-bot-mc-aws is configured to build:
- x86_64/generic for repo eessi-hpc.org-2023.06-compat
- x86_64/generic for repo eessi-hpc.org-2023.06-software
- x86_64/generic for repo eessi.io-2023.06-compat
- x86_64/generic for repo eessi.io-2023.06-software
- x86_64/intel/haswell for repo eessi-hpc.org-2023.06-compat
- x86_64/intel/haswell for repo eessi-hpc.org-2023.06-software
- x86_64/intel/haswell for repo eessi.io-2023.06-compat
- x86_64/intel/haswell for repo eessi.io-2023.06-software
- x86_64/intel/skylake_avx512 for repo eessi-hpc.org-2023.06-compat
- x86_64/intel/skylake_avx512 for repo eessi-hpc.org-2023.06-software
- x86_64/intel/skylake_avx512 for repo eessi.io-2023.06-compat
- x86_64/intel/skylake_avx512 for repo eessi.io-2023.06-software
- x86_64/amd/zen2 for repo eessi-hpc.org-2023.06-compat
- x86_64/amd/zen2 for repo eessi-hpc.org-2023.06-software
- x86_64/amd/zen2 for repo eessi.io-2023.06-compat
- x86_64/amd/zen2 for repo eessi.io-2023.06-software
- x86_64/amd/zen3 for repo eessi-hpc.org-2023.06-compat
- x86_64/amd/zen3 for repo eessi-hpc.org-2023.06-software
- x86_64/amd/zen3 for repo eessi.io-2023.06-compat
- x86_64/amd/zen3 for repo eessi.io-2023.06-software
- aarch64/generic for repo eessi-hpc.org-2023.06-compat
- aarch64/generic for repo eessi-hpc.org-2023.06-software
- aarch64/generic for repo eessi.io-2023.06-compat
- aarch64/generic for repo eessi.io-2023.06-software
- aarch64/neoverse_n1 for repo eessi-hpc.org-2023.06-compat
- aarch64/neoverse_n1 for repo eessi-hpc.org-2023.06-software
- aarch64/neoverse_n1 for repo eessi.io-2023.06-compat
- aarch64/neoverse_n1 for repo eessi.io-2023.06-software
- aarch64/neoverse_v1 for repo eessi-hpc.org-2023.06-compat
- aarch64/neoverse_v1 for repo eessi-hpc.org-2023.06-software
- aarch64/neoverse_v1 for repo eessi.io-2023.06-compat
- aarch64/neoverse_v1 for repo eessi.io-2023.06-software

Instance eessi-bot-mc-azure is configured to build:
- x86_64/amd/zen4 for repo eessi-hpc.org-2023.06-compat
- x86_64/amd/zen4 for repo eessi-hpc.org-2023.06-software
- x86_64/amd/zen4 for repo eessi.io-2023.06-compat
- x86_64/amd/zen4 for repo eessi.io-2023.06-software

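The build commands in this thread narrow the configured targets with `inst:`, `repo:` and `arch:` filters. As a rough illustration only (this is not the bot's actual code), the selection could be sketched as follows. The function names `parse_bot_command` and `matches` are hypothetical, and substring matching of each filter value against the target is an assumption.

```python
def parse_bot_command(line):
    """Split a 'bot: <action> key:value ...' line into (action, filters)."""
    prefix, _, rest = line.partition(":")
    if prefix.strip() != "bot":
        raise ValueError("not a bot command")
    tokens = rest.split()
    if not tokens:
        raise ValueError("missing action")
    action, filters = tokens[0], {}
    for token in tokens[1:]:
        key, sep, value = token.partition(":")
        if not sep or not value:
            raise ValueError(f"malformed filter: {token!r}")
        filters[key] = value
    return action, filters


def matches(filters, instance, arch, repo):
    """Assumed rule: every filter value must be a substring of its target field."""
    target = {"inst": instance, "arch": arch, "repo": repo}
    return all(value in target.get(key, "") for key, value in filters.items())


if __name__ == "__main__":
    cmd = "bot: build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2"
    action, filters = parse_bot_command(cmd)
    print(action, filters)
    print(matches(filters, "eessi-bot-mc-aws", "x86_64/amd/zen2",
                  "eessi.io-2023.06-software"))
```

Under these assumptions, only the `eessi-bot-mc-aws` target for `x86_64/amd/zen2` and `eessi.io-2023.06-software` would be selected, which is consistent with the single job reported for each command in this thread.
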
bot: build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2
New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_581/10940

| date | job status | comment |
|---|---|---|
| May 17 09:26:27 UTC 2024 | submitted | job id 10940 awaits release by job manager |
| May 17 09:27:22 UTC 2024 | released | job awaits launch by Slurm scheduler |
| May 17 09:32:24 UTC 2024 | running | job 10940 is running |
| May 17 09:40:32 UTC 2024 | finished | :cry: FAILURE |
| May 17 09:40:32 UTC 2024 | test result | :grin: SUCCESS |

Retry after fixing args to cuDNN install script...
bot: build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2
New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_581/10941

| date | job status | comment |
|---|---|---|
| May 17 10:45:01 UTC 2024 | submitted | job id 10941 awaits release by job manager |
| May 17 10:45:40 UTC 2024 | released | job awaits launch by Slurm scheduler |
| May 17 10:49:42 UTC 2024 | running | job 10941 is running |
| May 17 10:59:52 UTC 2024 | finished | :grin: SUCCESS |
| May 17 10:59:52 UTC 2024 | test result | :grin: SUCCESS |

@trz42 The installation looks suspiciously large at 700MB, are you sure your hook is cleaning out the files it should?
Full package is 1.4 GB.
Rebuild after changing hook function that handles dependencies and creates modluafooter entries...
bot: build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2
New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_581/10942

| date | job status | comment |
|---|---|---|
| May 17 12:54:38 UTC 2024 | submitted | job id 10942 awaits release by job manager |
| May 17 12:55:03 UTC 2024 | released | job awaits launch by Slurm scheduler |
| May 17 13:00:06 UTC 2024 | running | job 10942 is running |
| May 17 13:05:11 UTC 2024 | finished | :cry: FAILURE |
| May 17 13:05:11 UTC 2024 | test result | :grin: SUCCESS |

One more time...
bot: build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2
New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_581/10943

| date | job status | comment |
|---|---|---|
| May 17 13:14:32 UTC 2024 | submitted | job id 10943 awaits release by job manager |
| May 17 13:15:15 UTC 2024 | released | job awaits launch by Slurm scheduler |
| May 17 13:16:17 UTC 2024 | running | job 10943 is running |
| May 17 13:24:26 UTC 2024 | finished | :grin: SUCCESS |
| May 17 13:24:26 UTC 2024 | test result | :grin: SUCCESS |

@trz42 I will take your updated `host_injections` script for a test drive tomorrow. I think I have a few suggestions there and will open a PR to your branch.
I also get the feeling that if we are going to move to easystack files (a good idea) then we should probably ship the ones we expect people to use
Just updated the script with some improvements/fixes after my own testing.
Run another build after several changes...
bot: build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2
New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_581/11284

| date | job status | comment |
|---|---|---|
| May 23 09:28:36 UTC 2024 | submitted | job id 11284 awaits release by job manager |
| May 23 09:29:06 UTC 2024 | released | job awaits launch by Slurm scheduler |
| May 23 09:30:09 UTC 2024 | running | job 11284 is running |
| May 23 09:42:29 UTC 2024 | finished | :grin: SUCCESS |
| May 23 09:42:29 UTC 2024 | test result | :grin: SUCCESS |

requires:
720
Attempt to add cuDNN which is a dependency of other packages such as TensorFlow and PyTorch.
Major additions/changes:
- `scripts/gpu_support/nvidia/install_cuda_and_libraries.sh` with `scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml`: `CUDA` and `cuDNN` packages under `.../host_injections`
- `EESSI-install-software.sh`: `scripts/gpu_support/nvidia/install_cuda_and_libraries.sh` with `scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml` to install `CUDA`, `cuDNN` under `.../host_injections`
- `eb_hooks.py`:
  - `host_injections` with a common function (`replace_non_distributable_files_with_symlinks`)
  - `post_sanitycheck_hook`, which replaces files with symlinks into corresponding paths under `.../host_injections` for all files that cannot be redistributed
  - `cuDNN` to a build dependency (see `inject_gpu_property`)
- `create_lmodsitepackage.py`: `eessi_{cuda,cudnn}_enabled_load_hook` functions in a single one (`eessi_cuda_and_libraries_enabled_load_hook`)
- `install_scripts.sh`: (`nvidia_files`)
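
To make the symlink step described for `post_sanitycheck_hook` more concrete, here is a minimal sketch of the general idea: files that may not be redistributed are replaced by symlinks pointing at the same relative path under a host-side directory. This is not the code from `eb_hooks.py`. The function name `replace_with_symlinks`, its parameters, and the `is_distributable` predicate are hypothetical, and how the real hook decides which files may be shipped is not shown in this thread.

```python
import os


def replace_with_symlinks(install_dir, host_injections_dir, is_distributable):
    """Replace non-distributable files under install_dir with symlinks.

    Each replaced file becomes a symlink to the same relative path under
    host_injections_dir (hypothetical layout, assumed for illustration).
    Returns the sorted relative paths that were replaced.
    """
    replaced = []
    for root, _dirs, files in os.walk(install_dir):
        for name in files:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, install_dir)
            # Leave existing symlinks and redistributable files untouched.
            if os.path.islink(path) or is_distributable(rel):
                continue
            target = os.path.join(host_injections_dir, rel)
            os.remove(path)
            # The target may not exist yet; a dangling symlink is acceptable
            # here because the host-side install is expected to provide it.
            os.symlink(target, path)
            replaced.append(rel)
    return sorted(replaced)
```

A caller would pass a predicate such as `lambda rel: rel.endswith(".txt")` to keep licence texts in place while everything else is redirected. Files already replaced are skipped on a second run, so the operation is safe to repeat.
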