EESSI / software-layer

Software layer of the EESSI project
https://eessi.github.io/docs/software_layer
GNU General Public License v2.0

{2023.06}[system] cuDNN/8.9.2.26-CUDA-12.1.1 #581

Open trz42 opened 4 months ago

trz42 commented 4 months ago

requires:

Attempt to add cuDNN which is a dependency of other packages such as TensorFlow and PyTorch.

Major additions/changes:

eessi-bot[bot] commented 4 months ago

Instance eessi-bot-mc-aws is configured to build:

eessi-bot[bot] commented 4 months ago

Instance eessi-bot-mc-azure is configured to build:

trz42 commented 4 months ago

bot: build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2

eessi-bot[bot] commented 4 months ago
Updates by the bot instance eessi-bot-mc-aws (click for details)
- received bot command `build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2` from `trz42`
  - expanded format: `build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2`
- handling command `build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2` resulted in:
  - submitted job `10940`, for details & status see https://github.com/EESSI/software-layer/pull/581#issuecomment-2117129261

eessi-bot[bot] commented 4 months ago

Updates by the bot instance eessi-bot-mc-azure (click for details)
- received bot command `build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2` from `trz42`
  - expanded format: `build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2`
- handling command `build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2` resulted in:
  - no jobs were submitted
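
The expansion the bot reports above (`inst` to `instance`, `repo` to `repository`, `arch` to `architecture`) can be sketched in a few lines of Python. This is an illustration only, not the eessi-bot's code: the function name `expand_bot_command` and the abbreviation table are assumptions inferred from the two command forms shown in this thread.

```python
# Hypothetical sketch: expand the short filter keys of a `bot: build ...` command.
# The mapping is inferred only from the commands quoted in this thread.
ABBREVIATIONS = {"inst": "instance", "repo": "repository", "arch": "architecture"}


def expand_bot_command(command):
    """Return the command with abbreviated filter keys written out in full."""
    action, *filters = command.split()
    expanded = [action]
    for item in filters:
        # Split only at the first colon so values keep any later punctuation.
        key, _, value = item.partition(":")
        expanded.append(f"{ABBREVIATIONS.get(key, key)}:{value}")
    return " ".join(expanded)


short = "build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2"
expanded = expand_bot_command(short)
print(expanded)
```

Both bot instances receive and expand the same command; only eessi-bot-mc-aws submits a job, presumably because the `instance:aws` filter does not match the Azure instance.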
eessi-bot[bot] commented 4 months ago
New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_581/10940

| date | job status | comment |
| --- | --- | --- |
| May 17 09:26:27 UTC 2024 | submitted | job id 10940 awaits release by job manager |
| May 17 09:27:22 UTC 2024 | released | job awaits launch by Slurm scheduler |
| May 17 09:32:24 UTC 2024 | running | job 10940 is running |

May 17 09:40:32 UTC 2024 finished
:cry: FAILURE (click triangle for details)
Details
:white_check_mark: job output file slurm-10940.out
:x: found message matching ERROR:
:white_check_mark: no message matching FAILED:
:white_check_mark: no message matching required modules missing:
:white_check_mark: found message(s) matching No missing installations
:white_check_mark: found message matching .tar.gz created!
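
The checklist above amounts to a set of present/absent tests on the job output file. A minimal sketch of that idea follows, assuming plain substring matching; the bot's real check script is not shown in this thread, and the names `CHECKS`, `check_job_output`, and `is_success` as well as the sample log text are hypothetical.

```python
# Patterns copied from the report above. True means the text must be present
# for a successful build; False means it must be absent.
CHECKS = [
    ("ERROR:", False),
    ("FAILED:", False),
    ("required modules missing:", False),
    ("No missing installations", True),
    (".tar.gz created!", True),
]


def check_job_output(text):
    """Map each pattern to True if the log meets the expectation for it."""
    return {pattern: (pattern in text) == expected for pattern, expected in CHECKS}


def is_success(text):
    """A build counts as successful only if every single check passes."""
    return all(check_job_output(text).values())


# Made-up log excerpts, for illustration only.
log_with_error = "ERROR: something went wrong\nNo missing installations\n/tmp/demo.tar.gz created!\n"
log_clean = "No missing installations\n/tmp/demo.tar.gz created!\n"

print(is_success(log_with_error))
print(is_success(log_clean))
```

Under this reading, a single `ERROR:` line is enough to mark the job as a failure even when the tarball was created, which matches the report for job 10940.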
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1715938433.tar.gz
size: 698 MiB (732495131 bytes)
entries: 74
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
cuDNN/8.9.2.26-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/software
cuDNN/8.9.2.26-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/install_cudnn_host_injections.sh
.lmod/SitePackage.lua
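
The artefact line packs several facts into the tarball name and size. The sketch below pulls them apart, under two assumptions that the thread does not state: the trailing number in the file name is a Unix timestamp, and the MiB figure is the byte count divided by 2^20 and rounded down. The helper `describe_tarball` is a hypothetical name.

```python
import re
from datetime import datetime, timezone

# Assumed naming scheme, inferred from the tarball names in this thread.
TARBALL_RE = re.compile(
    r"^eessi-(?P<version>[\d.]+)-software-(?P<os>[a-z]+)-"
    r"(?P<arch>.+)-(?P<epoch>\d+)\.tar\.gz$"
)


def describe_tarball(name, size_bytes):
    """Split a tarball name into its parts and convert the size to whole MiB."""
    match = TARBALL_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected tarball name: {name}")
    built = datetime.fromtimestamp(int(match["epoch"]), tz=timezone.utc)
    return {
        "version": match["version"],
        "os": match["os"],
        "arch": match["arch"],
        "built_utc": built.strftime("%Y-%m-%d %H:%M:%S"),
        "size_mib": size_bytes // 2**20,
    }


info = describe_tarball(
    "eessi-2023.06-software-linux-x86_64-amd-zen2-1715938433.tar.gz", 732495131
)
print(info)
```

If the timestamp assumption holds, the name decodes to 2024-05-17 09:33:53 UTC, which falls between the "running" and "finished" entries of job 10940.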
May 17 09:40:32 UTC 2024 test result
:grin: SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-10940.out
:x: found message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case
trz42 commented 4 months ago

Retry after fixing args to cuDNN install script...

bot: build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2

eessi-bot[bot] commented 4 months ago
Updates by the bot instance eessi-bot-mc-aws (click for details)
- received bot command `build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2` from `trz42`
  - expanded format: `build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2`
- handling command `build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2` resulted in:
  - submitted job `10941`, for details & status see https://github.com/EESSI/software-layer/pull/581#issuecomment-2117292658

eessi-bot[bot] commented 4 months ago

Updates by the bot instance eessi-bot-mc-azure (click for details)
- received bot command `build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2` from `trz42`
  - expanded format: `build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2`
- handling command `build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2` resulted in:
  - no jobs were submitted
eessi-bot[bot] commented 4 months ago
New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_581/10941

| date | job status | comment |
| --- | --- | --- |
| May 17 10:45:01 UTC 2024 | submitted | job id 10941 awaits release by job manager |
| May 17 10:45:40 UTC 2024 | released | job awaits launch by Slurm scheduler |
| May 17 10:49:42 UTC 2024 | running | job 10941 is running |

May 17 10:59:52 UTC 2024 finished
:grin: SUCCESS (click triangle for details)
Details
:white_check_mark: job output file slurm-10941.out
:white_check_mark: no message matching ERROR:
:white_check_mark: no message matching FAILED:
:white_check_mark: no message matching required modules missing:
:white_check_mark: found message(s) matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1715943174.tar.gz
size: 698 MiB (732493432 bytes)
entries: 74
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
cuDNN/8.9.2.26-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/software
cuDNN/8.9.2.26-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/install_cudnn_host_injections.sh
.lmod/SitePackage.lua
May 17 10:59:52 UTC 2024 test result
:grin: SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-10941.out
:white_check_mark: no message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case
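
The ReFrame summary line in the test result above has a regular shape that is easy to turn into numbers. The following sketch parses exactly the line format quoted in this thread; other ReFrame versions may print a different summary, and the names `SUMMARY_RE` and `parse_reframe_summary` are hypothetical.

```python
import re

# Pattern written against the summary line quoted in this thread.
SUMMARY_RE = re.compile(
    r"\[\s*(?P<status>PASSED|FAILED)\s*\]\s+"
    r"Ran (?P<ran>\d+)/(?P<total>\d+) test case\(s\) "
    r"from (?P<checks>\d+) check\(s\) "
    r"\((?P<failures>\d+) failure\(s\), (?P<skipped>\d+) skipped, (?P<aborted>\d+) aborted\)"
)


def parse_reframe_summary(line):
    """Return the status and counters from a ReFrame summary line, or None."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    status = fields.pop("status")
    return {"status": status, **{key: int(value) for key, value in fields.items()}}


line = "[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)"
summary = parse_reframe_summary(line)
print(summary)
```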
ocaisa commented 4 months ago

@trz42 The installation looks suspiciously large at 700MB, are you sure your hook is cleaning out the files it should?

trz42 commented 4 months ago

> @trz42 The installation looks suspiciously large at 700MB, are you sure your hook is cleaning out the files it should?

Full package is 1.4 GB.
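
One generic way to settle a size question like this is to see which top-level subdirectories of an installation account for the bytes. The sketch below is plain Python and is not part of this PR or of the EESSI hooks; `tree_size` and `size_by_subdir` are hypothetical helpers, and the demo directory and its file sizes are made up.

```python
import os
import tempfile


def tree_size(path):
    """Total bytes below path; symlinks count as links, not as their targets."""
    if os.path.islink(path) or not os.path.isdir(path):
        return os.lstat(path).st_size
    return sum(tree_size(os.path.join(path, name)) for name in os.listdir(path))


def size_by_subdir(root):
    """Map each top-level entry of root to the bytes stored below it."""
    return {name: tree_size(os.path.join(root, name)) for name in sorted(os.listdir(root))}


# Demo on a throwaway directory (file names and sizes are invented).
with tempfile.TemporaryDirectory() as root:
    for rel_path, size in [("lib/libdemo.so", 2048), ("include/demo.h", 10)]:
        full = os.path.join(root, rel_path)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "wb") as handle:
            handle.write(b"\0" * size)
    sizes = size_by_subdir(root)

print(sizes)
```

Because symlinks are measured as links, files that had been replaced by links to another location would show up as a few bytes each rather than at their full size.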

trz42 commented 4 months ago

Rebuild after changing hook function that handles dependencies and creates modluafooter entries...

bot: build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2

eessi-bot[bot] commented 4 months ago
Updates by the bot instance eessi-bot-mc-aws (click for details)
- received bot command `build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2` from `trz42`
  - expanded format: `build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2`
- handling command `build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2` resulted in:
  - submitted job `10942`, for details & status see https://github.com/EESSI/software-layer/pull/581#issuecomment-2117540885

eessi-bot[bot] commented 4 months ago

Updates by the bot instance eessi-bot-mc-azure (click for details)
- received bot command `build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2` from `trz42`
  - expanded format: `build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2`
- handling command `build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2` resulted in:
  - no jobs were submitted
eessi-bot[bot] commented 4 months ago
New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_581/10942

| date | job status | comment |
| --- | --- | --- |
| May 17 12:54:38 UTC 2024 | submitted | job id 10942 awaits release by job manager |
| May 17 12:55:03 UTC 2024 | released | job awaits launch by Slurm scheduler |
| May 17 13:00:06 UTC 2024 | running | job 10942 is running |

May 17 13:05:11 UTC 2024 finished
:cry: FAILURE (click triangle for details)
Details
:white_check_mark: job output file slurm-10942.out
:x: found message matching ERROR:
:white_check_mark: no message matching FAILED:
:white_check_mark: no message matching required modules missing:
:x: no message matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1715950816.tar.gz
size: 0 MiB (15041 bytes)
entries: 3
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen2/software
no software packages in tarball
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/install_cudnn_host_injections.sh
.lmod/SitePackage.lua
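
The grouping in the artefact report (modules, software, other) can be reproduced by listing a tarball and sorting its entries by path prefix. The sketch below builds a tiny in-memory tarball that mimics the three entries of job 10942. The prefix, the exact member paths inside the real tarball, and the function name `classify` are assumptions made for illustration.

```python
import io
import tarfile

# Assumed prefix, based on the paths named in the report above.
PREFIX = "2023.06/software/linux/x86_64/amd/zen2"


def classify(names):
    """Sort tarball member names into modules, software, and everything else."""
    groups = {"modules": [], "software": [], "other": []}
    for name in names:
        if name.startswith(f"{PREFIX}/modules/all/"):
            groups["modules"].append(name)
        elif name.startswith(f"{PREFIX}/software/"):
            groups["software"].append(name)
        else:
            groups["other"].append(name)
    return groups


# Build a small in-memory tarball holding three empty files (paths are assumed).
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
    for name in [
        "2023.06/init/easybuild/eb_hooks.py",
        "2023.06/scripts/gpu_support/nvidia/install_cudnn_host_injections.sh",
        f"{PREFIX}/.lmod/SitePackage.lua",
    ]:
        tar.addfile(tarfile.TarInfo(name))  # no file object given, so size is 0

buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r:gz") as tar:
    groups = classify(tar.getnames())

print({kind: len(members) for kind, members in groups.items()})
```

A tarball with empty `modules` and `software` groups, as in job 10942, means the build produced no installation at all, which is consistent with the missing "No missing installations" message.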
May 17 13:05:11 UTC 2024 test result
:grin: SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-10942.out
:x: found message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case
trz42 commented 4 months ago

One more time...

bot: build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2

eessi-bot[bot] commented 4 months ago
Updates by the bot instance eessi-bot-mc-aws (click for details)
- received bot command `build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2` from `trz42`
  - expanded format: `build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2`
- handling command `build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2` resulted in:
  - submitted job `10943`, for details & status see https://github.com/EESSI/software-layer/pull/581#issuecomment-2117581012

eessi-bot[bot] commented 4 months ago

Updates by the bot instance eessi-bot-mc-azure (click for details)
- received bot command `build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2` from `trz42`
  - expanded format: `build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2`
- handling command `build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2` resulted in:
  - no jobs were submitted
eessi-bot[bot] commented 4 months ago
New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_581/10943

| date | job status | comment |
| --- | --- | --- |
| May 17 13:14:32 UTC 2024 | submitted | job id 10943 awaits release by job manager |
| May 17 13:15:15 UTC 2024 | released | job awaits launch by Slurm scheduler |
| May 17 13:16:17 UTC 2024 | running | job 10943 is running |

May 17 13:24:26 UTC 2024 finished
:grin: SUCCESS (click triangle for details)
Details
:white_check_mark: job output file slurm-10943.out
:white_check_mark: no message matching ERROR:
:white_check_mark: no message matching FAILED:
:white_check_mark: no message matching required modules missing:
:white_check_mark: found message(s) matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1715951838.tar.gz
size: 698 MiB (732495999 bytes)
entries: 74
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
cuDNN/8.9.2.26-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/software
cuDNN/8.9.2.26-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/install_cudnn_host_injections.sh
.lmod/SitePackage.lua
May 17 13:24:26 UTC 2024 test result
:grin: SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-10943.out
:white_check_mark: no message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case
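
The status timestamps in the report for job 10943 are enough to work out how long the job spent in each state. A small sketch, assuming the timestamp layout shown in this thread and an English locale for the month abbreviation; the names `STAMP_FORMAT` and `durations` are hypothetical.

```python
from datetime import datetime

# Layout of the timestamps in the bot's status report, e.g. "May 17 13:14:32 UTC 2024".
STAMP_FORMAT = "%b %d %H:%M:%S UTC %Y"

# Status rows of job 10943, copied from the report above.
rows = [
    ("May 17 13:14:32 UTC 2024", "submitted"),
    ("May 17 13:15:15 UTC 2024", "released"),
    ("May 17 13:16:17 UTC 2024", "running"),
    ("May 17 13:24:26 UTC 2024", "finished"),
]


def durations(status_rows):
    """Return the seconds elapsed between each pair of consecutive states."""
    parsed = [(datetime.strptime(stamp, STAMP_FORMAT), state) for stamp, state in status_rows]
    return {
        f"{earlier_state} -> {later_state}": int((later - earlier).total_seconds())
        for (earlier, earlier_state), (later, later_state) in zip(parsed, parsed[1:])
    }


elapsed = durations(rows)
print(elapsed)
```

For this job the build itself (running to finished) took 489 seconds, a little over eight minutes.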
ocaisa commented 4 months ago

@trz42 I will take your updated host_injections script for a test drive tomorrow, I think I have a few suggestions there and will open a PR to your branch

ocaisa commented 4 months ago

I also get the feeling that if we are going to move to easystack files (a good idea) then we should probably ship the ones we expect people to use

trz42 commented 4 months ago

> @trz42 I will take your updated host_injections script for a test drive tomorrow, I think I have a few suggestions there and will open a PR to your branch

Just updated the script with some improvements/fixes after my own testing.

trz42 commented 4 months ago

Run another build after several changes...

bot: build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2

eessi-bot[bot] commented 4 months ago
Updates by the bot instance eessi-bot-mc-aws (click for details)
- received bot command `build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2` from `trz42`
  - expanded format: `build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2`
- handling command `build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2` resulted in:
  - submitted job `11284`, for details & status see https://github.com/EESSI/software-layer/pull/581#issuecomment-2126650177

eessi-bot[bot] commented 4 months ago

Updates by the bot instance eessi-bot-mc-azure (click for details)
- received bot command `build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2` from `trz42`
  - expanded format: `build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2`
- handling command `build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2` resulted in:
  - no jobs were submitted
eessi-bot[bot] commented 4 months ago
New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_581/11284

| date | job status | comment |
| --- | --- | --- |
| May 23 09:28:36 UTC 2024 | submitted | job id 11284 awaits release by job manager |
| May 23 09:29:06 UTC 2024 | released | job awaits launch by Slurm scheduler |
| May 23 09:30:09 UTC 2024 | running | job 11284 is running |

May 23 09:42:29 UTC 2024 finished
:grin: SUCCESS (click triangle for details)
Details
:white_check_mark: job output file slurm-11284.out
:white_check_mark: no message matching ERROR:
:white_check_mark: no message matching FAILED:
:white_check_mark: no message matching required modules missing:
:white_check_mark: found message(s) matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1716456951.tar.gz
size: 698 MiB (732492073 bytes)
entries: 75
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
cuDNN/8.9.2.26-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/software
cuDNN/8.9.2.26-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml
2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
.lmod/SitePackage.lua
May 23 09:42:29 UTC 2024 test result
:grin: SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-11284.out
:white_check_mark: no message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case