easybuilders / easybuild-easyconfigs

A collection of easyconfig files that describe which software to build using which build options with EasyBuild.
https://easybuild.io
GNU General Public License v2.0

{bio}[foss/2023b] GROMACS v2024.2 w/ CUDA 12.5.0 #20809

Open boegel opened 3 months ago

boegel commented 3 months ago

(created using eb --new-pr)

requires:

boegel commented 3 months ago

Test report by @boegel FAILED Build succeeded for 1 out of 2 (1 easyconfigs in total) node3306.joltik.os - Linux RHEL 8.8, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz, 1 x NVIDIA Tesla V100-SXM2-32GB, 545.23.08, Python 3.6.8 See https://gist.github.com/boegel/0e97c95f98a87e72fd334b973a72901f for a full test report.

edit: Timeout for MdrunCoordinationCouplingTests2Ranks because $OMP_PROC_BIND was set to TRUE in environment...

boegel commented 3 months ago

Test report by @boegel FAILED Build succeeded for 0 out of 1 (1 easyconfigs in total) node3903.accelgor.os - Linux RHEL 8.8, x86_64, AMD EPYC 7413 24-Core Processor, 1 x NVIDIA NVIDIA A100-SXM4-80GB, 545.23.08, Python 3.6.8 See https://gist.github.com/boegel/cf1254074701e06327ebcf84b832673c for a full test report.

edit: Timeout for MdrunCoordinationCouplingTests2Ranks because $OMP_PROC_BIND was set to TRUE in environment (?)

boegel commented 3 months ago

Not setting $OMP_PROC_BIND to BIND on our system is not an option, because then the GROMACS test suite doesn't finish even after 11 hours (still running)...

boegel commented 3 months ago

Maybe we should always use -DGMX_TEST_TIMEOUT_FACTOR to increase the timeout a bit, see also https://gitlab.com/gromacs/gromacs/-/issues/5062.
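
For illustration, a minimal sketch of how that could look in an easyconfig (which uses Python syntax). It assumes the standard `configopts` easyconfig parameter is passed through to CMake for this build; the factor of 3 is an arbitrary example value, not something that was tested in this PR.

```python
# Hypothetical easyconfig fragment; the factor below is an illustrative assumption.
timeout_factor = 3

# GMX_TEST_TIMEOUT_FACTOR scales the timeouts of the GROMACS tests registered with CTest.
configopts = '-DGMX_TEST_TIMEOUT_FACTOR=%d ' % timeout_factor
```

A larger factor only gives slow tests more time to finish; it would not help if a test is genuinely stuck.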

I think the issue in my case is that I'm running in a Slurm job that's asking for a partial node, and I'm not getting lucky w.r.t. which cores are assigned for the job, which makes this particular test quite slow...
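
One way to check which cores a job actually got is sketched below. This is a diagnostic sketch, not part of the PR: the helper name `describe_cpu_binding` is made up for this example, and it assumes a Linux node where `os.sched_getaffinity` is available (it falls back to all CPUs otherwise). `SLURM_CPUS_ON_NODE` is only set inside a Slurm job.

```python
import os


def describe_cpu_binding():
    """Collect the cores this process may run on, plus related environment settings."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: the set of CPU ids the current process is allowed to run on
        cores = sorted(os.sched_getaffinity(0))
    else:
        # Fallback for platforms without affinity support: assume all CPUs are usable
        cores = list(range(os.cpu_count() or 1))
    return {
        "cores": cores,
        "omp_proc_bind": os.environ.get("OMP_PROC_BIND"),
        "slurm_cpus_on_node": os.environ.get("SLURM_CPUS_ON_NODE"),
    }


if __name__ == "__main__":
    info = describe_cpu_binding()
    print("cores available to this process:", info["cores"])
    print("OMP_PROC_BIND:", info["omp_proc_bind"])
    print("SLURM_CPUS_ON_NODE:", info["slurm_cpus_on_node"])
```

Running this inside the job, before the test step, would show whether the assigned cores are contiguous or scattered across the node.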

bedroge commented 3 months ago

Test report by @bedroge SUCCESS Build succeeded for 1 out of 1 (1 easyconfigs in total) gpu2 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz (skylake_avx512), 1 x NVIDIA GRID V100D-32Q, 535.161.07, Python 3.6.8 See https://gist.github.com/bedroge/d229b0d73d21ced263c31f40a2f15f5c for a full test report.

branfosj commented 3 months ago

Test report by @branfosj SUCCESS Build succeeded for 1 out of 1 (1 easyconfigs in total) bear-pg0208u15a - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), 1 x NVIDIA NVIDIA A100-SXM4-40GB, 535.154.05, Python 3.6.8 See https://gist.github.com/branfosj/3dd073c237e57b2305fcf41b13f9f6ba for a full test report.

bedroge commented 3 months ago

> Not setting $OMP_PROC_BIND to BIND on our system is not an option, because then the GROMACS test suite doesn't finish even after 11 hours (still running)...

I'm now running it in a Slurm job on an Icelake+A100 node (the successful test report was done on an interactive node without Slurm), and that one also seems to get stuck or something. The test step of the first iteration has been running for more than an hour, while it only took 18 minutes for the interactive V100 build.

SebastianAchilles commented 3 months ago

@boegelbot please test @ jsc-zen3-a100 CORE_CNT=16

boegelbot commented 3 months ago

@SebastianAchilles: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=20809 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_20809 --ntasks="16" --partition=jsczen3g --gres=gpu:1 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

Test results coming soon (I hope)...

*- notification for comment with ID 2178169901 processed* *Message to humans: this is just bookkeeping information for me, it is of no use to you (unless you think I have a bug, which I don't).*

boegelbot commented 3 months ago

Test report by @boegelbot SUCCESS Build succeeded for 1 out of 1 (1 easyconfigs in total) jsczen3g1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.4, x86_64, AMD EPYC-Milan Processor (zen3), 1 x NVIDIA NVIDIA A100 80GB PCIe, 550.54.15, Python 3.9.18 See https://gist.github.com/boegelbot/bb3a1716c849cb6a6085c1b64622964e for a full test report.

akesandgren commented 3 months ago

Test report by @akesandgren SUCCESS Build succeeded for 1 out of 1 (1 easyconfigs in total) b-cn1611.hpc2n.umu.se - Linux Ubuntu 22.04, x86_64, AMD EPYC 7313 16-Core Processor, 1 x NVIDIA NVIDIA A100 80GB PCIe, 550.78, Python 3.10.12 See https://gist.github.com/akesandgren/446251f9d574523a8ff45149c187361b for a full test report.

akesandgren commented 3 months ago

Test report by @akesandgren SUCCESS Build succeeded for 1 out of 1 (1 easyconfigs in total) b-cn1502.hpc2n.umu.se - Linux Ubuntu 20.04, x86_64, Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz, 2 x NVIDIA Tesla V100-PCIE-16GB, 545.29.06, Python 3.8.10 See https://gist.github.com/akesandgren/8236afa5668895b2c6201bce5be8e959 for a full test report.

akesandgren commented 3 months ago

Test report by @akesandgren SUCCESS Build succeeded for 1 out of 1 (1 easyconfigs in total) b-cn1602.hpc2n.umu.se - Linux Ubuntu 22.04, x86_64, AMD EPYC 9454 48-Core Processor, 4 x NVIDIA NVIDIA H100 80GB HBM3, 550.78, Python 3.10.12 See https://gist.github.com/akesandgren/259dc4b1f2a21ccb5103d17f61c10614 for a full test report.

akesandgren commented 3 months ago

Test report by @akesandgren SUCCESS Build succeeded for 1 out of 1 (1 easyconfigs in total) b-cn1604.hpc2n.umu.se - Linux Ubuntu 22.04, x86_64, AMD EPYC 9254 24-Core Processor, 2 x NVIDIA NVIDIA L40S, 550.78, Python 3.10.12 See https://gist.github.com/akesandgren/cb3dec04d82d480e9bad23a4d85f412f for a full test report.

akesandgren commented 3 months ago

I had no problem building this for an A40 on a broadwell node in a non-interactive batch job.

boegel commented 3 months ago

Test report by @boegel FAILED Build succeeded for 0 out of 1 (1 easyconfigs in total) node3302.joltik.os - Linux RHEL 8.8, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz, 1 x NVIDIA Tesla V100-SXM2-32GB, 545.23.08, Python 3.6.8 See https://gist.github.com/boegel/516323bafd0a78e15c512921d7f25383 for a full test report.

boegel commented 3 months ago

Test report by @boegel FAILED Build succeeded for 0 out of 1 (1 easyconfigs in total) node3902.accelgor.os - Linux RHEL 8.8, x86_64, AMD EPYC 7413 24-Core Processor, 1 x NVIDIA NVIDIA A100-SXM4-80GB, 545.23.08, Python 3.6.8 See https://gist.github.com/boegel/0b1eca1a8e7788ef14314e1781129edb for a full test report.

boegel commented 2 months ago

@boegelbot please test @ jsc-zen3-a100 CORE_CNT=16

boegelbot commented 2 months ago

@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=20809 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_20809 --ntasks="16" --partition=jsczen3g --gres=gpu:1 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

Test results coming soon (I hope)...

*- notification for comment with ID 2206073712 processed* *Message to humans: this is just bookkeeping information for me, it is of no use to you (unless you think I have a bug, which I don't).*

boegel commented 2 months ago

Hmm, it seems the GPU node in jsc-zen3 is currently down, which is why the test report is taking so long to come back...

branfosj commented 2 months ago

Test report by @branfosj SUCCESS Build succeeded for 7 out of 7 (1 easyconfigs in total) bear-pg0208u15a - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), 1 x NVIDIA NVIDIA A100-SXM4-40GB, 535.154.05, Python 3.6.8 See https://gist.github.com/branfosj/bdd71ca9580e9d808e7e994b5b49b221 for a full test report.

boegelbot commented 1 month ago

Test report by @boegelbot SUCCESS Build succeeded for 1 out of 1 (1 easyconfigs in total) jsczen3g1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.4, x86_64, AMD EPYC-Milan Processor (zen3), 1 x NVIDIA NVIDIA A100 80GB PCIe, 555.42.06, Python 3.9.18 See https://gist.github.com/boegelbot/4550d0e26ac916b65979b04ee1f2a134 for a full test report.