Updates the LUMI PyTorch examples to the most recent stable versions of Python, ROCm, and PyTorch as of September 2023. Also updates the LUMI SLURM scripts, since the `eap` partition is no longer available on LUMI.
I have tested these examples on LUMI using `cotainr build lumi_pytorch_rocm_demo.sif --base-image docker://rocm/dev-ubuntu-22.04:5.6.1-complete --conda-env py311_rocm542_pytorch.yml`, since the `--system=lumi-g` option still provides the `rocm-terminal` base image, which does not include all the ROCm pieces needed for the PyTorch wheels.
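For reference, here is a minimal sketch of how the resulting container may be sanity checked on a LUMI-G node. The partition, time limit, and project account are placeholders/assumptions and must be adapted to your own allocation:

```bash
# Illustrative sanity check (assumptions: small-g partition, project_<id> placeholder account).
# Runs the container on a single GCD and asks PyTorch whether it can see the GPU.
srun --partition=small-g --nodes=1 --ntasks=1 --gpus-per-node=1 \
    --time=00:10:00 --account=project_<id> \
    singularity exec lumi_pytorch_rocm_demo.sif \
    python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```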
A few notes about performance of the examples:
- Quite a lot of time (~20 minutes) is still spent compiling MIOpen kernels for the multi-GPU example. Hopefully, this will be resolved by https://github.com/pytorch/pytorch/issues/94482.
- GPU utilization still does not look optimal. Some tweaking of the SLURM resource allocation and/or the `torchrun` configuration is probably needed to improve it; a rough sketch of a possible starting point is included below.
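As a starting point for that tweaking, here is a rough sketch of a SLURM batch script launching `torchrun` on a single LUMI-G node. The partition, account, CPU/GPU resource flags, and the training script name are assumptions/placeholders, not a tuned configuration:

```bash
#!/bin/bash
#SBATCH --partition=standard-g      # LUMI-G partition (assumption; use the partition of your allocation)
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8           # all 8 GCDs on a LUMI-G node
#SBATCH --ntasks-per-node=1         # a single torchrun launcher per node
#SBATCH --cpus-per-task=56
#SBATCH --time=00:30:00
#SBATCH --account=project_<id>      # placeholder project account

# Illustrative only: train_multigpu.py is a placeholder for one of the multi-GPU
# examples; the resource flags and CPU/GPU binding likely need further tuning.
srun singularity exec lumi_pytorch_rocm_demo.sif \
    torchrun --standalone --nproc_per_node=8 train_multigpu.py
```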