kaare-mikkelsen closed this issue 1 month ago
This is probably a LUMI packaging/configuration issue, since this repository doesn't specify a `systems.json` that'd contain `lumi-g` (other than as an example in the readme and documentation).
Sounds likely. Where would you suggest I file it instead?
@kaare-mikkelsen I believe @joasode is aware of this and looking into it.
Yes, https://lumi-supercomputer.eu/user-support/need-help/ is the right place to report this. However, as @TheBlackKoala noted we are already working with the LUMI User Support Team to sort this out, so no need to open a ticket this time. The core issue here is that there are currently no officially supported LUMI base images available following the recent LUMI maintenance break. We are looking at possible workarounds until such base images become available. Hopefully, we will have some recommendations ready later today or tomorrow.
There was a big system update to LUMI a couple of weeks ago. It included an update to the KFD/AMDGPU driver on LUMI to align with ROCm 6.0. This driver version officially supports ROCm 5.6-6.2. All the previous LUMI ROCm base images used with `cotainr build --system=lumi-g ...` on LUMI were deprecated following the maintenance break, since they were based on the now too old ROCm 5.4-5.5 and/or built against a now incompatible version of the Cray Libfabric network stack on LUMI (which is used for fully hardware-accelerated RCCL via the aws-ofi-rccl plugin when scaling to multiple compute nodes). The old base images are still available on LUMI under `/appl/local/containers/prior-sep2024-update/sif-images/`. The ROCm 5.6.x images may still work depending on your specific use case. So far, the LUMI User Support Team has not released any new base images for ROCm 5.6-6.2 built against the new Cray Libfabric network stack on LUMI. Only a few new PyTorch containers are currently available under `/appl/local/containers/sif-images/`. It is unclear when new LUMI ROCm base images will be available. As soon as they become available, we will update the cotainr installation in the `CrayEnv` stack on LUMI to make them available via `--system=lumi-g`.
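As a rough illustration of the version window described above, the following shell sketch checks whether a given ROCm version falls within the 5.6-6.2 range that the updated driver officially supports. The function name `rocm_in_supported_range` is hypothetical and not part of cotainr or any LUMI tooling; it only compares the major and minor version numbers and says nothing about Libfabric compatibility.

```shell
# Hypothetical helper (not part of cotainr or LUMI tooling): succeed if the
# given ROCm version lies within the 5.6-6.2 range quoted in this thread.
# Returns 0 if in range, 1 if out of range, 2 if the version cannot be parsed.
rocm_in_supported_range() {
    local version="$1" major minor rest
    major="${version%%.*}"
    case "$version" in
        *.*) rest="${version#*.}"; minor="${rest%%.*}" ;;
        *)   minor=0 ;;
    esac
    # Reject anything that is not purely numeric.
    case "$major$minor" in
        ''|*[!0-9]*) return 2 ;;
    esac
    [ -n "$major" ] && [ -n "$minor" ] || return 2
    # Fold major.minor into one integer, e.g. 5.6 -> 506 and 6.2 -> 602.
    local combined=$(( major * 100 + minor ))
    [ "$combined" -ge 506 ] && [ "$combined" -le 602 ]
}
```

For instance, `rocm_in_supported_range 6.0.3` succeeds, while `rocm_in_supported_range 5.4.6` fails, matching the deprecation of the ROCm 5.4-5.5 images.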
You can always manually pick a base image for use with cotainr via `--base-image=<some_base_image_URI>` instead of using `--system`. Until `--system=lumi-g` works again on LUMI, here are some suggested base images for different use cases. These are all suboptimal in different ways, so please only use them until a proper set of LUMI base images becomes available:
Use case | Base image | Notes |
---|---|---|
Need ROCm >=6.0, but only scales to a single node on LUMI | `docker://rocm/dev-ubuntu-22.04:6.0.2-complete` (or another tag, if needed) | Falls back to communication via sockets across multiple nodes which doesn't scale very well. |
Need ROCm 5.7 or 6.0 and needs to scale to multiple nodes | `/appl/local/containers/sif-images/lumi-pytorch-rocm-5.7.3-python-3.12-pytorch-v2.2.2.sif` or `/appl/local/containers/sif-images/lumi-pytorch-rocm-6.0.3-python-3.12-pytorch-v2.3.1.sif` | Cotainr will install an additional conda environment into these containers in addition to the one they already provide. The resulting image becomes very large and bloated, but should otherwise work. |
Need ROCm =5.6 | `/appl/local/containers/prior-sep2024-update/sif-images/lumi-rocm-rocm-5.6.1.sif` | This may or may not work depending on your use case. |
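
One way to apply the table in a build script is to list candidate images in order of preference and use the first one that is usable. The sketch below assumes a hypothetical helper named `pick_base_image`, which is not part of cotainr. Local paths are accepted only if the file exists, while anything that looks like a URI (such as `docker://...`) is accepted without checking.

```shell
# Hypothetical helper: print the first usable base image from the arguments.
# Local paths must exist as regular files; URIs such as docker://... are
# accepted as-is because they cannot be verified with a file test.
pick_base_image() {
    local candidate
    for candidate in "$@"; do
        case "$candidate" in
            *://*) printf '%s\n' "$candidate"; return 0 ;;
        esac
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    echo "pick_base_image: no usable base image found" >&2
    return 1
}
```

For example, `cotainr build my_container.sif --base-image="$(pick_base_image /appl/local/containers/sif-images/lumi-pytorch-rocm-6.0.3-python-3.12-pytorch-v2.3.1.sif docker://rocm/dev-ubuntu-22.04:6.0.2-complete)" --conda-env=my_conda_env.yml` would prefer the multi-node capable image and fall back to the single-node Docker image if that file is missing.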
As for PyTorch versions, we generally recommend `torch<=2.3` since 2.4 currently causes crashes on LUMI in certain situations. Also, we have seen degraded performance for some PyTorch and TensorFlow training workflows following the maintenance break on LUMI. We are still investigating this.
On LUMI, the following LUMI ROCm base images are now available under `/appl/local/containers/sif-images/`:
On LUMI, you may use these with the `--base-image` option, e.g. `cotainr build my_container.sif --base-image=/appl/local/containers/sif-images/lumi-rocm-rocm-6.0.3.sif --conda-env=my_conda_env.yml`.
We'll try to update the cotainr installation on LUMI ASAP to use `lumi-rocm-rocm-6.0.3.sif` when specifying `--system=lumi-g`.
When building using `--system=lumi-g`, I get the following error:
SingularitySandbox.err:-: FATAL: Unable to build from /appl/local/containers/sif-images/lumi-rocm-rocm-5.6.1.sif: unable to open file /appl/local/containers/sif-images/lumi-rocm-rocm-5.6.1.sif: open /appl/local/containers/sif-images/lumi-rocm-rocm-5.6.1.sif: no such file or directory
Which is of course pretty reasonable.
Using `--base-image=/appl/local/containers/sif-images/lumi-pytorch-rocm-6.0.3-python-3.12-pytorch-v2.3.1.sif` seems to work fine. I am importing cotainr from CrayEnv.
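
Until `--system=lumi-g` points at an existing image again, a build script could try it first and fall back to an explicit base image. The following is only a sketch: the wrapper name `build_with_fallback` and its argument order are made up for illustration, and it assumes that `cotainr build` exits with a non-zero status when the base image cannot be opened, as the error above suggests but does not confirm.

```shell
# Hypothetical wrapper: try --system=lumi-g first, then an explicit base image.
# Usage: build_with_fallback <output.sif> <conda_env.yml> <fallback_base_image>
# Assumption: `cotainr build` returns a non-zero exit status on failure.
build_with_fallback() {
    local output="$1" conda_env="$2" fallback_image="$3"
    if cotainr build "$output" --system=lumi-g --conda-env="$conda_env"; then
        return 0
    fi
    echo "--system=lumi-g failed, retrying with --base-image=$fallback_image" >&2
    cotainr build "$output" --base-image="$fallback_image" --conda-env="$conda_env"
}
```

A call such as `build_with_fallback my_container.sif my_conda_env.yml /appl/local/containers/sif-images/lumi-pytorch-rocm-6.0.3-python-3.12-pytorch-v2.3.1.sif` would then behave like the manual workaround described in this comment.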