DeiC-HPC / cotainr

cotainr - a user space Apptainer/Singularity container builder.
European Union Public License 1.2

cotainr still looks for old base images on LUMI #67

Open kaare-mikkelsen opened 1 week ago

kaare-mikkelsen commented 1 week ago

When building with --system=lumi-g, I get the following error:

SingularitySandbox.err:-: FATAL: Unable to build from /appl/local/containers/sif-images/lumi-rocm-rocm-5.6.1.sif: unable to open file /appl/local/containers/sif-images/lumi-rocm-rocm-5.6.1.sif: open /appl/local/containers/sif-images/lumi-rocm-rocm-5.6.1.sif: no such file or directory

Which is of course pretty reasonable.

Using

--base-image=/appl/local/containers/sif-images/lumi-pytorch-rocm-6.0.3-python-3.12-pytorch-v2.3.1.sif

seems to work fine. I am importing cotainr from CrayEnv.

akx commented 1 week ago

This is probably a LUMI packaging/configuration issue, since this repository doesn't specify a systems.json that'd contain lumi-g (other than as an example in the readme and documentation).

kaare-mikkelsen commented 1 week ago

Sounds likely. Where would you suggest I file it instead?

akx commented 1 week ago

https://lumi-supercomputer.eu/user-support/need-help/ probably :)

TheBlackKoala commented 1 week ago

@kaare-mikkelsen I believe @joasode is aware of this and looking into it.

Chroxvi commented 1 week ago

Yes, https://lumi-supercomputer.eu/user-support/need-help/ is the right place to report this. However, as @TheBlackKoala noted, we are already working with the LUMI User Support Team to sort this out, so there is no need to open a ticket this time. The core issue is that there are currently no officially supported LUMI base images available following the recent LUMI maintenance break. We are looking at possible workarounds until such base images become available. Hopefully, we will have some recommendations ready later today or tomorrow.

Chroxvi commented 1 week ago

Some context

There was a big system update to LUMI a couple of weeks ago. It included an update of the KFD/AMDGPU driver on LUMI to align with ROCm 6.0. This driver version officially supports ROCm 5.6-6.2.

All the previous LUMI ROCm base images used with cotainr build --system=lumi-g ... on LUMI were deprecated following the maintenance break, since they were based on the now too old ROCm 5.4-5.5 and/or built against a now incompatible version of the Cray Libfabric network stack on LUMI (which is used for fully hardware-accelerated RCCL via the aws-ofi-rccl plugin when scaling to multiple compute nodes). The old base images are still available on LUMI under /appl/local/containers/prior-sep2024-update/sif-images/. The ROCm 5.6.x images may still work, depending on your specific use case.

So far, the LUMI User Support Team has not released any new base images for ROCm 5.6-6.2 built against the new Cray Libfabric network stack on LUMI. Only a few new PyTorch containers are currently available under /appl/local/containers/sif-images/. It is unclear when new LUMI ROCm base images will be available. As soon as they are, we will update the cotainr installation in the CrayEnv stack on LUMI to make them available via --system=lumi-g.
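As a rough illustration of the driver's supported ROCm range mentioned above, here is a minimal Python sketch. It is not part of cotainr or any LUMI tooling: the function names and constants are hypothetical, and the 5.6-6.2 range is simply copied from this comment.

```python
# Hypothetical helper, not part of cotainr or any LUMI tooling.
# The supported range (ROCm 5.6 to 6.2) is taken from the comment above.
MIN_ROCM = (5, 6)
MAX_ROCM = (6, 2)


def parse_major_minor(version):
    """Return (major, minor) as integers from a string such as '5.6.1'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def rocm_supported_by_driver(version):
    """True if the ROCm major.minor version lies within the supported range."""
    return MIN_ROCM <= parse_major_minor(version) <= MAX_ROCM


for v in ("5.4.6", "5.6.1", "6.0.3", "6.3.0"):
    print(v, rocm_supported_by_driver(v))
```

Keep in mind that a supported ROCm version is necessary but not sufficient: an image may still fail if it was built against the old Cray Libfabric stack.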

Workarounds

You can always manually pick a base image for use with cotainr via --base-image=<some_base_image_URI> instead of using --system. Until --system=lumi-g works again on LUMI, here are some suggested base images for different use cases. These are all suboptimal in different ways, so please only use them until a proper set of LUMI base images becomes available:

| Use case | Base image | Notes |
| --- | --- | --- |
| Need ROCm >=6.0, but only scales to a single node on LUMI | docker://rocm/dev-ubuntu-22.04:6.0.2-complete (or another tag, if needed) | Falls back to communication via sockets across multiple nodes which doesn't scale very well. |
| Need ROCm 5.7 or 6.0 and needs to scale to multiple nodes | /appl/local/containers/sif-images/lumi-pytorch-rocm-5.7.3-python-3.12-pytorch-v2.2.2.sif or /appl/local/containers/sif-images/lumi-pytorch-rocm-6.0.3-python-3.12-pytorch-v2.3.1.sif | Cotainr will install an additional conda environment into these containers in addition to the one they already provide. The resulting image becomes very large and bloated, but should otherwise work. |
| Need ROCm =5.6 | /appl/local/containers/prior-sep2024-update/sif-images/lumi-rocm-rocm-5.6.1.sif | This may or may not work depending on your use case. |
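To make the table above concrete, here is a minimal Python sketch that maps a use case to one of the suggested base images and assembles the corresponding cotainr build command. The dictionary keys, the function name, and the file names my_container.sif and my_env.yml are hypothetical placeholders; only the image paths and the --base-image and --conda-env options come from cotainr and this thread.

```python
import shlex

# Base images copied from the table above. The dictionary keys are
# hypothetical labels invented for this sketch.
BASE_IMAGES = {
    "rocm6-single-node": "docker://rocm/dev-ubuntu-22.04:6.0.2-complete",
    "rocm5.7-multi-node": "/appl/local/containers/sif-images/lumi-pytorch-rocm-5.7.3-python-3.12-pytorch-v2.2.2.sif",
    "rocm6.0-multi-node": "/appl/local/containers/sif-images/lumi-pytorch-rocm-6.0.3-python-3.12-pytorch-v2.3.1.sif",
    "rocm5.6": "/appl/local/containers/prior-sep2024-update/sif-images/lumi-rocm-rocm-5.6.1.sif",
}


def cotainr_build_command(use_case, image_name, conda_env):
    """Assemble a `cotainr build` argument list for the given use case."""
    base_image = BASE_IMAGES[use_case]
    return [
        "cotainr",
        "build",
        image_name,
        f"--base-image={base_image}",
        f"--conda-env={conda_env}",
    ]


# Placeholder output image and conda environment file names.
cmd = cotainr_build_command("rocm6.0-multi-node", "my_container.sif", "my_env.yml")
print(shlex.join(cmd))
```

The printed command can be pasted into a shell on LUMI, or the list can be passed to subprocess.run(cmd, check=True). An unknown use case raises a KeyError rather than silently falling back to a missing image.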

A note on PyTorch versions

As for PyTorch versions, we generally recommend torch<=2.3, since 2.4 currently causes crashes on LUMI in certain situations. Also, we have seen degraded performance for some PyTorch and TensorFlow training workflows following the maintenance break on LUMI. We are still investigating this.