vladislavivanistsev closed 2 years ago
I understand how frustrating it can be to not have gpaw fully working from the get-go.
Choosing to add a dependency out of convenience in a subproject is bound to create problems: there are reasons why openmpi does not pull ucx as a hard dependency, and I think we should honour that for the very same reasons.
In this instance, I would suggest getting in touch with the openmpi-feedstock crowd to fully understand their choice (also, ucx is their dependency, not gpaw's!). If they don't agree to add ucx as a hard dependency, it might be worthwhile to update gpaw's documentation to indicate that ucx might be needed.
Just to be clear: this also affects me for the very same reason, so don't think I am simply being dismissive.
@gdonval Agree with the explanation. In fact, an informative message appears when installing openmpi:
In addition, the UCX support is also built but disabled by default. To enable it, first install UCX (conda install -c conda-forge ucx). Then, set the environment variables OMPI_MCA_pml="ucx" OMPI_MCA_osc="ucx" before launching your MPI processes. Equivalently, you can set the MCA parameters in the command line: mpiexec --mca pml ucx --mca osc ucx ... Note that you might also need to set UCX_MEMTYPE_CACHE=n for CUDA awareness via UCX. Please consult UCX's documentation for detail.
Here is a discussion about UCX in regard to HPC: https://github.com/conda-forge/openmpi-feedstock/pull/87
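For anyone landing here, the steps in the message quoted above can be sketched roughly as follows. This is a minimal sketch, assuming UCX has already been installed with the conda command from that message; the process count and the script name `my_script.py` are placeholders, not anything from gpaw or openmpi.

```shell
# Sketch only: enable UCX in Open MPI through environment variables,
# following the openmpi post-install message quoted above.
# Assumes UCX is already installed: conda install -c conda-forge ucx
export OMPI_MCA_pml="ucx"
export OMPI_MCA_osc="ucx"

# Per the same message, this may also be needed for CUDA awareness via UCX:
# export UCX_MEMTYPE_CACHE=n

# Placeholder launch line (process count and script name are hypothetical).
# It is only printed here, not executed.
launch_cmd="mpiexec -n 4 python my_script.py"
echo "$launch_cmd"
```

Setting the variables in the environment should be equivalent to passing `--mca pml ucx --mca osc ucx` on the mpiexec command line, as the message states.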
Comment:
By default, openmpi fails to run with RoCE (RDMA over Converged Ethernet). Installing Unified Communication X (UCX) and adding "--mca btl_openib_rroce_enable 1" to the mpirun command fixes the problem. How about adding UCX to the GPAW feedstock?