Closed: leofang closed this issue 1 year ago.
Internal ticket: CUQNT-1594.
This is fixed in cuQuantum 23.03. We now ask users who build `libcutensornet_distributed_interface_mpi.so` themselves to link it against MPI by passing `-lmpi` to the compiler/linker.
The cuTensorNet-MPI wrapper library (`libcutensornet_distributed_interface_mpi.so`) needs to be linked to the MPI library `libmpi.so`. If you use our conda-forge packages or the cuQuantum Appliance container, or compile your own using the provided `activate_mpi.sh` script, this is taken care of for you.
https://docs.nvidia.com/cuda/cuquantum/cutensornet/release_notes.html#cutensornet-v2-1-0
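For reference, here is a hedged sketch of what building the wrapper yourself could look like. The source file name and the `CUTENSORNET_ROOT`/`MPI_ROOT` prefixes are assumptions for illustration (the provided `activate_mpi.sh` script handles this for you with the correct paths); the important part is the explicit `-lmpi` at link time.

```bash
# Hedged sketch only: file names and prefixes are assumptions, not the exact
# commands used by activate_mpi.sh. The key point is the explicit -lmpi.
CUTENSORNET_ROOT=${CUTENSORNET_ROOT:-/opt/cuquantum}   # assumed cuTensorNet install prefix
MPI_ROOT=${MPI_ROOT:-/usr/local/mpi}                   # assumed MPI install prefix

gcc -std=c99 -fPIC -shared \
    -I"${CUTENSORNET_ROOT}/include" -I"${MPI_ROOT}/include" \
    cutensornet_distributed_interface_mpi.c \
    -L"${MPI_ROOT}/lib" -lmpi \
    -o libcutensornet_distributed_interface_mpi.so

# Verify that a dependency on libmpi was recorded:
ldd libcutensornet_distributed_interface_mpi.so | grep libmpi
```

Afterwards, point `$CUTENSORNET_COMM_LIB` at the resulting `.so` so cuTensorNet can find it at run time.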
Hi @leofang, sorry to re-open this, but after a while I tried automatic contraction with cuQuantum 23.06 (basically this script https://github.com/NVIDIA/cuQuantum/blob/main/python/samples/cutensornet/coarse/example22_mpi_auto.py) on Perlmutter, using its MPICH, and I again got the `CUTENSORNET_STATUS_DISTRIBUTED_FAILURE` error.
I do install cuQuantum from conda-forge, so according to you the linking to MPI should be sorted... However, I fetch only an "external" placeholder mpich from conda-forge and then build mpi4py locally. Could it be that, because of this, `libcutensornet_distributed_interface_mpi.so` is not linked properly?
Ah, there we go:

```
$ ldd ~/.conda/envs/py-cuquantum-23.06.0-mypich-py3.9/lib/libcutensornet_distributed_interface_mpi.so
    linux-vdso.so.1 (0x00007fffc8de5000)
    libmpi.so.12 => not found
    libc.so.6 => /lib64/libc.so.6 (0x00007ff8c1f36000)
    /lib64/ld-linux-x86-64.so.2 (0x00007ff8c2157000)
```
Let me try to link it manually if I can...
Done - I added MPICH's `lib-abi-mpich` path to `$LD_LIBRARY_PATH`, relinked, and it works now! Sorry for the noise :)
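For anyone hitting the same thing, a sketch of that fix is below. The MPICH location is an assumption; `$MPICH_DIR` is assumed to point at your MPICH installation (on Cray systems it is usually exported by the `cray-mpich` module).

```bash
# Sketch of the fix: make libmpi.so.12 resolvable at load time.
# $MPICH_DIR is an assumed variable pointing at the MPICH installation.
export LD_LIBRARY_PATH="${MPICH_DIR}/lib-abi-mpich:${LD_LIBRARY_PATH}"

# Re-check that the dependency now resolves (assumes the conda env is active):
ldd "${CONDA_PREFIX}/lib/libcutensornet_distributed_interface_mpi.so" | grep libmpi
```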
MPICH users running this sample might see a `CUTENSORNET_STATUS_DISTRIBUTED_FAILURE` error.
This is a known issue for the automatic MPI support using cuQuantum Python 22.11 / cuTensorNet 2.0.0 + mpi4py + MPICH.
The reason is that Python by default dynamically loads shared libraries in private mode (see, e.g., the documentation for `ctypes.DEFAULT_MODE`), which breaks the assumption of `libcutensornet_distributed_interface_mpi.so` (whose path is set via `$CUTENSORNET_COMM_LIB`) that the MPI symbols would be loaded into the public scope. Open MPI is immune to this problem because mpi4py had to "break" this assumption due to a few old Open MPI issues.
There are multiple workarounds that users can choose from:

- Preload the MPI library with `LD_PRELOAD`, e.g., `mpiexec -n 2 -env LD_PRELOAD=$MPI_HOME/lib/libmpi.so python example22_mpi_auto.py` (see the sketch after this list).
- If you build `libcutensornet_distributed_interface_mpi.so` manually, link the MPI library to it via `-lmpi`.
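As a concrete illustration of combining these, here is a hedged sketch; the wrapper and MPI paths are assumptions, so adapt them to your environment.

```bash
# Hedged sketch combining the workarounds above; all paths are assumptions.
# 1) Tell cuTensorNet where the MPI wrapper library lives:
export CUTENSORNET_COMM_LIB="${CONDA_PREFIX}/lib/libcutensornet_distributed_interface_mpi.so"

# 2) Preload libmpi.so so its symbols land in the public (global) scope
#    before Python dlopens the wrapper:
mpiexec -n 2 -env LD_PRELOAD="${MPI_HOME}/lib/libmpi.so" python example22_mpi_auto.py
```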
In a future release, we will add a fix to work around this limitation. See also https://github.com/NVIDIA/cuQuantum/discussions/30 for discussion.