distributed_reset_configuration failed: python: distributed_interfaces/cutensornet_distributed_interface_mpi.c:44: unpackMpiCommunicator: Assertion `sizeof(MPI_Comm) == comm->commSize' failed.

koichi-tsujino commented 1 year ago

I am using the following setup.

Hardware: INSPUR NF5488M5 (V100 version)

Environment:
  Ubuntu 22.04.1 LTS
  Python 3.9.15
  Nvidia driver: 525.60.13
  cuda_12.0.r12.0
  mpich-4.0.3
  mpi4py 3.1.4
  cuquantum 22.11.0

When I run /cuQuantum/python/samples/cutensornet/tensornet_example_mpi.py, I get the following output. It works.

*** Printing is done only from the root process to prevent jumbled messages ***
The number of processes is 1
cuTensorNet-vers: 20000
===== root process device info ======
GPU-name: Tesla V100-SXM3-32GB
GPU-clock: 1597000
GPU-memoryClock: 958000
GPU-nSM: 80
GPU-major: 7
GPU-minor: 0
========================
Include headers and define data types.
Define network, modes, and extents.
Initialize the cuTensorNet library and create a network descriptor.
Process 0 has the path with the lowest FLOP count 4299161600.0.
Find an optimized contraction path with cuTensorNet optimizer.
Allocate workspace.
Create a contraction plan for cuTENSOR and optionally auto-tune it.
Contract the network, each slice uses the same contraction plan.
Check cuTensorNet result against that of cupy.einsum().
num_slices: 1
0.8309440016746521 ms / slice
5173.82831013358 GFLOPS/s
Free resource and exit.

But when I run /cuQuantum/python/samples/cutensornet/tensornet_example_mpi_auto.py, I get the following error.

*** Printing is done only from the root process to prevent jumbled messages ***
The number of processes is 1
cuTensorNet-vers: 20000
===== root process device info ======
GPU-name: Tesla V100-SXM3-32GB
GPU-clock: 1597000
GPU-memoryClock: 958000
GPU-nSM: 80
GPU-major: 7
GPU-minor: 0
========================
Include headers and define data types.
Define network, modes, and extents.
Initialize the cuTensorNet library and create a network descriptor.
python: distributed_interfaces/cutensornet_distributed_interface_mpi.c:44: unpackMpiCommunicator: Assertion `sizeof(MPI_Comm) == comm->commSize' failed.
[suneo:06467] *** Process received signal ***
[suneo:06467] Signal: Aborted (6)
[suneo:06467] Signal code:  (-6)
[suneo:06467] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f55bbd22520]
[suneo:06467] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f55bbd76a7c]
[suneo:06467] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f55bbd22476]
[suneo:06467] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f55bbd087f3]
[suneo:06467] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7f55bbd0871b]
[suneo:06467] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7f55bbd19e96]
[suneo:06467] [ 6] /home/tsujino/anaconda3/envs/cu/lib/libcutensornet_distributed_interface_mpi.so(+0x123c)[0x7f553de1223c]
[suneo:06467] [ 7] /home/tsujino/anaconda3/envs/cu/lib/libcutensornet_distributed_interface_mpi.so(cutensornetMpiCommRank+0x23)[0x7f553de122ae]
[suneo:06467] [ 8] /home/tsujino/anaconda3/envs/cu/lib/python3.9/site-packages/cuquantum/cutensornet/../../../../libcutensornet.so.2(+0x105462)[0x7f554c705462]
[suneo:06467] [ 9] /home/tsujino/anaconda3/envs/cu/lib/python3.9/site-packages/cuquantum/cutensornet/../../../../libcutensornet.so.2(+0x1056bd)[0x7f554c7056bd]
[suneo:06467] [10] /home/tsujino/anaconda3/envs/cu/lib/python3.9/site-packages/cuquantum/cutensornet/../../../../libcutensornet.so.2(+0x1058ed)[0x7f554c7058ed]
[suneo:06467] [11] /home/tsujino/anaconda3/envs/cu/lib/python3.9/site-packages/cuquantum/cutensornet/../../../../libcutensornet.so.2(cutensornetDistributedResetConfiguration+0xd3)[0x7f554c703633]
[suneo:06467] [12] /home/tsujino/anaconda3/envs/cu/lib/python3.9/site-packages/cuquantum/cutensornet/cutensornet.cpython-39-x86_64-linux-gnu.so(+0x26063)[0x7f554e65c063]
[suneo:06467] [13] python[0x507457]
[suneo:06467] [14] python(_PyObject_MakeTpCall+0x2ec)[0x4f068c]
[suneo:06467] [15] python(_PyEval_EvalFrameDefault+0x525b)[0x4ec9fb]
[suneo:06467] [16] python[0x4e689a]
[suneo:06467] [17] python(_PyEval_EvalCodeWithName+0x47)[0x4e6527]
[suneo:06467] [18] python(PyEval_EvalCodeEx+0x39)[0x4e64d9]
[suneo:06467] [19] python(PyEval_EvalCode+0x1b)[0x59329b]
[suneo:06467] [20] python[0x5c0ad7]
[suneo:06467] [21] python[0x5bcb00]
[suneo:06467] [22] python[0x4566f4]
[suneo:06467] [23] python(PyRun_SimpleFileExFlags+0x1a2)[0x5b67e2]
[suneo:06467] [24] python(Py_RunMain+0x37e)[0x5b3d5e]
[suneo:06467] [25] python(Py_BytesMain+0x39)[0x587349]
[suneo:06467] [26] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f55bbd09d90]
[suneo:06467] [27] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f55bbd09e40]
[suneo:06467] [28] python[0x5871fe]
[suneo:06467] *** End of error message ***
Aborted (core dumped)

I have tried the other samples and those work.

haidarazzam commented 1 year ago

Dear koichi-tsujino, Thank you very much for testing cuQuantum 22.11 and reporting the issue. Could you please verify that the environment variables that specify the location of the MPI library are set as noted in the docs? If they are, are there multiple MPI libraries installed on the system, such that the wrapper is compiled with one while the app loads another? Could you please also verify that you are using OpenMPI for both the wrapper and the app? Finally, please set CUTENSORNET_LOG_LEVEL=5 so we can see more details in the output. Thanks
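The checks requested above could be collected with a small script along these lines. This is only a sketch: the helper names and the name-prefix filter are assumptions made for illustration and are not part of cuQuantum. It lists the environment variables whose names suggest they relate to MPI or cuTensorNet, and builds an environment with CUTENSORNET_LOG_LEVEL=5 for rerunning the sample.

```python
import os

# Name prefixes used to pick out relevant variables.
# This filter is an assumption for illustration, not an official list.
PREFIXES = ("CUTENSORNET_", "MPI", "OMPI_")


def relevant_vars(env):
    """Return the entries of env whose names start with one of PREFIXES."""
    return {name: env[name] for name in sorted(env) if name.startswith(PREFIXES)}


def with_verbose_logging(env):
    """Return a copy of env with CUTENSORNET_LOG_LEVEL set to 5."""
    new_env = dict(env)
    new_env["CUTENSORNET_LOG_LEVEL"] = "5"
    return new_env


if __name__ == "__main__":
    for name, value in relevant_vars(os.environ).items():
        print(f"{name}={value}")
    # To rerun the failing sample with verbose logging, the modified
    # environment could be passed to a child process, for example:
    #   subprocess.run(["python", "tensornet_example_mpi_auto.py"],
    #                  env=with_verbose_logging(os.environ))
```

Pasting the printed variables together with the verbose log into the thread would answer most of the questions above in one reply.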

DmitryLyakh commented 1 year ago

Could you please try building and running the tensornet_example_mpi_auto C sample on your machine (samples inside https://github.com/NVIDIA/cuQuantum/tree/main/samples/cutensornet)? Before running the sample, could you please additionally check the environment variable $CUTENSORNET_COMM_LIB that is supposed to point to the libcutensornet_distributed_interface_mpi.so wrapper library.
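One way to check $CUTENSORNET_COMM_LIB is sketched below. It assumes a Linux system where `ldd` is available, and the helper names are hypothetical. The script confirms that the variable points to an existing file and prints the `ldd` lines that mention an MPI shared library, so the resolved path shows which MPI installation the wrapper would load.

```python
import os
import shutil
import subprocess


def mpi_lines(ldd_output):
    """Return the lines of ldd output that mention an MPI shared library."""
    return [line.strip() for line in ldd_output.splitlines() if "libmpi" in line]


def inspect_comm_lib(env=os.environ):
    """Describe where CUTENSORNET_COMM_LIB points and which libmpi it resolves to."""
    path = env.get("CUTENSORNET_COMM_LIB")
    if not path:
        return "CUTENSORNET_COMM_LIB is not set"
    if not os.path.isfile(path):
        return f"CUTENSORNET_COMM_LIB points to a missing file: {path}"
    if shutil.which("ldd") is None:
        return f"{path} exists, but ldd is not available to inspect it"
    # ldd lists the shared libraries the wrapper would load at run time.
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    found = mpi_lines(result.stdout)
    return "\n".join(found) if found else f"{path} does not appear to link against libmpi"


if __name__ == "__main__":
    print(inspect_comm_lib())
```

If the path printed here belongs to a different MPI installation than the one mpi4py was built against, that would match the mismatch scenario discussed in this thread.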

DmitryLyakh commented 1 year ago

One possible reason why you observe a crash is that the MPI library linked to by the sample you are running is different from the MPI library used by the MPI wrapper libcutensornet_distributed_interface_mpi.so, in case multiple MPI libraries are present in your system. In the meantime, let me try to reproduce your issue locally ...
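A mismatch of this kind would be consistent with the failed assertion, because the size of MPI_Comm differs between implementations: MPICH declares MPI_Comm as an int, while Open MPI declares it as a pointer to an opaque struct. The sketch below shows how the application side could be inspected from Python. It assumes mpi4py is installed and uses its `MPI.Get_library_version()` and `MPI._sizeof()` helpers; the function names `describe_handle_size` and `report_app_side` are hypothetical.

```python
import ctypes


def describe_handle_size(size):
    """Map a sizeof(MPI_Comm) value to the handle style it suggests.

    On 64-bit platforms an int is 4 bytes and a pointer is 8 bytes, so the
    two cases can be told apart; on 32-bit platforms they would coincide.
    """
    if size == ctypes.sizeof(ctypes.c_int):
        return "integer handle (as in MPICH)"
    if size == ctypes.sizeof(ctypes.c_void_p):
        return "pointer handle (as in Open MPI)"
    return "unrecognized handle size"


def report_app_side():
    """Return (MPI library version, sizeof(MPI_Comm)) as seen by mpi4py, or None."""
    try:
        from mpi4py import MPI
    except ImportError:
        return None
    return MPI.Get_library_version().strip(), MPI._sizeof(MPI.COMM_WORLD)


if __name__ == "__main__":
    info = report_app_side()
    if info is None:
        print("mpi4py is not installed in this environment")
    else:
        version, size = info
        print(f"MPI library used by mpi4py: {version}")
        print(f"sizeof(MPI_Comm) = {size}: {describe_handle_size(size)}")
```

The size printed here is what the Python application hands to the wrapper. If the wrapper was compiled against an MPI library with a different handle size, the check `sizeof(MPI_Comm) == comm->commSize` would fail as reported.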

DmitryLyakh commented 1 year ago

On our local machine, the C/C++ sample tensornet_example_mpi_auto works fine with both MPICH and OpenMPI. I would guess the issue is related to the Python environment setup or something similar ...

leofang commented 1 year ago

Let's convert this to a discussion thread and continue there, since this is not a bug report.

NVIDIA/cuQuantum, issue #27