NVIDIA / gdrcopy

A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology
MIT License

MAINT: gdr_unmap segfault on master branch via NVSHMEM 2.10.1 on Cray Slingshot 11 with cuFFTMp #296

Open tylerjereddy opened 6 months ago

tylerjereddy commented 6 months ago

Working on Cray Slingshot 11, on 2 nodes with 4 x A100 each, with the test case from https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuFFTMp/samples/r2c_c2r_slabs_GROMACS, modified in this way to force multi-node NVSHMEM (2.10.1):

```diff
diff --git a/cuFFTMp/samples/r2c_c2r_slabs_GROMACS/Makefile b/cuFFTMp/samples/r2c_c2r_slabs_GROMACS/Makefile
index 5d9fa3e..64e39be 100644
--- a/cuFFTMp/samples/r2c_c2r_slabs_GROMACS/Makefile
+++ b/cuFFTMp/samples/r2c_c2r_slabs_GROMACS/Makefile
@@ -15,4 +15,4 @@ $(exe): $(exe).cu
 build: $(exe)
 
 run: $(exe)
-	LD_LIBRARY_PATH="${NVSHMEM_LIB}:${CUFFT_LIB}:${LD_LIBRARY_PATH}" mpirun -oversubscribe -n 4 $(exe)
+	LD_LIBRARY_PATH="${NVSHMEM_LIB}:${CUFFT_LIB}:${LD_LIBRARY_PATH}" /lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/bin/mpirun -oversubscribe -n 8 -N 4 $(exe)
```

I'm seeing the output/backtrace below the fold:

```
Hello from rank 7/8 using GPU 3
Hello from rank 4/8 using GPU 0
Hello from rank 5/8 using GPU 1
Hello from rank 6/8 using GPU 2
Hello from rank 3/8 using GPU 3
Hello from rank 1/8 using GPU 1
Hello from rank 2/8 using GPU 2
Hello from rank 0/8 using GPU 0
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.
ERR: mh is not mapped yet
[nid001217:115514:0:115701] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x18)
ERR: mh is not mapped yet
ERR: mh is not mapped yet
==== backtrace (tid: 115701) ====
 0 0x00000000000168c0 __funlockfile()  ???:0
 1 0x0000000000001aa7 gdr_unmap()  ???:0
 2 0x0000000000032d92 cuda_gdrcopy_dev_unregister()  :0
 3 0x00000000000a488f cxip_unmap()  :0
 4 0x000000000008c165 cxip_rma_cb()  cxip_rma.c:0
 5 0x00000000000adfe5 cxip_evtq_progress()  :0
 6 0x0000000000081695 cxip_ep_progress()  :0
 7 0x000000000008b599 cxip_cntr_readerr()  cxip_cntr.c:0
 8 0x000000000000dfc2 nvshmemt_libfabric_progress()  /lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/include/rdma/fi_eq.h:446
 9 0x00000000000e4bad progress_transports()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/proxy/proxy.cpp:963
10 0x00000000000e51b9 progress()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/proxy/proxy.cpp:992
11 0x000000000000a6ea start_thread()  ???:0
12 0x0000000000117a6f __GI___clone()  ???:0
=================================
```

My full interactive run script is this, which will tell you a bit more about various dependency versions/paths:

```bash
#!/bin/bash -l
#
# setup the runtime environment
#export FI_LOG_LEVEL=debug
#export NVSHMEM_DEBUG=TRACE
export FI_HMEM=cuda
export GDRCOPY_ENABLE_LOGGING=1
# we need special CXI- and CUDA-enabled version of libfabric
# per: https://github.com/ofiwg/libfabric/issues/10001#issuecomment-2078604043
export LD_LIBRARY_PATH="/lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/lib64:$LD_LIBRARY_PATH"
export PATH="/lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/bin:$PATH"
export PATH="$PATH:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/bin"
export PATH="$PATH:/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/bin"
export LD_LIBRARY_PATH="/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/comm_libs/12.3/nccl/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-hc255f5j4fcqhtufeisjj3pytrkv4dqt/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-hc255f5j4fcqhtufeisjj3pytrkv4dqt/lib/ucx:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/targets/x86_64-linux/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/targets/x86_64-linux/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/usr/projects/hpcsoft/cos2/chicoma/cuda-compat/12.0/:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/pmix-4.2.9-7kfa4s6dwyd5wlayw24vx7jai7d4oi4x/lib"
export NVSHMEM_DISABLE_CUDA_VMM=1
export FI_CXI_OPTIMIZED_MRS=false
export NVSHMEM_REMOTE_TRANSPORT=libfabric
export MPI_HOME=/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install
export CUFFT_LIB=/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib
export CUFFT_INC=/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/include/cufftmp
export NVSHMEM_LIB=/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib
export NVSHMEM_INC=/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/include

which fi_info
echo "fi_info -l:"
fi_info -l
echo "fi_info -p cxi:"
fi_info -p cxi

cd /lustre/scratch5/treddy/march_april_2024_testing/github_projects/CUDALibrarySamples/cuFFTMp/samples/r2c_c2r_slabs_GROMACS
make clean
make build
make run
```

More gruesome details about libfabric, CXI, and CUDA support are described at https://github.com/ofiwg/libfabric/issues/10001, but since I'm now apparently segfaulting in gdrcopy, it may be helpful to determine what my next debugging steps should be. I've already discussed things fairly extensively with the NVSHMEM team.

I built the latest gdrcopy master branch with gcc 12.2.0 + cuda/12.0 "modules" loaded:

```
make -j 32 prefix=/lustre/scratch5/treddy/march_april_2024_testing/gdrcopy_install CUDA=/usr/projects/hpcsoft/cos2/chicoma/cuda/12.0 all install
```

It would be awesome if I could get this working somehow. Note that I was originally getting different backtraces with gdrcopy 2.3.

pakmarkthub commented 6 months ago

Hi @tylerjereddy,

I suspect that the segfault is from somewhere in https://github.com/ofiwg/libfabric/blob/main/src/hmem_cuda_gdrcopy.c#L346-L380 or https://github.com/NVIDIA/gdrcopy/blob/master/src/gdrapi.c#L387-L411. Can you use gdb to tell the exact line where this segfault is triggered?

For GDRCopy, you may want to change https://github.com/NVIDIA/gdrcopy/blob/master/src/Makefile#L29 to -O0 -g so that it is friendlier to gdb. I guess that libfabric has similar compile options somewhere. Alternatively, you can manually instrument the code by adding printf and narrow down where the segfault comes from.

tylerjereddy commented 6 months ago

I'm working on it; the exact nature of the failure is not deterministic, even between trials with the same builds. I'll keep narrowing down at least one of the failure scenarios, and I'll also paste a few more example outputs that looked a little different (they didn't actually segfault, just errored).

```
ERR: mh is not mapped yet
ERR: mh is not mapped yet
ERR: mh is not mapped yet
ERR: mh is not mapped yet
ERR: mh is not mapped yet
ERR: mh is not mapped yet
ERR: error Cannot allocate memory(12) while mapping handle 92f4280, rounded_size=65536 offset=1fe380000
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:nvshmemt_libfabric_rma:517: Received an error when trying to post an RMA operation
```
```
ERR: mh is not mapped yet
ERR: mh is not mapped yet
nid001225:11099:11099 [1] NVSHMEM INFO [5] freeing buf: 0x1477d8b80600
nid001225:11099:11099 [1] NVSHMEM INFO [5] allocated 720 bytes from mspace: 0x1478d46c5400 ptr: 0x1477d8b80600
nid001225:11101:11101 [3] NVSHMEM INFO [7] freeing buf: 0x148fb8b80600
nid001225:11101:11101 [3] NVSHMEM INFO [7] allocated 720 bytes from mspace: 0x1490c7b18400 ptr: 0x148fb8b80600
nid001225:11100:11100 [2] NVSHMEM INFO [6] freeing buf: 0x151cd8b80600
nid001225:11100:11100 [2] NVSHMEM INFO [6] allocated 720 bytes from mspace: 0x151de6ca2400 ptr: 0x151cd8b80600
nid001225:11098:11098 [0] NVSHMEM INFO [4] freeing buf: 0x148a58b80600
nid001225:11098:11098 [0] NVSHMEM INFO [4] allocated 720 bytes from mspace: 0x148b60104400 ptr: 0x148a58b80600
ERR: mh is not mapped yet
ERR: mh is not mapped yet
ERR: error Cannot allocate memory(12) while mapping handle 92f8050, rounded_size=65536 offset=1fe390000
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:nvshmemt_libfabric_rma:517: Received an error when trying to post an RMA operation.
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/include/internal/common/nvshmem_internal.h:nvshmemi_process_multisend_rma:302: aborting due to error in process_channel_dma
```

I see 2^16 (65536) in one of the errors in there. Anyway, I'll try to dig deeper. My prints aren't showing up yet, so there's clearly something I'm not understanding.

tylerjereddy commented 6 months ago

Ah, I completely purged my custom install of gdrcopy and the segfault/backtrace persisted, so it looks like some component in the dependency chain is ignoring the gdrcopy that I ask NVSHMEM to use via GDRCOPY_HOME when I build NVSHMEM from source.

I did confirm that I can see prints from my custom gdrcopy install when running fi_info commands against my custom libfabric build. But something in this backtrace isn't respecting the gdrcopy version I want to use for debugging, since it doesn't matter whether I build NVSHMEM against a different gdrcopy:

```
ERR: mh is not mapped yet
ERR: mh is not mapped yet
[nid001500:113320:0:113395] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x18)
ERR: mh is not mapped yet
ERR: mh is not mapped yet
==== backtrace (tid: 113395) ====
 0 0x00000000000168c0 __funlockfile()  ???:0
 1 0x0000000000001aa7 gdr_unmap()  ???:0
 2 0x0000000000032d92 cuda_gdrcopy_dev_unregister()  :0
 3 0x00000000000a488f cxip_unmap()  :0
 4 0x000000000008c165 cxip_rma_cb()  cxip_rma.c:0
 5 0x00000000000adfe5 cxip_evtq_progress()  :0
 6 0x0000000000081695 cxip_ep_progress()  :0
 7 0x0000000000089f99 cxip_util_cq_progress()  :0
 8 0x000000000004a020 ofi_cq_readfrom()  :0
 9 0x000000000000dff4 nvshmemt_libfabric_progress()  /lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/include/rdma/fi_eq.h:395
10 0x000000000000dff4 nvshmemt_libfabric_progress()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:132
11 0x00000000000e4bad progress_transports()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/proxy/proxy.cpp:963
12 0x00000000000e51b9 progress()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/proxy/proxy.cpp:992
13 0x000000000000a6ea start_thread()  ???:0
14 0x0000000000117a6f __GI___clone()  ???:0
```
pakmarkthub commented 6 months ago

Hi @tylerjereddy ,

I reviewed the NVSHMEM libfabric transport code. It does not use GDRCopy with Slingshot -- at least in NVSHMEM 2.10.1. However, libfabric itself (not NVSHMEM libfabric transport) uses GDRCopy. Based on the backtrace logs you posted, I think NVSHMEM calls into libfabric, which in turn triggers this issue. I think we can ignore NVSHMEM for now.

Guessing from your first comment, you originally ran with GDRCopy v2.3 and then moved to the master branch, right? Do you have root access on your system? Have you reloaded the gdrdrv driver from the master branch? If you have root access, can you enable debugging in the gdrdrv driver? After compiling GDRCopy, you can simply modify https://github.com/NVIDIA/gdrcopy/blob/master/insmod.sh#L28 to set dbg_enabled=1 info_enabled=1 and call sudo ./insmod.sh. Please run sudo dmesg -w on a separate shell. When you run your application and hit a GDRCopy error, you will see more lines in dmesg. Please show me those lines.

ERR:  error Cannot allocate memory(12) while mapping handle 92f4280, rounded_size=65536 offset=1fe380000

This line does not make sense to me. In most cases, the error code should be propagated from the gdrdrv driver, but the driver never returns -ENOMEM (12) in the mmap path, and that phrase can only be printed from mmap inside libgdrapi. One possibility is that ENOMEM is a stale error number from some other code path. Before this line, can you add printf("ERRNO before calling mmap %d\n", errno);? You can also reset errno = 0 before calling mmap.

tylerjereddy commented 6 months ago

I don't have root access; it is a supercomputer at LANL. I could perhaps try forwarding your suggestions to HPC support to see if there's anything they can check.

tylerjereddy commented 6 months ago

Have you reloaded the gdrdrv driver from the master branch?

I think the HPC admins are looking into your comment a bit, but I wanted to check on a few things:

pakmarkthub commented 6 months ago

any risk that some problems arise because I'm building a newer gdrcopy than the driver version available on the HPC machine?

libgdrapi.so and gdrdrv (driver) are forward and backward compatible. Still, there might be some bugs we have fixed in a newer version of gdrdrv. It would be good to use the latest release version.

Your application talks to libgdrapi.so (not directly to gdrdrv). For this one, it is backward compatible only. For example, if you compile with GDRCopy v2.4, we cannot guarantee that your application will work with libgdrapi.so v2.3.

any risk that CXI proper (closed source HPE thing I think?) is somehow associated with an older gdrcopy version and/or driver? Or should LD_LIBRARY_PATH allow me to easily swap gdrcopy versions at runtime irrespective of how CXI was installed and the specific gdrcopy driver version available?

I don't know the answer. Is this a user-space library or a driver? If it is a user-space library, you can probably ldd <lib.so> and see if it links with libgdrapi.so. It is possible that they use dlopen, which is more challenging to detect. If it is a driver, the answer is no: gdrdrv does not export any symbols, so no other drivers can call into gdrdrv.

By the way, you may want to try setting use_persistent_mapping=1 on some systems. This is a gdrdrv module parameter; you set it when you load gdrdrv. I did not suggest this earlier because the issues you encountered were during gdr_map. Without use_persistent_mapping=1, you may run out of GPU BAR1 space, but that error should show up during gdr_pin_buffer or when you call ibv_reg_mr (from the IB stack). So this parameter might be irrelevant, but you can try setting it if you plan to reload gdrdrv to enable the debug mode.

tylerjereddy commented 6 months ago

So, my debug prints were not showing up because prepending my custom gdrcopy builds to LD_LIBRARY_PATH was not enough to override the gdrcopy linked into UCX, which in turn was linked into OpenMPI. That's pretty confusing, but for now, swapping in a UCX at runtime that does not link gdrcopy lets me see prints from the gdrcopy loaded via LD_LIBRARY_PATH.

Anyway, now I should be able to report some better debug prints.

tylerjereddy commented 6 months ago

More detailed debug analysis below, now that I can use a custom gdrcopy build with a lower optimization level and interwoven prints. Keep in mind that the errors are not fully deterministic, which still makes it a little tricky to drill down, but these analyses should at least be deeper than before.

  1. The prints in this first error case run right past the final err block of gdr_unmap, and the backtrace has more info now; it ultimately seems to lead to gdr_unpin_buffer() at https://github.com/NVIDIA/gdrcopy/blob/bb139287bfe4dd2566bc2d422af1a5082e51f353/src/gdrapi.c#L270 (my debug prints shift the line numbering a bit)
Details for backtrace scenario 1
```
gdr_unmap checkpoint 1
gdr_unmap checkpoint 2
gdr_unmap checkpoint 3
gdr_unmap checkpoint 4
gdr_unmap checkpoint 5
gdr_unmap checkpoint 6
gdr_unmap checkpoint 7
[nid001453:60481:0:60481] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x41)
[nid001492:121042:0:121042] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x41)
[nid001453:60479:0:60479] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x41)
[nid001492:121044:0:121044] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x41)
==== backtrace (tid: 60479) ====
 0 0x00000000000168c0 __funlockfile()  ???:0
 1 0x000000000000170b gdr_unpin_buffer()  /lustre/scratch5/treddy/march_april_2024_testing/github_projects/gdrcopy/src/gdrapi.c:271
 2 0x0000000000032da6 cuda_gdrcopy_dev_unregister()  :0
 3 0x00000000000a488f cxip_unmap()  :0
 4 0x000000000008c165 cxip_rma_cb()  cxip_rma.c:0
 5 0x00000000000adfe5 cxip_evtq_progress()  :0
 6 0x0000000000081695 cxip_ep_progress()  :0
 7 0x000000000008b6c9 cxip_cntr_read()  cxip_cntr.c:0
 8 0x000000000000e7d3 nvshmemt_libfabric_quiet()  /lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/include/rdma/fi_eq.h:441
 9 0x00000000000d653a nvshmem_quiet()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/comm/quiet.cpp:51
10 0x000000000004525d nvshmemi_barrier()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:19
11 0x00000000000456b3 nvshmem_barrier_all()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:39
12 0x00000000000456b3 nvshmem_barrier_all()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:41
13 0x0000000000500a66 cufftMpDestroyReshape()  ???:0
14 0x00000000004ff627 cufftMpDestroyReshape()  ???:0
15 0x000000000015893a cufftMpAttachComm()  ???:0
16 0x00000000004e058f cufftMpDestroyReshape()  ???:0
17 0x00000000004e0a85 cufftMpDestroyReshape()  ???:0
18 0x000000000014cb6e cufftMpAttachComm()  ???:0
19 0x000000000011bf4f cufftXtSetCallbackSharedSize()  ???:0
20 0x0000000000147511 cufftXtMakePlanGuru64()  ???:0
21 0x0000000000148105 cufftXtMakePlanMany()  ???:0
22 0x0000000000145d7d cufftMakePlanMany64()  ???:0
23 0x00000000001461bf cufftMakePlanMany()  ???:0
24 0x0000000000146386 cufftMakePlan3d()  ???:0
25 0x0000000000406619 run_r2c_c2r_slabs()  ???:0
26 0x00000000004079c7 main()  ???:0
27 0x000000000003529d __libc_start_main()  ???:0
28 0x000000000040573a _start()  /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sysdeps/x86_64/start.S:120
=================================
```
  2. The second error scenario also runs right through the final err block of gdr_unmap, but this time it seems to hit a line in the closed-source HPE CXI library:
Details of error scenario 2
```
gdr_unmap checkpoint 1
gdr_unmap checkpoint 2
gdr_unmap checkpoint 3
gdr_unmap checkpoint 4
gdr_unmap checkpoint 1
gdr_unmap checkpoint 2
gdr_unmap checkpoint 3
gdr_unmap checkpoint 5
gdr_unmap checkpoint 6
gdr_unmap checkpoint 7
==== backtrace (tid: 75821) ====
 0 0x00000000000168c0 __funlockfile()  ???:0
 1 0x000000000000c59e cxil_unmap()  /workspace/src/github.hpe.com/hpe/hpc-shs-libcxi/WORKSPACE/BUILD/libcxi-0.9/src/libcxi.c:945
 2 0x00000000000a47cb cxip_unmap()  :0
 3 0x000000000008c165 cxip_rma_cb()  cxip_rma.c:0
 4 0x00000000000adfe5 cxip_evtq_progress()  :0
 5 0x0000000000081695 cxip_ep_progress()  :0
 6 0x000000000008b6c9 cxip_cntr_read()  cxip_cntr.c:0
 7 0x000000000000e7d3 nvshmemt_libfabric_quiet()  /lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/include/rdma/fi_eq.h:441
 8 0x00000000000d653a nvshmem_quiet()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/comm/quiet.cpp:51
 9 0x000000000004525d nvshmemi_barrier()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:19
10 0x00000000000456b3 nvshmem_barrier_all()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:39
11 0x00000000000456b3 nvshmem_barrier_all()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:41
12 0x0000000000500a66 cufftMpDestroyReshape()  ???:0
13 0x00000000004ff598 cufftMpDestroyReshape()  ???:0
14 0x000000000015893a cufftMpAttachComm()  ???:0
15 0x00000000004e058f cufftMpDestroyReshape()  ???:0
16 0x00000000004e0a85 cufftMpDestroyReshape()  ???:0
17 0x000000000014cb6e cufftMpAttachComm()  ???:0
18 0x000000000011bf4f cufftXtSetCallbackSharedSize()  ???:0
19 0x0000000000147511 cufftXtMakePlanGuru64()  ???:0
20 0x0000000000148105 cufftXtMakePlanMany()  ???:0
21 0x0000000000145d7d cufftMakePlanMany64()  ???:0
22 0x00000000001461bf cufftMakePlanMany()  ???:0
23 0x0000000000146386 cufftMakePlan3d()  ???:0
24 0x0000000000406619 run_r2c_c2r_slabs()  ???:0
25 0x00000000004079c7 main()  ???:0
26 0x000000000003529d __libc_start_main()  ???:0
27 0x000000000040573a _start()  /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sysdeps/x86_64/start.S:120
```
  3. The third error scenario (in three tries!) also goes through the err block of gdr_unmap, and this time the final line reported in the backtrace is at https://github.com/NVIDIA/gdrcopy/blob/bb139287bfe4dd2566bc2d422af1a5082e51f353/src/gdrapi.c#L396 (again, the debug prints shift the line numbers)
Details of error scenario 3
```
gdr_unmap checkpoint 1
gdr_unmap checkpoint 2
gdr_unmap checkpoint 3
gdr_unmap checkpoint 1
gdr_unmap checkpoint 2
gdr_unmap checkpoint 3
gdr_unmap checkpoint 4
gdr_unmap checkpoint 5
gdr_unmap checkpoint 6
gdr_unmap checkpoint 7
==== backtrace (tid: 76631) ====
 0 0x00000000000168c0 __funlockfile()  ???:0
 1 0x0000000000001caf gdr_unmap()  /lustre/scratch5/treddy/march_april_2024_testing/github_projects/gdrcopy/src/gdrapi.c:403
 2 0x0000000000032d92 cuda_gdrcopy_dev_unregister()  :0
 3 0x00000000000a488f cxip_unmap()  :0
 4 0x000000000008c165 cxip_rma_cb()  cxip_rma.c:0
 5 0x00000000000adfe5 cxip_evtq_progress()  :0
 6 0x0000000000081695 cxip_ep_progress()  :0
 7 0x000000000008b6c9 cxip_cntr_read()  cxip_cntr.c:0
 8 0x000000000000e7d3 nvshmemt_libfabric_quiet()  /lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/include/rdma/fi_eq.h:441
 9 0x00000000000d653a nvshmem_quiet()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/comm/quiet.cpp:51
10 0x000000000004525d nvshmemi_barrier()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:19
11 0x00000000000456b3 nvshmem_barrier_all()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:39
12 0x00000000000456b3 nvshmem_barrier_all()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:41
13 0x0000000000500a66 cufftMpDestroyReshape()  ???:0
14 0x00000000004ff598 cufftMpDestroyReshape()  ???:0
15 0x000000000015893a cufftMpAttachComm()  ???:0
16 0x00000000004e058f cufftMpDestroyReshape()  ???:0
17 0x00000000004e0a85 cufftMpDestroyReshape()  ???:0
18 0x000000000014cb6e cufftMpAttachComm()  ???:0
19 0x000000000011bf4f cufftXtSetCallbackSharedSize()  ???:0
20 0x0000000000147511 cufftXtMakePlanGuru64()  ???:0
21 0x0000000000148105 cufftXtMakePlanMany()  ???:0
22 0x0000000000145d7d cufftMakePlanMany64()  ???:0
23 0x00000000001461bf cufftMakePlanMany()  ???:0
24 0x0000000000146386 cufftMakePlan3d()  ???:0
25 0x0000000000406619 run_r2c_c2r_slabs()  ???:0
26 0x00000000004079c7 main()  ???:0
27 0x000000000003529d __libc_start_main()  ???:0
28 0x000000000040573a _start()  /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sysdeps/x86_64/start.S:120
=================================
```

Does this give you any more traction for diagnosing the problem? While I wait to hear back about the debug driver, is there anything else you'd like me to try here? It also seems like there's a misunderstanding somewhere between UCX + gdrcopy + OpenMPI if my provider is actually CXI; I was originally asked to build OpenMPI linked against UCX + gdrcopy + CUDA.

pakmarkthub commented 6 months ago

Thank you @tylerjereddy. I suspect that you may be running into a race condition from multithreading. GDRCopy, especially libgdrapi.so, is not thread safe. Anyway, I added a global lock to some functions in this branch: https://github.com/NVIDIA/gdrcopy/tree/dev-issue-296-exp. Please try it and see if it helps. You just need to recompile libgdrapi.so and use that; there is no need to install a new gdrdrv driver.

tylerjereddy commented 6 months ago

I still see errors that are not deterministic on that branch (I reduced the optimization level again as well).

```
Hello from rank 5/8 using GPU 1
Hello from rank 4/8 using GPU 0
Hello from rank 7/8 using GPU 3
Hello from rank 6/8 using GPU 2
Hello from rank 2/8 using GPU 2
Hello from rank 3/8 using GPU 3
Hello from rank 1/8 using GPU 1
Hello from rank 0/8 using GPU 0
ERR: Error in pthread_mutex_init with errno=5246
ERR: Error in pthread_mutex_init with errno=5291
ERR: Error in pthread_mutex_init with errno=5336
ERR: Error in pthread_mutex_init with errno=5364
ERR: Error in pthread_mutex_init with errno=5364
```

Note that ERR: Error in pthread_mutex_init with errno=5296 occurs even on some simple fi_info commands now, like fi_info -p cxi.

Hanging also seems more common on this branch, and fi_info commands seem slower. Perhaps not surprising if something isn't quite right with lock acquisition, I suppose?

pakmarkthub commented 6 months ago

Sorry, there was a left-over code block. I just removed it. Please try again.

Note that this is not our final solution. It is just an ad hoc implementation to see if it helps. It might not work if the caller calls a GDRCopy API with a stale memory handle. For example, if they call gdr_close and then gdr_unmap or gdr_unpin_buffer, libgdrapi.so will access a memory handle object that has already been freed.

tylerjereddy commented 6 months ago

Here's the backtrace for the 2-node cuFFTMp reproducer with your updated branch (with optimization level reduced):

```
ERR: mh is not mapped yet
ERR: mh is not mapped yet
ERR: mh is not mapped yet
ERR: mh is not mapped yet
ERR: mh is not mapped yet
ERR: mh is not mapped yet
[nid001237:69241:0:69241] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x41)
[nid001240:34042:0:34042] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x41)
==== backtrace (tid: 69241) ====
 0 0x00000000000168c0 __funlockfile()  ???:0
 1 0x000000000000176c _gdr_unpin_buffer()  /lustre/scratch5/treddy/march_april_2024_testing/github_projects/gdrcopy/src/gdrapi.c:281
 2 0x00000000000017ca gdr_unpin_buffer()  /lustre/scratch5/treddy/march_april_2024_testing/github_projects/gdrcopy/src/gdrapi.c:291
 3 0x0000000000032da6 cuda_gdrcopy_dev_unregister()  :0
 4 0x00000000000a488f cxip_unmap()  :0
 5 0x000000000008c165 cxip_rma_cb()  cxip_rma.c:0
 6 0x00000000000adfe5 cxip_evtq_progress()  :0
 7 0x0000000000081695 cxip_ep_progress()  :0
 8 0x0000000000089f99 cxip_util_cq_progress()  :0
 9 0x000000000004a020 ofi_cq_readfrom()  :0
10 0x000000000000dff4 nvshmemt_libfabric_progress()  /lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/include/rdma/fi_eq.h:395
11 0x000000000000dff4 nvshmemt_libfabric_progress()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:132
12 0x000000000000e7c8 nvshmemt_libfabric_quiet()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:390
13 0x00000000000d653a nvshmem_quiet()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/comm/quiet.cpp:51
14 0x000000000004525d nvshmemi_barrier()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:19
15 0x00000000000456b3 nvshmem_barrier_all()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:39
16 0x00000000000456b3 nvshmem_barrier_all()  /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:41
17 0x0000000000500a66 cufftMpDestroyReshape()  ???:0
18 0x00000000004ff627 cufftMpDestroyReshape()  ???:0
19 0x000000000015893a cufftMpAttachComm()  ???:0
20 0x00000000004e058f cufftMpDestroyReshape()  ???:0
21 0x00000000004e0a85 cufftMpDestroyReshape()  ???:0
22 0x000000000014cb6e cufftMpAttachComm()  ???:0
23 0x000000000011bf4f cufftXtSetCallbackSharedSize()  ???:0
24 0x0000000000147511 cufftXtMakePlanGuru64()  ???:0
25 0x0000000000148105 cufftXtMakePlanMany()  ???:0
26 0x0000000000145d7d cufftMakePlanMany64()  ???:0
27 0x00000000001461bf cufftMakePlanMany()  ???:0
28 0x0000000000146386 cufftMakePlan3d()  ???:0
29 0x0000000000406619 run_r2c_c2r_slabs()  ???:0
30 0x00000000004079c7 main()  ???:0
31 0x000000000003529d __libc_start_main()  ???:0
32 0x000000000040573a _start()  /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sysdeps/x86_64/start.S:120
=================================
```

So the crash seems to happen near here on your new branch, in _gdr_unpin_buffer, while attempting to remove an element from a list:

https://github.com/NVIDIA/gdrcopy/blob/d2299254aff052ec1d29646f90d715589f5e0994/src/gdrapi.c#L281

Now, looking at the special branch of libfabric I'm using: in cuda_gdrcopy_dev_unregister(), which is called just before control flow returns to gdrcopy proper, there are two calls that may be worth asking you about. First there is a call to gdr_unmap(), and right after it a call to gdr_unpin_buffer(); both appear to operate on the same handle/structure member. Here is the permalink to that particular libfabric branch/code block, which I believe I needed for CXI support: https://github.com/thomasgillis/libfabric/blob/10caf878ccacedd2ce907e8e714a9d90d74d63ca/src/hmem_cuda_gdrcopy.c#L359-L368

The situation looks the same in the main branch of libfabric, for that particular block of code: https://github.com/ofiwg/libfabric/blob/f41cea52738da193fd312ce9cf0a1adf23acaa8f/src/hmem_cuda_gdrcopy.c#L359-L368

All of this code sits in a libfabric code block behind a #if ENABLE_GDRCOPY_DLOPEN preprocessor guard (or just after it). I experimented a little with that code block on the cxi-enabled branch of libfabric using the diff below the fold.

```diff
--- a/src/hmem_cuda_gdrcopy.c
+++ b/src/hmem_cuda_gdrcopy.c
@@ -33,6 +33,7 @@
 #if HAVE_CONFIG_H
 #include
+#include
 #endif

 #include "ofi_hmem.h"
@@ -356,26 +357,27 @@ int cuda_gdrcopy_dev_unregister(uint64_t handle)
 	assert(gdrcopy);

 	pthread_spin_lock(&global_gdr_lock);
+	printf("cuda_gdrcopy_dev_unregister checkpoint 1\n");
 	err = global_gdrcopy_ops.gdr_unmap(global_gdr, gdrcopy->mh,
 					   gdrcopy->user_ptr, gdrcopy->length);
+	printf("cuda_gdrcopy_dev_unregister checkpoint 2\n");
 	if (err) {
+		printf("cuda_gdrcopy_dev_unregister checkpoint 2b\n");
 		FI_WARN(&core_prov, FI_LOG_CORE,
 			"gdr_unmap failed! error: %s\n",
 			strerror(err));
 		goto exit;
 	}
+	printf("cuda_gdrcopy_dev_unregister checkpoint 3\n");
+	pthread_spin_unlock(&global_gdr_lock);
+	printf("cuda_gdrcopy_dev_unregister checkpoint 4\n");

-	err = global_gdrcopy_ops.gdr_unpin_buffer(global_gdr, gdrcopy->mh);
-	if (err) {
-		FI_WARN(&core_prov, FI_LOG_MR,
-			"gdr_unmap failed! error: %s\n",
-			strerror(err));
-		goto exit;
-	}
 exit:
+	printf("cuda_gdrcopy_dev_unregister checkpoint 5\n");
 	pthread_spin_unlock(&global_gdr_lock);
 	free(gdrcopy);
+	printf("cuda_gdrcopy_dev_unregister checkpoint 6\n");
 	return err;
 }
```

Although deleting the gdr_unpin_buffer call doesn't actually protect me from the backtraces/problems, print checkpoints 5 and 6 are hit regularly, suggesting that gdr_unmap regularly returns a non-zero error code on the special gdrcopy branch. Is this more helpful? Is there anything in that libfabric code block that could be made safer/better?

pakmarkthub commented 6 months ago

@tylerjereddy Thank you for the additional info. We also call gdr_unpin_buffer inside gdr_close, but I wouldn't expect a segfault in LIST_REMOVE if the call came from there. A few requests:

  1. I added code instrumentation in https://github.com/NVIDIA/gdrcopy/tree/dev-issue-296-exp. Can you try again? Please show me the whole output. If possible, please separate the output from Node 1 and 2.
  2. Have you tested a standalone GDRCopy application? Do gdrcopy_copybw and gdrcopy_sanity work?
tylerjereddy commented 6 months ago

Starting with your second point, the GDRCopy test applications: I used the latest master branch of GDRCopy without modification and the dev-cxi branch of libfabric without modification (I'm guessing libfabric isn't used here, but just to be safe...).

The output is below; the sanity check seems to "pass" but spits out errors?

```
---------- Running gdrcopy_copybw ----------
GPU id:0; name: NVIDIA A100-SXM4-40GB; Bus id: 0000:03:00
GPU id:1; name: NVIDIA A100-SXM4-40GB; Bus id: 0000:41:00
GPU id:2; name: NVIDIA A100-SXM4-40GB; Bus id: 0000:81:00
GPU id:3; name: NVIDIA A100-SXM4-40GB; Bus id: 0000:c1:00
selecting device 0
testing size: 131072
rounded size: 131072
gpu alloc fn: cuMemAlloc
device ptr: 15150b200000
map_d_ptr: 0x15152c97b000
info.va: 15150b200000
info.mapped_size: 131072
info.page_size: 65536
info.mapped: 1
info.wc_mapping: 1
page offset: 0
user-space pointer:0x15152c97b000
writing test, size=131072 offset=0 num_iters=10000
write BW: 18087.3MB/s
reading test, size=131072 offset=0 num_iters=100
read BW: 18.9831MB/s
unmapping buffer
unpinning buffer
closing gdrdrv
---------- End of gdrcopy_copybw ----------
---------- Running gdrcopy_sanity ----------
ERR: error Invalid argument(22) while mapping handle b37fd0, rounded_size=4096 offset=10000
ERR: error Invalid argument(22) while mapping handle b38550, rounded_size=69632 offset=0
ERR: error Permission denied(13) while mapping handle b38640, rounded_size=65536 offset=0
ERR: error Permission denied(13) while mapping handle b38f80, rounded_size=65536 offset=0
ERR: ioctl error (errno=22)
ERR: error Permission denied(13) while mapping handle 3066cbf0, rounded_size=91 offset=0
ERR: error Permission denied(13) while mapping handle 3066cbf0, rounded_size=91 offset=0
ERR: ioctl error (errno=13)
ERR: ioctl error (errno=13)
Total: 28, Passed: 28, Failed: 0, Waived: 0
---------- End of gdrcopy_sanity ----------
```

The modified interactive script for the 2-node test:

```bash
#!/bin/bash -l
#
# setup the runtime environment
#export FI_LOG_LEVEL=debug
#export NVSHMEM_DEBUG=TRACE
export FI_HMEM=cuda
export GDRCOPY_ENABLE_LOGGING=1
# we need special CXI- and CUDA-enabled version of libfabric
# per: https://github.com/ofiwg/libfabric/issues/10001#issuecomment-2078604043
export LD_LIBRARY_PATH="/lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/lib64:$LD_LIBRARY_PATH"
export PATH="/lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/bin:$PATH"
export PATH="$PATH:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/bin"
export PATH="$PATH:/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/bin"
export LD_LIBRARY_PATH="/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/comm_libs/12.3/nccl/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ik4v4abhawveafsjmxd7fqwvhagwh7lw/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ik4v4abhawveafsjmxd7fqwvhagwh7lw/lib/ucx:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/targets/x86_64-linux/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/targets/x86_64-linux/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/usr/projects/hpcsoft/cos2/chicoma/cuda-compat/12.0/:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/pmix-4.2.9-7kfa4s6dwyd5wlayw24vx7jai7d4oi4x/lib"
export NVSHMEM_DISABLE_CUDA_VMM=1
export FI_CXI_OPTIMIZED_MRS=false
export NVSHMEM_REMOTE_TRANSPORT=libfabric
export MPI_HOME=/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install
export CUFFT_LIB=/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib
export CUFFT_INC=/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/include/cufftmp
export NVSHMEM_LIB=/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib
export NVSHMEM_INC=/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/include
export LD_LIBRARY_PATH="/lustre/scratch5/treddy/march_april_2024_testing/gdrcopy_install/lib:$LD_LIBRARY_PATH"

which fi_info
echo "fi_info -l:"
fi_info -l
echo "fi_info -p cxi:"
fi_info -p cxi

#cd /lustre/scratch5/treddy/march_april_2024_testing/github_projects/CUDALibrarySamples/cuFFTMp/samples/r2c_c2r_slabs_GROMACS
#make clean
#make build
#make run

echo "---------- Running gdrcopy_copybw ---------- "
/lustre/scratch5/treddy/march_april_2024_testing/gdrcopy_install/bin/gdrcopy_copybw
echo "---------- End of gdrcopy_copybw ---------- "
echo "---------- Running gdrcopy_sanity ---------- "
/lustre/scratch5/treddy/march_april_2024_testing/gdrcopy_install/bin/gdrcopy_sanity
echo "---------- End of gdrcopy_sanity ---------- "
```
tylerjereddy commented 6 months ago

For the first point, using the latest version of the dev-issue-296-exp gdrcopy branch with the original cuFFTMp 2-node reproducer, this is the raw output:

```
/lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/bin/fi_info
fi_info -l:
===> [12292, 12292] GDRCopy Checkpoint gdr_open: 1
===> [12292, 12292] GDRCopy Checkpoint gdr_open: 1
verbs:
    version: 120.0
cxi:
    version: 0.1
psm3:
    version: 305.1010
ofi_rxd:
    version: 120.0
shm:
    version: 120.0
udp:
    version: 120.0
tcp:
    version: 120.0
ofi_hook_perf:
    version: 120.0
ofi_hook_trace:
    version: 120.0
ofi_hook_debug:
    version: 120.0
ofi_hook_noop:
    version: 120.0
ofi_hook_hmem:
    version: 120.0
ofi_hook_dmabuf_peer_mem:
    version: 120.0
off_coll:
    version: 120.0
sm2:
    version: 120.0
ofi_mrail:
    version: 120.0
===> [12292, 12292] GDRCopy Checkpoint gdr_close: 1
===> [12292, 12292] GDRCopy Checkpoint gdr_close: 4
===> [12292, 12292] GDRCopy Checkpoint gdr_close: 5
===> [12292, 12292] GDRCopy Checkpoint gdr_close: 6
fi_info -p cxi:
===> [12312, 12312] GDRCopy Checkpoint gdr_open: 1
===> [12312, 12312] GDRCopy Checkpoint gdr_open: 1
provider: cxi
    fabric: cxi
    domain: cxi0
    version: 0.1
    type: FI_EP_RDM
    protocol: FI_PROTO_CXI
provider: cxi
    fabric: cxi
    domain: cxi1
    version: 0.1
    type: FI_EP_RDM
    protocol: FI_PROTO_CXI
===> [12312, 12312] GDRCopy Checkpoint gdr_close: 1
===> [12312, 12312] GDRCopy Checkpoint gdr_close: 4
===> [12312, 12312] GDRCopy Checkpoint gdr_close: 5
===> [12312, 12312] GDRCopy Checkpoint gdr_close: 6
rm -rf cufftmp_r2c_c2r_slabs_GROMACS
/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/bin/../bin/nvcc cufftmp_r2c_c2r_slabs_GROMACS.cu -o cufftmp_r2c_c2r_slabs_GROMACS -std=c++17 --generate-code arch=compute_70,code=sm_70 --generate-code arch=compute_80,code=sm_80 --generate-code arch=compute_90,code=sm_90 -I/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/include/cufftmp -I/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/include -I/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/include -lcuda -L/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib -L/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib -lcufftMp -lnvshmem_device -lnvshmem_host -L/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/lib -lmpi
LD_LIBRARY_PATH="/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib:/lustre/scratch5/treddy/march_april_2024_testing/gdrcopy_install/lib:/usr/projects/hpcsoft/cos2/chicoma/cuda-compat/12.0/:lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/targets/x86_64-linux/lib:/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib:lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/targets/x86_64-linux/lib:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ik4v4abhawveafsjmxd7fqwvhagwh7lw/lib/ucx:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ik4v4abhawveafsjmxd7fqwvhagwh7lw/lib:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/comm_libs/12.3/nccl/lib:/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/lib:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/lib64:/lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/lib:/opt/cray/pe/papi/7.0.0.2/lib64:/opt/cray/libfabric/1.15.2.0/lib64:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/pmix-4.2.9-7kfa4s6dwyd5wlayw24vx7jai7d4oi4x/lib" /lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/bin/mpirun -oversubscribe -n 8 -N 4 cufftmp_r2c_c2r_slabs_GROMACS
Hello from rank 2/8 using GPU 2
Hello from rank 1/8 using GPU 1
Hello from rank 3/8 using GPU 3
Hello from rank 0/8 using GPU 0
Hello from rank 6/8 using GPU 2
Hello from rank 7/8 using GPU 3
Hello from rank 5/8 using GPU 1
Hello from rank 4/8 using GPU 0
===> [12410, 12410] GDRCopy Checkpoint gdr_open: 1
===> [12412, 12412] GDRCopy Checkpoint gdr_open: 1
===> [12413, 12413] GDRCopy Checkpoint gdr_open: 1
===> [12411, 12411] GDRCopy Checkpoint gdr_open: 1
===> [12411, 12411] GDRCopy Checkpoint gdr_open: 1
===> [12413, 12413] GDRCopy Checkpoint gdr_open: 1
===> [12412, 12412] GDRCopy Checkpoint gdr_open: 1
===> [12410, 12410] GDRCopy Checkpoint gdr_open: 1
===> [119878, 119878] GDRCopy Checkpoint gdr_open: 1
===> [119876, 119876] GDRCopy Checkpoint gdr_open: 1
===> [119877, 119877] GDRCopy Checkpoint gdr_open: 1
===> [119879, 119879] GDRCopy Checkpoint gdr_open: 1
===> [119878, 119878] GDRCopy Checkpoint gdr_open: 1
===> [119877, 119877] GDRCopy Checkpoint gdr_open: 1
===> [119876, 119876] GDRCopy Checkpoint gdr_open: 1
===> [119879, 119879] GDRCopy Checkpoint gdr_open: 1
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.
===> [119879, 119879] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd5080
===> [119879, 119879] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd5080, ret=0
===> [119876, 119876] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd5c20
===> [119876, 119876] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd5c20, ret=0
===> [12413, 12413] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd5940
===> [12413, 12413] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd5940, ret=0
===> [12410, 12410] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd7200
===> [12410, 12410] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd7200, ret=0
===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd5080
===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd5080, ret=0
===> [12411, 12411] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd5cd0
===> [12411, 12411] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd5cd0, ret=0
===> [119878, 119878] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd54e0
===> [119878, 119878] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd54e0, ret=0
===> [12412, 12412] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd5840
===> [12412, 12412] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd5840, ret=0
===> [119879, 119879] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd5080
===> [119879, 119879] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd5080, ret=0
===> [119876, 119876] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd5c20
===> [119876, 119876] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd5c20, ret=0
===> [12413, 12413] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd5940
===> [12413, 12413] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd5940, ret=0
===> [12410, 12410] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd7200
===> [12410, 12410] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd7200, ret=0
===> [119877, 119877] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd5080
===> [119877, 119877] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd5080, ret=0
===> [12411, 12411] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd5cd0
===> [12411, 12411] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd5cd0, ret=0
===> [119878, 119878] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd54e0
===> [119878, 119878] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd54e0, ret=0
===> [12412, 12412] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd5840
===> [12412, 12412] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd5840, ret=0
===> [119879, 119879] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd5300
===> [119879, 119879] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd5300, ret=0
===> [119876, 119876] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd5ea0
===> [119876, 119876] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd5ea0, ret=0
===> [12413, 12413] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd5bc0
===> [12413, 12413] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd5bc0, ret=0
===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd5300
===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd5300, ret=0
===> [12410, 12410] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd7480
===> [12410, 12410] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd7480, ret=0
===> [119878, 119878] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd5760
===> [119878, 119878] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd5760, ret=0
===> [12411, 12411] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd5f50
===> [12411, 12411] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd5f50, ret=0
===> [119879, 119879] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd5300
===> [119879, 119879] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd5300, ret=0
===> [12412, 12412] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd5ac0
===> [12412, 12412] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd5ac0, ret=0
===> [119876, 119876] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd5ea0
===> [119876, 119876] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd5ea0, ret=0
===> [119877, 119877] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd5300
===> [119877, 119877] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd5300, ret=0
===> [12413, 12413] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd5bc0
===> [12413, 12413] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd5bc0, ret=0
===> [12410, 12410] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd7480
===> [12410, 12410] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd7480, ret=0
===> [119878, 119878] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd5760
===> [119878, 119878] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd5760, ret=0
===> [12411, 12411] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd5f50
===> [12411, 12411] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd5f50, ret=0
===> [12412, 12412] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd5ac0
===> [12412, 12412] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd5ac0, ret=0
===> [12413, 12413] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e03340
===> [12413, 12413] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e03340, ret=0
===> [12413, 12413] GDRCopy Checkpoint gdr_map: 1: mh=0x7e03340
===> [12413, 12413] GDRCopy Checkpoint gdr_map: 2: mh=0x7e03340, ret=0
===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e03340
===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e03340
===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e03340
===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e03340, ret=0
===> [12413, 12465] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e03340
===> [12413, 12465] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e03340
===> [12412, 12412] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e03310
===> [12412, 12412] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e03310, ret=0
===> [12412, 12412] GDRCopy Checkpoint gdr_map: 1: mh=0x7e03310
===> [12412, 12412] GDRCopy Checkpoint gdr_map: 2: mh=0x7e03310, ret=0
===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e03310
===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e03310
===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e03310
===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e03310, ret=0
===> [12412, 12464] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e03310
===> [12412, 12464] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e03310
===> [12410, 12410] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e07540
===> [12410, 12410] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e07540, ret=0
===> [12410, 12410] GDRCopy Checkpoint gdr_map: 1: mh=0x7e07540
===> [12410, 12410] GDRCopy Checkpoint gdr_map: 2: mh=0x7e07540, ret=0
===> [12410, 12467] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e07540
===> [12410, 12467] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e07540
===> [12410, 12467] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e07540
===> [12410, 12467] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e07540, ret=0
===> [12411, 12411] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e03940
===> [12411, 12411] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e03940, ret=0
===> [12411, 12411] GDRCopy Checkpoint gdr_map: 1: mh=0x7e03940
===> [12411, 12411] GDRCopy Checkpoint gdr_map: 2: mh=0x7e03940, ret=0
===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e03940
===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e03940
===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e03940
===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e03940, ret=0
===> [12411, 12466] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e03940
===> [12411, 12466] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e03940
===> [119879, 119879] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e03480
===> [119879, 119879] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e03480, ret=0
===> [119879, 119879] GDRCopy Checkpoint gdr_map: 1: mh=0x7e03480
===> [119879, 119879] GDRCopy Checkpoint gdr_map: 2: mh=0x7e03480, ret=0
===> [119878, 119878] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e03720
===> [119878, 119878] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e03720, ret=0
===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e03480
===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e03480
===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e03480
===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e03480, ret=0
===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e03110
===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e03110, ret=0
===> [119877, 119877] GDRCopy Checkpoint gdr_map: 1: mh=0x7e03110
===> [119877, 119877] GDRCopy Checkpoint gdr_map: 2: mh=0x7e03110, ret=0
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e03110
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e03110
===> [12410, 12467] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e07540
===> [12410, 12467] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e07540
===> [119878, 119878] GDRCopy Checkpoint gdr_map: 1: mh=0x7e03720
===> [119878, 119878] GDRCopy Checkpoint gdr_map: 2: mh=0x7e03720, ret=0
===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e03720
===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e03720
===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e03720
===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e03720, ret=0
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e03110
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e03110, ret=0
===> [119879, 119945] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e03480
===> [119879, 119945] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e03480
===> [119876, 119876] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e04690
===> [119876, 119876] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e04690, ret=0
===> [119876, 119876] GDRCopy Checkpoint gdr_map: 1: mh=0x7e04690
===> [119876, 119876] GDRCopy Checkpoint gdr_map: 2: mh=0x7e04690, ret=0
===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e04690
===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e04690
===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e04690
===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e04690, ret=0
===> [119878, 119944] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e03720
===> [119878, 119944] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e03720
===> [119877, 119943] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e03110
===> [119877, 119943] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e03110
===> [12413, 12413] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1ab90
===> [12413, 12413] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1ab90, ret=0
===> [12413, 12413] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1ab90
===> [12413, 12413] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1ab90, ret=0
===> [12412, 12412] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1ab40
===> [12412, 12412] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1ab40, ret=0
===> [12412, 12412] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1ab40
===> [12412, 12412] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1ab40, ret=0
===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1ab40
===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1ab40
===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1ab90
===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1ab90
===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1ab90
===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1ab90, ret=0
===> [119876, 119942] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e04690
===> [119876, 119942] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e04690
===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1ab40
===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1ab40, ret=0
===> [12413, 12465] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1ab90
===> [12413, 12465] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1ab90
===> [119879, 119879] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1acd0
===> [119879, 119879] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1acd0, ret=0
===> [119879, 119879] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1acd0
===> [119879, 119879] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1acd0, ret=0
===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1acd0
===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1acd0
===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1acd0
===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1acd0, ret=0
===> [12411, 12411] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1af00
===> [12411, 12411] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1af00, ret=0
===> [12411, 12411] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1af00
===> [12411, 12411] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1af00, ret=0
===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1af00
===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1af00
===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1af00
===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1af00, ret=0
===> [119878, 119878] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1af50
===> [119878, 119878] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1af50, ret=0
===> [12412, 12464] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1ab40
===> [12412, 12464] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1ab40
===> [119878, 119878] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1af50
===> [119878, 119878] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1af50, ret=0
ERR: mh is not mapped yet
===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1af50
===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1af50
===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1af50
===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1af50, ret=0
===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1a960
===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1a960, ret=0
===> [119877, 119877] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1a960
===> [119877, 119877] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1a960, ret=0
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1a960
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1a960
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1a960
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1a960, ret=0
===> [119879, 119945] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1acd0
===> [119879, 119945] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1acd0
===> [12411, 12466] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1af00
ERR: mh is not mapped yet
[nid001468:119877:0:119877] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xd8)
===> [12411, 12466] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1af00
===> [12410, 12410] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1ed70
===> [12410, 12410] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1ed70, ret=0
===> [12410, 12410] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1ed70
===> [12410, 12410] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1ed70, ret=0
===> [12410, 12467] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1ed70
===> [12410, 12467] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1ed70
===> [12410, 12467] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1ed70
===> [12410, 12467] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1ed70, ret=0
===> [119878, 119944] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1af50
===> [119878, 119944] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1af50
===> [12413, 12413] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1ac50
===> [12413, 12413] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1ac50, ret=0
===> [12413, 12413] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1ac50
===> [12413, 12413] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1ac50, ret=0
===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1ac50
===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1ac50
===> [12410, 12467] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1ed70
===> [12410, 12467] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1ed70
===> [119876, 119876] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1bec0
===> [119876, 119876] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1bec0, ret=0
===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1ac50
===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1ac50, ret=0
===> [119876, 119876] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1bec0
===> [119876, 119876] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1bec0, ret=0
===> [119877, 119943] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1a960
===> [119877, 119943] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1a960
===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1bec0
===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1bec0
===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1bec0
===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1bec0, ret=0
===> [12413, 12465] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1ac50
===> [12413, 12465] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1ac50
===> [119876, 119942] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1bec0
===> [119876, 119942] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1bec0
===> [12412, 12412] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1ac00
===> [12412, 12412] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1ac00, ret=0
===> [12412, 12412] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1ac00
===> [12412, 12412] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1ac00, ret=0
===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1ac00
===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1ac00
===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1ac00
===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1ac00, ret=0
===> [119879, 119879] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1ad90
===> [119879, 119879] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1ad90, ret=0
===> [119879, 119879] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1ad90
===> [119879, 119879] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1ad90, ret=0
===> [12411, 12411] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1afc0
===> [12411, 12411] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1afc0, ret=0
===> [12411, 12411] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1afc0
===> [12411, 12411] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1afc0, ret=0
===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1ad90
===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1ad90
===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1afc0
===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1afc0
===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1afc0
===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1afc0, ret=0
===> [12412, 12464] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1ac00
===> [12412, 12464] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1ac00
===> [119878, 119878] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1b010
===> [119878, 119878] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1b010, ret=0
===> [119878, 119878] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1b010
===> [119878, 119878] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1b010, ret=0
===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1ad90
===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1ad90, ret=0
===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1b010
===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1b010
===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1b010
===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1b010, ret=0
===> [119879, 119945] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1ad90
===> [119879, 119945] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1ad90
===> [12411, 12466] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1afc0
===> [12411, 12466] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1afc0
===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1aa20
===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1aa20, ret=0
===> [119877, 119877] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1aa20
===> [119877, 119877] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1aa20, ret=0
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1aa20
===> [119877,
119943] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1aa20 ===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1aa20 ===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1aa20, ret=0 ===> [12410, 12410] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1ee30 ===> [12410, 12410] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1ee30, ret=0 ===> [12410, 12410] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1ee30 ===> [12410, 12410] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1ee30, ret=0 ===> [12410, 12467] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1ee30 ===> [12410, 12467] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1ee30 ===> [12410, 12467] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1ee30 ===> [12410, 12467] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1ee30, ret=0 ===> [12413, 12413] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1acb0 ===> [12413, 12413] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1acb0, ret=0 ===> [12413, 12413] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1acb0 ===> [12413, 12413] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1acb0, ret=0 ===> [119878, 119944] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1b010 ===> [119878, 119944] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1b010 ===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1acb0 ===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1acb0 ===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1acb0 ===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1acb0, ret=0 ===> [119876, 119876] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1bf80 ===> [119876, 119876] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1bf80, ret=0 ===> [119876, 119876] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1bf80 ===> [119876, 119876] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1bf80, ret=0 ===> [12410, 12467] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1ee30 ===> [12410, 12467] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1ee30 ===> [119877, 119943] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1aa20 ===> [119877, 119943] 
GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1aa20 ===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1bf80 ===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1bf80 ===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1bf80 ===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1bf80, ret=0 ===> [12412, 12412] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1ac60 ===> [12412, 12412] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1ac60, ret=0 ===> [12412, 12412] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1ac60 ===> [12412, 12412] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1ac60, ret=0 ===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1ac60 ===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1ac60 ===> [12413, 12465] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1acb0 ===> [12413, 12465] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1acb0 ===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1ac60 ===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1ac60, ret=0 ===> [119876, 119942] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1bf80 ===> [119876, 119942] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1bf80 ===> [12412, 12464] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1ac60 ===> [12412, 12464] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1ac60 ===> [119879, 119879] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1adf0 ===> [119879, 119879] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1adf0, ret=0 ===> [119879, 119879] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1adf0 ===> [119879, 119879] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1adf0, ret=0 ===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1adf0 ===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1adf0 ===> [12411, 12411] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1b020 ===> [12411, 12411] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1b020, ret=0 ===> [12411, 12411] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1b020 ===> [12411, 12411] 
GDRCopy Checkpoint gdr_map: 2: mh=0x7e1b020, ret=0 ===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1b020 ===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1b020 ===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1b020 ===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1b020, ret=0 ===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1adf0 ===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1adf0, ret=0 ===> [12410, 12410] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1ee90 ===> [12410, 12410] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1ee90, ret=0 ===> [12410, 12410] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1ee90 ===> [12410, 12410] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1ee90, ret=0 ===> [119878, 119878] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1b070 ===> [119878, 119878] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1b070, ret=0 ===> [12410, 12410] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1ee90 ===> [12410, 12410] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1ee90 ===> [12410, 12410] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1ee90 ===> [12410, 12410] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1ee90, ret=0 ===> [119878, 119878] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1b070 ===> [119878, 119878] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1b070, ret=0 ===> [12411, 12466] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1b020 ===> [12411, 12466] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1b020 ===> [119879, 119945] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1adf0 ===> [119879, 119945] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1adf0 ===> [119879, 119879] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1ad90 ===> [119879, 119879] GDRCopy Checkpoint gdr_unmap: 2: mh=0x7e1ad90 ===> [119879, 119879] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1ad90, ret=22 ===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1b070 ===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1b070 ===> [119878, 119944] GDRCopy 
Checkpoint gdr_unmap: 5: mh=0x7e1b070 ===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1b070, ret=0 ===> [12410, 12410] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1ee90 ===> [12410, 12410] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1ee90 ===> [119878, 119944] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1b070 ===> [119878, 119944] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1b070 ===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1aa80 ===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1aa80, ret=0 ===> [119877, 119877] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1aa80 ===> [119877, 119877] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1aa80, ret=0 ===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1aa80 ===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1aa80 ===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1aa80 ===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1aa80, ret=0 ===> [119876, 119876] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1bfe0 ===> [119876, 119876] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1bfe0, ret=0 ===> [119876, 119876] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1bfe0 ===> [119876, 119876] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1bfe0, ret=0 ===> [119877, 119943] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1aa80 ===> [119877, 119943] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1aa80 ===> [119877, 119877] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1aa20 ===> [119877, 119877] GDRCopy Checkpoint gdr_unmap: 2: mh=0x7e1aa20 ===> [119877, 119877] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1aa20, ret=22 ===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1bfe0 ===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1bfe0 ===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1bfe0 ===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1bfe0, ret=0 ===> [119876, 119942] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1bfe0 ===> 
[119876, 119942] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1bfe0 ==== backtrace (tid: 119877) ==== 0 0x00000000000168c0 __funlockfile() ???:0 1 0x000000000000c59e cxil_unmap() /workspace/src/github.hpe.com/hpe/hpc-shs-libcxi/WORKSPACE/BUILD/libcxi-0.9/src/libcxi.c:945 2 0x00000000000a47cb cxip_unmap() :0 3 0x000000000008c165 cxip_rma_cb() cxip_rma.c:0 4 0x00000000000adfe5 cxip_evtq_progress() :0 5 0x0000000000081695 cxip_ep_progress() :0 6 0x000000000008b6c9 cxip_cntr_read() cxip_cntr.c:0 7 0x000000000000e7d3 nvshmemt_libfabric_quiet() /lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/include/rdma/fi_eq.h:441 8 0x00000000000d653a nvshmem_quiet() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/comm/quiet.cpp:51 9 0x000000000004525d nvshmemi_barrier() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:19 10 0x00000000000456b3 nvshmem_barrier_all() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:39 11 0x00000000000456b3 nvshmem_barrier_all() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:41 12 0x0000000000500a66 cufftMpDestroyReshape() ???:0 13 0x00000000004ff598 cufftMpDestroyReshape() ???:0 14 0x000000000015893a cufftMpAttachComm() ???:0 15 0x00000000004e058f cufftMpDestroyReshape() ???:0 16 0x00000000004e0a85 cufftMpDestroyReshape() ???:0 17 0x000000000014cb6e cufftMpAttachComm() ???:0 18 0x000000000011bf4f cufftXtSetCallbackSharedSize() ???:0 19 0x0000000000147511 cufftXtMakePlanGuru64() ???:0 20 0x0000000000148105 cufftXtMakePlanMany() ???:0 21 0x0000000000145d7d cufftMakePlanMany64() ???:0 22 0x00000000001461bf cufftMakePlanMany() ???:0 23 0x0000000000146386 cufftMakePlan3d() ???:0 24 0x00000000004066b2 run_r2c_c2r_slabs() ???:0 25 0x0000000000407d4b main() ???:0 26 
0x000000000003529d __libc_start_main() ???:0 27 0x00000000004057ea _start() /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sysdeps/x86_64/start.S:120 ================================= -------------------------------------------------------------------------- This help section is empty because PRRTE was built without Sphinx. -------------------------------------------------------------------------- make: *** [Makefile:18: run] Error 139 ```

After that, I did a bit more digging. First, I added another print in _gdr_unpin_buffer after the free() call, since your print after the LIST_REMOVE did show up in the failure scenario I pasted above.

Sample diff:

```diff
--- a/src/gdrapi.c
+++ b/src/gdrapi.c
@@ -302,6 +302,7 @@ static int _gdr_unpin_buffer(gdr_t g, gdr_mh_t handle)
     LIST_REMOVE(mh, entries);
     printf("===> [%d, %d] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=%p\n", getpid(), gettid(), mh);
     free(mh);
+    printf("===> [%d, %d] GDRCopy Checkpoint _gdr_unpin_buffer: 3\n", getpid(), gettid());
     return ret;
 }
```

On top of that, per the request to separate the output by node, I made a few more changes to the source to prefix the hostname in each of the prints. These changes are available on my fork of gdrcopy (https://github.com/tylerjereddy/gdrcopy) on feature branch treddy_gh_296 (just builds a few commits on top of your branch).

Now, when I run the 2-node cuFFTMp reproducer with that version of gdrcopy, I see a double free or corruption (out) error, apparently at the free() call in _gdr_unpin_buffer(). Full log: out_improved_prints.txt

That would be consistent with your original instrumented code as well: the list removal "succeeds," but the free fails inside _gdr_unpin_buffer. I think you were already worried about a double free somewhere above.

I ran the reproducer two more times, however, and this was not always the case: sometimes we get the printf after the free operation in _gdr_unpin_buffer() and the backtrace only happens after that:

``` ===> [nid001196, 36860, 36913] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e21190 ===> [nid001468, 5905, 5958] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e21530 ===> [nid001468, 5905, 5958] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e21530 ===> [nid001468, 5905, 5958] GDRCopy Checkpoint _gdr_unpin_buffer: 3 ===> [nid001196, 36860, 36913] GDRCopy Checkpoint _gdr_unpin_buffer: 3 ===> [nid001196, 36860, 36860] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e21190 ===> [nid001196, 36860, 36860] GDRCopy Checkpoint gdr_unmap: 2: mh=0x7e21190 ===> [nid001468, 5906, 5956] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1d8f0 ===> [nid001468, 5906, 5956] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1d8f0 ===> [nid001196, 36860, 36860] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e21190, ret=22 ===> [nid001468, 5906, 5956] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1d8f0 ===> [nid001468, 5906, 5956] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1d8f0, ret=0 ===> [nid001468, 5906, 5956] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1d8f0 ===> [nid001468, 5906, 5956] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1d8f0 ===> [nid001468, 5906, 5956] GDRCopy Checkpoint _gdr_unpin_buffer: 3 ==== backtrace (tid: 36860) ==== 0 0x00000000000168c0 __funlockfile() ???:0 1 0x000000000000c59b cxil_unmap() /workspace/src/github.hpe.com/hpe/hpc-shs-libcxi/WORKSPACE/BUILD/libcxi-0.9/src/libcxi.c:945 2 0x00000000000a47cb cxip_unmap() :0 3 0x000000000008c165 cxip_rma_cb() cxip_rma.c:0 4 0x00000000000adfe5 cxip_evtq_progress() :0 5 0x0000000000081695 cxip_ep_progress() :0 6 0x0000000000089f99 cxip_util_cq_progress() :0 7 0x000000000004a020 ofi_cq_readfrom() :0 8 0x000000000000dff4 nvshmemt_libfabric_progress() /lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/include/rdma/fi_eq.h:395 9 0x000000000000dff4 nvshmemt_libfabric_progress() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:132 10 
0x000000000000e7c8 nvshmemt_libfabric_quiet() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:390 11 0x00000000000d653a nvshmem_quiet() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/comm/quiet.cpp:51 12 0x000000000004525d nvshmemi_barrier() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:19 13 0x00000000000456b3 nvshmem_barrier_all() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:39 14 0x00000000000456b3 nvshmem_barrier_all() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:41 15 0x0000000000500a66 cufftMpDestroyReshape() ???:0 16 0x00000000004ff598 cufftMpDestroyReshape() ???:0 17 0x000000000015893a cufftMpAttachComm() ???:0 18 0x00000000004e058f cufftMpDestroyReshape() ???:0 19 0x00000000004e0a85 cufftMpDestroyReshape() ???:0 20 0x000000000014cb6e cufftMpAttachComm() ???:0 21 0x000000000011bf4f cufftXtSetCallbackSharedSize() ???:0 22 0x0000000000147511 cufftXtMakePlanGuru64() ???:0 23 0x0000000000148105 cufftXtMakePlanMany() ???:0 24 0x0000000000145d7d cufftMakePlanMany64() ???:0 25 0x00000000001461bf cufftMakePlanMany() ???:0 26 0x0000000000146386 cufftMakePlan3d() ???:0 27 0x00000000004066b2 run_r2c_c2r_slabs() ???:0 28 0x0000000000407d4b main() ???:0 29 0x000000000003529d __libc_start_main() ???:0 30 0x00000000004057ea _start() /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sysdeps/x86_64/start.S:120 ================================= ```

Of course, things are not fully deterministic, and I saw the double free error happening in what appears to be other parts of the control flow as well:

``` ===> [nid001196, 37924, 37924] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e34350, ret=0 ===> [nid001196, 37924, 37924] GDRCopy Checkpoint gdr_map: 1: mh=0x7e34350 ===> [nid001196, 37924, 37924] GDRCopy Checkpoint gdr_map: 2: mh=0x7e34350, ret=0 ===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e3a2c0 ===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e3a2c0, ret=0 ===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_map: 1: mh=0x7e3a2c0 ===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_map: 2: mh=0x7e3a2c0, ret=0 ===> [nid001196, 37927, 37927] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e312a0 ===> [nid001196, 37927, 37927] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e312a0, ret=0 ===> [nid001196, 37927, 37927] GDRCopy Checkpoint gdr_map: 1: mh=0x7e312a0 ===> [nid001196, 37927, 37927] GDRCopy Checkpoint gdr_map: 2: mh=0x7e312a0, ret=0 ===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e3b450 ===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e3b450, ret=0 ===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_map: 1: mh=0x7e3b450 ===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_map: 2: mh=0x7e3b450, ret=0 ===> [nid001196, 37924, 37924] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e354e0 ===> [nid001196, 37924, 37924] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e354e0, ret=0 ===> [nid001196, 37924, 37924] GDRCopy Checkpoint gdr_map: 1: mh=0x7e354e0 ===> [nid001196, 37924, 37924] GDRCopy Checkpoint gdr_map: 2: mh=0x7e354e0, ret=0 ===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e3c5e0 ===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e3c5e0, ret=0 ===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_map: 1: mh=0x7e3c5e0 ===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_map: 2: mh=0x7e3c5e0, ret=0 ===> [nid001196, 37927, 37927] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e32430 ===> [nid001196, 37927, 37927] GDRCopy Checkpoint 
gdr_pin_buffer: 4: mh=0x7e32430, ret=0 ===> [nid001196, 37927, 37927] GDRCopy Checkpoint gdr_map: 1: mh=0x7e32430 ===> [nid001196, 37927, 37927] GDRCopy Checkpoint gdr_map: 2: mh=0x7e32430, ret=0 ===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e3d770 double free or corruption (out) [nid001468:06925] *** Process received signal *** [nid001468:06925] Signal: Aborted (6) [nid001468:06925] Signal code: (-6) ===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e3d770, ret=0 ===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_map: 1: mh=0x7e3d770 ===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_map: 2: mh=0x7e3d770, ret=0 ===> [nid001196, 37924, 37924] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e36670 ===> [nid001196, 37924, 37924] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e36670, ret=0 ```

I'm guessing your team has already run the code through an address sanitizer at some point? This is confusing! What can I do next to help get to the bottom of it?

pakmarkthub commented 6 months ago

There are multiple things that went wrong here. Let's start with the raw output from my instrumented code without your patch.

  1. The output from the instrumented code is in [pid, tid] format. I think the caller uses multiple threads here. I didn't see any locking when I reviewed the libfabric code, so we are probably seeing some racing. However, the experimental branch you are using should not have this problem because I added a global lock there. So, I will set aside racing inside GDRCopy for now.

  2. Let's group together some lines from the same process that reported the segfault. They should be in chronological order.

```
===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1aa20
===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1aa20, ret=0
===> [119877, 119877] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1aa20
===> [119877, 119877] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1aa20, ret=0
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1aa20
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1aa20
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1aa20
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1aa20, ret=0
===> [119877, 119943] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1aa20
===> [119877, 119943] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1aa20
===> [119877, 119877] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1aa20
===> [119877, 119877] GDRCopy Checkpoint gdr_unmap: 2: mh=0x7e1aa20
===> [119877, 119877] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1aa20, ret=22
==== backtrace (tid: 119877) ====
 0 0x00000000000168c0 __funlockfile()  ???:0
 1 0x000000000000c59e cxil_unmap()  /workspace/src/github.hpe.com/hpe/hpc-shs-libcxi/WORKSPACE/BUILD/libcxi-0.9/src/libcxi.c:945
 2 0x00000000000a47cb cxip_unmap()  :0
```

As shown, all of these calls were dealing with the same mh object, based on its address. The caller called gdr_unmap twice on the same object! What made it worse is that the second gdr_unmap came after _gdr_unpin_buffer, so the mh object had already been destroyed before that last gdr_unmap ran. Note that mh is translated directly from the handle that the caller passes to the GDRCopy API. Basically, this is a use-after-free problem inside the caller.

  3. Now, let's look at your https://github.com/NVIDIA/gdrcopy/files/15312969/out_improved_prints.txt.
```
===> [nid001196, 34247, 34247] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1bf60
===> [nid001196, 34247, 34247] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1bf60
===> [nid001196, 34247, 34247] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1bf60
===> [nid001196, 34247, 34247] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1bf60, ret=0
...
===> [nid001196, 34247, 34247] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x14c509661a60
===> [nid001196, 34247, 34247] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x14c509661a60
double free or corruption (out)
[nid001196:34247] *** Process received signal ***
[nid001196:34247] Signal: Aborted (6)
[nid001196:34247] Signal code:  (-6)
```

The mh address that the caller passed to _gdr_unpin_buffer was 0x14c509661a60. That is in a completely different memory region from the other mh objects. In fact, I cannot find a single line printed from gdr_pin_buffer showing that mh=0x14c509661a60 was ever created by GDRCopy. The caller probably passed in an unrelated object here.

tylerjereddy commented 6 months ago

So, suspicion would be on the libfabric side, but not as far up the control flow as NVSHMEM?

pakmarkthub commented 6 months ago

IIUC, NVSHMEM does not use GDRCopy directly in that environment. I don't know the libfabric programming model. Is it thread safe? Does it require special handling from the libfabric caller (NVSHMEM in this case)? My suggestion is to move up the stack one step at a time. Items 2 and 3 are clearly mistakes by GDRCopy's caller. Even if we made GDRCopy thread safe, you would still run into this segfault.

tylerjereddy commented 6 months ago

I think I've found evidence that a spin lock in libfabric does not have the right guards before unmapping and unpinning; see the issue cross-listed above. In my sample patch there, I had to paste in a bunch of gdrcopy internals to check whether the unmapping had already happened. I still got crashes, but that particular pathology did seem to disappear once I hacked that in...

pakmarkthub commented 6 months ago

Looking at the log you posted in the libfabric issue 10041, you have

```
[112670, 112670] cuda_gdrcopy_dev_unregister() checkpoint 2 after spin lock and before unmap gdrcopy->mh=(nil)
===> [nid001252, 112670, 112670] GDRCopy Checkpoint gdr_unmap: 1: mh=(nil)
...
==== backtrace (tid: 112670) ====
 0 0x00000000000168c0 __funlockfile()  ???:0
 1 0x00000000000023fb gdr_unmap()  /lustre/scratch5/treddy/march_april_2024_testing/github_projects/gdrcopy/src/gdrapi.c:459
 2 0x0000000000032e33 cuda_gdrcopy_dev_unregister()  :0
 3 0x00000000000a4bed cxip_unmap()  :0
...
```

So, libfabric passes NULL to gdr_unmap. That is likely the source of your segfault.

tylerjereddy commented 6 months ago

I think we agree on that, though I wasn't convinced that guarding against it was sufficient to fix all the problems, since I saw other backtraces even after that path was protected. I'm hoping to make another push at getting this working soon.