UCX incompatible with CUDA.jl memory pool

Chiil commented 2 years ago

I am not sure whether this is a MPI.jl issue or something from our local supercomputer, but I have a failing Alltoall in my Julia code, whereas the identical code in C++ works, showing that the problem does not lie in our MPI or CUDA install. I do not really know how to proceed from here. I got excellent help in making sure that the libraries are set up correctly at https://discourse.julialang.org/t/cuda-aware-mpi-works-on-system-but-not-for-julia/75060, but the problem remains The error is:

[1642966632.503811] [gcn21:3820255:0]    gdr_copy_md.c:122  UCX  ERROR gdr_pin_buffer failed. length :65536 ret:22

signal (11): Segmentation fault
in expression starting at /gpfs/scratch1/shared/chiel/MicroHH.jl/test/alltoall_test.jl:32
[1642966632.504314] [gcn21:3820254:0]    gdr_copy_md.c:122  UCX  ERROR gdr_pin_buffer failed. length :65536 ret:22

signal (11): Segmentation fault
in expression starting at /gpfs/scratch1/shared/chiel/MicroHH.jl/test/alltoall_test.jl:32
uct_gdr_copy_mkey_pack at /tmp/jenkins/build/UCXCUDA/1.10.0/GCCcore-10.3.0-CUDA-11.3.1/ucx-1.10.0/src/uct/cuda/gdr_copy/gdr_copy_md.c:68
uct_gdr_copy_mkey_pack at /tmp/jenkins/build/UCXCUDA/1.10.0/GCCcore-10.3.0-CUDA-11.3.1/ucx-1.10.0/src/uct/cuda/gdr_copy/gdr_copy_md.c:68
ucp_mem_type_pack at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/dt/dt.c:87
ucp_dt_pack at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/dt/dt.c:123
ucp_mem_type_pack at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/dt/dt.c:87
ucp_dt_pack at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/dt/dt.c:123
ucp_tag_pack_eager_common at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/tag/eager_snd.c:31 [inlined]
ucp_tag_pack_eager_only_dt at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/tag/eager_snd.c:44
ucp_tag_pack_eager_common at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/tag/eager_snd.c:31 [inlined]
ucp_tag_pack_eager_only_dt at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/tag/eager_snd.c:44
uct_mm_ep_am_common_send at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/uct/sm/mm/base/mm_ep.c:292 [inlined]
uct_mm_ep_am_bcopy at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/uct/sm/mm/base/mm_ep.c:353
uct_ep_am_bcopy at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/uct/api/uct.h:2650 [inlined]
ucp_do_am_bcopy_single at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/proto/proto_am.inl:37 [inlined]
ucp_tag_eager_bcopy_single at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/tag/eager_snd.c:133
uct_mm_ep_am_common_send at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/uct/sm/mm/base/mm_ep.c:292 [inlined]
uct_mm_ep_am_bcopy at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/uct/sm/mm/base/mm_ep.c:353
uct_ep_am_bcopy at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/uct/api/uct.h:2650 [inlined]
ucp_do_am_bcopy_single at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/proto/proto_am.inl:37 [inlined]
ucp_tag_eager_bcopy_single at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/tag/eager_snd.c:133
ucp_request_try_send at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/core/ucp_request.inl:242 [inlined]
ucp_request_send at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/core/ucp_request.inl:267 [inlined]
ucp_tag_send_req at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/tag/tag_send.c:116 [inlined]
ucp_tag_send_nbx at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/tag/tag_send.c:296
mca_pml_ucx_send at /home/chiel/.local/easybuild/Centos8/2021/software/OpenMPI/4.1.1-GCC-10.3.0-CUDA-11.3.1/lib/openmpi/mca_pml_ucx.so (unknown line)
ompi_coll_base_sendrecv_actual at /home/chiel/.local/easybuild/Centos8/2021/software/OpenMPI/4.1.1-GCC-10.3.0-CUDA-11.3.1/lib/libmpi.so (unknown line)
ompi_coll_base_alltoall_intra_pairwise at /home/chiel/.local/easybuild/Centos8/2021/software/OpenMPI/4.1.1-GCC-10.3.0-CUDA-11.3.1/lib/libmpi.so (unknown line)
ompi_coll_tuned_alltoall_intra_dec_fixed at /home/chiel/.local/easybuild/Centos8/2021/software/OpenMPI/4.1.1-GCC-10.3.0-CUDA-11.3.1/lib/openmpi/mca_coll_tuned.so (unknown line)
MPI_Alltoall at /home/chiel/.local/easybuild/Centos8/2021/software/OpenMPI/4.1.1-GCC-10.3.0-CUDA-11.3.1/lib/libmpi.so (unknown line)
Alltoall! at /home/chiel/.julia/packages/MPI/08SPr/src/collective.jl:480
unknown function (ip: 0x14eefc4fc48f)
ucp_request_try_send at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/core/ucp_request.inl:242 [inlined]
ucp_request_send at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/core/ucp_request.inl:267 [inlined]
ucp_tag_send_req at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/tag/tag_send.c:116 [inlined]
ucp_tag_send_nbx at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/tag/tag_send.c:296
mca_pml_ucx_send at /home/chiel/.local/easybuild/Centos8/2021/software/OpenMPI/4.1.1-GCC-10.3.0-CUDA-11.3.1/lib/openmpi/mca_pml_ucx.so (unknown line)
ompi_coll_base_sendrecv_actual at /home/chiel/.local/easybuild/Centos8/2021/software/OpenMPI/4.1.1-GCC-10.3.0-CUDA-11.3.1/lib/libmpi.so (unknown line)
ompi_coll_base_alltoall_intra_pairwise at /home/chiel/.local/easybuild/Centos8/2021/software/OpenMPI/4.1.1-GCC-10.3.0-CUDA-11.3.1/lib/libmpi.so (unknown line)
ompi_coll_tuned_alltoall_intra_dec_fixed at /home/chiel/.local/easybuild/Centos8/2021/software/OpenMPI/4.1.1-GCC-10.3.0-CUDA-11.3.1/lib/openmpi/mca_coll_tuned.so (unknown line)
MPI_Alltoall at /home/chiel/.local/easybuild/Centos8/2021/software/OpenMPI/4.1.1-GCC-10.3.0-CUDA-11.3.1/lib/libmpi.so (unknown line)

The code that triggers this error is:

using CUDA
using MPI

np = 2

MPI.Init()

comm = MPI.COMM_WORLD
mpiid = MPI.Comm_rank(comm)
print("The MPI rank is: $mpiid\n")

device!(mpiid)
cuda_bb = ENV["JULIA_CUDA_USE_BINARYBUILDER"]
print("The CUDA device is: $(device()), JULIA_CUDA_USE_BINARYBUILDER is $cuda_bb\n")

n = 1024
data_cpu = rand(n)
data_out_cpu = similar(data_cpu)
data = CuArray(data_cpu)
data_out = similar(data)

# Test the alltoall on the CPU
mpi_data_cpu = MPI.UBuffer(data_cpu, 512)
mpi_data_out_cpu = MPI.UBuffer(data_out_cpu, 512)
@time MPI.Alltoall!(mpi_data_cpu, mpi_data_out_cpu, comm)
@time MPI.Alltoall!(mpi_data_cpu, mpi_data_out_cpu, comm)

# Test the alltoall on the GPU
print("$mpiid has CUDA: $(MPI.has_cuda())\n")
mpi_data = MPI.UBuffer(data, 512)
mpi_data_out = MPI.UBuffer(data_out, 512)
@time MPI.Alltoall!(mpi_data, mpi_data_out, comm)
@time MPI.Alltoall!(mpi_data, mpi_data_out, comm)

# Close the MPI.
MPI.Finalize()

The equivalent working C++ code is:

#include <iostream>
#include <vector>
#include <mpi.h>
#include <cuda_runtime_api.h>
#include <cuda.h>
#include <chrono>

int main()
{
    MPI_Init(NULL, NULL);

    int n, id;
    MPI_Comm_size(MPI_COMM_WORLD, &n);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);

    const size_t size_tot = 1024*1024*1024;
    const size_t size_max = size_tot / n;

    // CPU TEST
    std::vector<double> a_cpu_in (size_tot);
    std::vector<double> a_cpu_out(size_tot);
    std::fill(a_cpu_in.begin(), a_cpu_in.end(), id);

    std::cout << id << ": Starting CPU all-to-all\n";
    auto time_start = std::chrono::high_resolution_clock::now();
    MPI_Alltoall(
            a_cpu_in .data(), size_max, MPI_DOUBLE,
            a_cpu_out.data(), size_max, MPI_DOUBLE,
            MPI_COMM_WORLD);
    auto time_end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration<double, std::milli>(time_end-time_start).count();
    std::cout << id << ": Finished CPU all-to-all in " << std::to_string(duration) << " (ms)\n";

    // GPU TEST
    int id_local = id % 4;
    cudaSetDevice(id_local);
    double* a_gpu_in;
    double* a_gpu_out;
    cudaMalloc((void **)&a_gpu_in , size_tot * sizeof(double));
    cudaMalloc((void **)&a_gpu_out, size_tot * sizeof(double));
    cudaMemcpy(a_gpu_in, a_cpu_in.data(), size_tot*sizeof(double), cudaMemcpyHostToDevice);

    int id_gpu;
    cudaGetDevice(&id_gpu);
    std::cout << id << ", " << id_local << ", " << id_gpu << ": Starting GPU all-to-all\n";
    time_start = std::chrono::high_resolution_clock::now();
    MPI_Alltoall(
            a_gpu_in , size_max, MPI_DOUBLE,
            a_gpu_out, size_max, MPI_DOUBLE,
            MPI_COMM_WORLD);
    time_end = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration<double, std::milli>(time_end-time_start).count();

    std::cout << id << ", " << id_local << ", " << id_gpu << ": Finished GPU all-to-all in " << std::to_string(duration) << " (ms)\n";

    MPI_Finalize();
    return 0;
}

Chiil commented 2 years ago

In the discussion on Discourse somebody suggested to use export JULIA_CUDA_MEMORY_POOL=none and this solves the problem. I do not know though whether this is a bug, because it would be great if the pool and the CUDA-aware MPI can be combined.

simonbyrne commented 2 years ago

Have you tried https://juliaparallel.github.io/MPI.jl/stable/knownissues/#Memory-cache

Chiil commented 2 years ago

Yes, I tried that as well. Only the export JULIA_CUDA_MEMORY_POOL=none solves my problems.

simonbyrne commented 2 years ago

Ah ok, upstream issue is https://github.com/openucx/ucx/issues/7110

JuliaParallel / MPI.jl

UCX incompatible with CUDA.jl memory pool #532