cornelisnetworks / opa-psm2


Unified virtual memory failure in multi-GPU box #41

Closed paboyle closed 3 years ago

paboyle commented 4 years ago

See:

https://github.com/open-mpi/ompi/issues/6799

I'm running OpenMPI 4.0.1 self compiled over Omnipath with IFS 10.8, as distributed by Intel.

The boards are

HPE XA with four NVIDIA Volta V100 GPUs and four OPA 100 Gb ports on two dual-port PCIe HFI cards. The good news is that MPI appears to work between nodes when these buffers are sent from explicit device memory.

However when I run four MPI ranks per node and ensure that communications between ranks use unified virtual memory (UVM) allocated with cudaMallocManaged(), I get a failure:

Benchmark_dwf: CUDA failure: cuIpcGetMemHandle() (at /nfs/site/home/phcvs2/gitrepo/ifs-all/Ofed_Delta/rpmbuild/BUILD/libpsm2-11.2.23/ptl_am/am_reqrep_shmem.c:1977)returned 1 
Error returned from CUDA function.

When I run with a patch to the code to use explicit host memory the code succeeds. However, I want to be able to run these buffers from UVM and have loops with either host or device execution policy fill them, as that is how the code was designed to operate.

In the meantime I have tracked this all the way back through PSM2 to a bug in CUDA (I recompiled PSM2 from the released 10.8 source and inserted printf debugging; a lovely 24h...)

Source:

https://github.com/intel/opa-psm2/blob/816c0dbdf911dba097dcbb09f023c5113713c33e/ptl_am/am_reqrep_shmem.c#L1973

I edited psm_user.h and instrumented _psmi_is_cuda_mem.

It appears that a bug in cuPointerGetAttribute causes PSM2 to interpret UVM pointers as device pointers and to try to use CUDA IPC to communicate intra-node, with subsequent failure.

A simple sixteen-line example that fails: ———————

#include <cuda.h>
#include <cuda_runtime_api.h>
#include <cassert>
#include <cstdint>
#include <cstdio>
int main(int argc, char **argv)
{
 unsigned long bytes = 1024*1024;
 void *ptr;
 auto err = cudaMallocManaged((void **)&ptr, bytes);
 assert(err == cudaSuccess);
 CUmemorytype mt;
 auto perr = cuPointerGetAttribute(&mt, CU_POINTER_ATTRIBUTE_MEMORY_TYPE, (CUdeviceptr)ptr);
 assert(perr == CUDA_SUCCESS); // driver API: CUDA_SUCCESS, not cudaSuccess
 printf("alignedAllocator %lx %d\n", (uint64_t)ptr, mt); fflush(stdout);
 assert(mt == CU_MEMORYTYPE_UNIFIED); // fails: CUDA reports CU_MEMORYTYPE_DEVICE
}

CUDA is supposed to return one of:

CU_MEMORYTYPE_HOST = 0x01 Host memory 
CU_MEMORYTYPE_DEVICE = 0x02 Device memory 
CU_MEMORYTYPE_ARRAY = 0x03 Array memory 
CU_MEMORYTYPE_UNIFIED = 0x04 Unified device or host memory

But it instead incorrectly reports unified memory as device memory, causing PSM2 to do bad things: thinking it can use CUDA IPC and then failing as above.

—————— CUDA 9.2 —————— nvcc simple.cc -o simple.x -lcuda

./simple.x 
alignedAllocator 7fff88000000 2
simple.x: simple.cc:17: int main(int, char**): Assertion `mt == CU_MEMORYTYPE_UNIFIED' failed.
Aborted

This causes the test in PSM2 to interpret UVM pointers as CU_MEMORYTYPE_DEVICE and to try, and fail, to use CUDA IPC on UVM.

—————— CUDA 9.1 .. Officially supported with IFS 10.8 ——————

./simple.x 
alignedAllocator 7fff88000000 2
simple.x: simple.cc:17: int main(int, char**): Assertion `mt == CU_MEMORYTYPE_UNIFIED' failed.
Aborted

—————— CUDA 10.1 ——————

./simple.x 
simple.x: simple.cc:12: int main(int, char**): Assertion `err == cudaSuccess' failed.
Aborted

This behaves differently: it returns an error, but still does not produce CU_MEMORYTYPE_UNIFIED => a bug in the current version of CUDA.

However, the absence of a cudaSuccess return will probably make PSM2 work. Except that on our supercomputer system the kernel driver in IFS 10.8 does not support CUDA 10.1, failing with:

CUDA driver version is insufficient for CUDA runtime version

I will report this to the CUDA team and try to get the kernel driver updated, which requires persuading a reluctant vendor that this is a genuine support need.

Even the CUDA 10.1 version does not return CU_MEMORYTYPE_UNIFIED and instead returns an invalid-argument error, so it is still buggy; but my reading of the PSM2 source is that this is now a benign bug, since the absence of a success return will stop PSM2 being tricked by CUDA.

paboyle commented 4 years ago

Feedback from NVIDIA:

They said:

"The documentation for cuPointerGetAttribute is here

https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__UNIFIED.html#group__CUDA__UNIFIED_1g0c28ed0aff848042bc0533110e45820c

When testing for the CU_POINTER_ATTRIBUTE_MEMORY_TYPE, the possible values for data are CU_MEMORYTYPE_HOST and CU_MEMORYTYPE_DEVICE only. The other two options are never returned by this call.

A different call to cuPointerGetAttribute is needed to determine if a pointer points to managed memory. Testing for the CU_POINTER_ATTRIBUTE_IS_MANAGED, sets data to either 0 or 1. "

Thus, I think they are saying that _psmi_is_cuda_mem in psm_user.h is being used to indicate that CUDA IPC can be used, but it false-triggers when the memory is managed, according to the CUDA documentation.

However, I disagree with them:

I have read the documentation, and it is actually unclear what the defined behaviour is for unified memory pointers.

In fact, in places it seems to be missing a few words saying which pointer type is being talked about.

"Returns in *data the physical memory type of the memory that ptr addresses as a CUmemorytype enumerated value."

This can reasonably be interpreted to mean that unified memory should return CU_MEMORYTYPE_UNIFIED, on account of it being the unified memory type.

It seems very odd indeed to have CU_MEMORYTYPE_UNIFIED and a MEMORY_TYPE query function that is unable to return some of its values.

Whatever happens: the combination of CUDA and PSM2 has a bug until this is fixed and the usage agreed upon.

paboyle commented 4 years ago

The following appears to fix the issue (psm_user.h).

PSMI_ALWAYS_INLINE(
int
_psmi_is_cuda_mem(void *ptr))
{
        CUresult cres;
        CUmemorytype mt;
        unsigned uvm;

        /* First query the physical memory type of the pointer. */
        cres = psmi_cuPointerGetAttribute(&mt, CU_POINTER_ATTRIBUTE_MEMORY_TYPE, (CUdeviceptr) ptr);

        if ((cres == CUDA_SUCCESS) && (mt == CU_MEMORYTYPE_DEVICE)) {

          /* Device-resident: only treat it as CUDA memory if it is NOT
           * managed, since CUDA IPC does not work on UVM allocations. */
          cres = psmi_cuPointerGetAttribute(&uvm, CU_POINTER_ATTRIBUTE_IS_MANAGED, (CUdeviceptr) ptr);

          if ((cres == CUDA_SUCCESS) && (uvm == 0)) return 1;
          else                                      return 0;

        } else {

          return 0;

        }
}

Personally, I think cuPointerGetAttribute looks to be in the wrong, though.

paboyle commented 4 years ago

The CUDA 10.1 execution of the code was run on a system with the CUDA 9.2 kernel driver. It is possible (Tim Lanfear produced the same output with CUDA 9.2 and 10.1) that the 10.1 behaviour will match 9.2 once the kernel driver is updated.

This is on a centrally run supercomputer, so I can't update kernel drivers to check.

paboyle commented 4 years ago

I received the following from NVIDIA; they confirm this is the defined behaviour of cuPointerGetAttribute. Perhaps even worse, I think this means the returned memory type will vary page by page for UVM, according to where the data currently resides:

The behaviour of cuPointerGetAttribute() with the CU_POINTER_ATTRIBUTE_MEMORY_TYPE attribute is that it will return the physical memory type of the memory that the pointer addresses. With a pointer to managed memory, at any given time this could be host memory or device memory depending on where the data is mapped at that time. Putting it another way, ptr is a virtual address backed by some physical storage, and CU_POINTER_ATTRIBUTE_MEMORY_TYPE allows you to find where the pointer is at this moment in time.

While the documentation is supposed to make it clear that only these two values can be returned (if cuPointerGetAttribute() is successful), I do think it could be improved, especially since reusing the CUmemorytype enum might suggest otherwise.

I'm not familiar with PSM2, but from the context I gather it is going to use cudaIpcGetMemHandle() (or driver API equivalent) and then pass the returned pointer to another process. Unfortunately, the IPC API is not supported for managed memory allocations. That's because in the source process the CUDA Runtime is able to intercept page faults on the allocation and handle them, but the receiving process does not know that the pointer is managed by the CUDA driver and so tries to use the normal kernel page fault handler, which doesn't know about the GPU (*).

Without knowing the rest of the code, I'm not sure if your proposed fix would work – it depends what PSM2 does if it determines that the pointer is "not a CUDA pointer."

It may also be helpful to clarify the difference between "unified" and "managed" memory. Unified Memory is the ability for a virtual memory allocation to be physically located in either device or host memory, and for page faults from CPU or GPU to be handled to access the correct memory or migrate the page as appropriate (e.g. based on heuristics). Managed Memory is unified memory that is managed by the CUDA driver rather than the Linux kernel.

(*) As you may know, we have been working with the Linux community on the Heterogeneous Memory Manager (HMM) in the Linux kernel. HMM allows the Linux kernel to understand about pointers in non-conventional memory (e.g. GPU memory) which means that we no longer need "managed memory" since the memory can be handled by the kernel instead of by the CUDA driver. With HMM, you would be able to use malloc() instead of cudaMallocManaged() and you should be able to pass the pointer via IPC and have it behave correctly.

mwheinz commented 3 years ago

Reviewing old issues - it appears that Adam submitted a pair of patches for this back in 2019.

@paboyle - I know it's been a ridiculously long time but do you know if this was fixed in more recent IFS releases or is this still a problem for you?

mwheinz commented 3 years ago

Pete has agreed that PSM2 was patched to correct the issue.