dmlc / dgl

Python package built to ease deep learning on graphs, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

pin memory problem in dgx-a100 #4388

Open zqj2333 opened 2 years ago

zqj2333 commented 2 years ago

❓ Questions and Help

I used a DGX-A100 (8×A100) to train GraphSAGE with UnifiedTensor, but something seems to go wrong. I have thought about it a lot but cannot figure out the cause or how to fix it. What is the reason for this problem, and how can I solve it?
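For reference, a minimal sketch of the pattern I am using (the tensor shape and names below are placeholders, not my real data):

import dgl
import torch

# Placeholder node-feature tensor standing in for the real dataset.
feat = torch.randn(10_000, 128)
# Wrapping the CPU tensor for UVA access from the GPU; constructing the
# UnifiedTensor is where the pinning (and the error) happens.
feat_uva = dgl.contrib.UnifiedTensor(feat, device=torch.device('cuda:0'))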

zqj2333 commented 2 years ago

If source code could help, I would send you my source code. Thanks a lot.

kkranen commented 2 years ago

I believe I've encountered this error. In my case, the issue was that DGL was attempting to pin a subgraph in which the edge list for one of the relation types was empty.
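As a rough sketch (the toy heterograph below is only an illustration, not the actual failing subgraph), you can list the relation types that ended up with zero edges before pinning:

import dgl
import torch

# Toy heterograph where one relation type ('likes') has zero edges,
# mimicking a sampled subgraph with an empty edge list.
g = dgl.heterograph(
    {
        ('user', 'follows', 'user'): (torch.tensor([0, 1]), torch.tensor([1, 2])),
        ('user', 'likes', 'item'): (torch.tensor([], dtype=torch.int64),
                                    torch.tensor([], dtype=torch.int64)),
    },
    num_nodes_dict={'user': 3, 'item': 2},
)
# Relation types with no edges are the ones to watch out for when pinning.
empty_etypes = [etype for etype in g.canonical_etypes if g.num_edges(etype) == 0]
print("relation types with no edges:", empty_etypes)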

yaox12 commented 2 years ago

Can you provide env information such as:

  • DGL Version (e.g., 1.0):
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3):
  • OS (e.g., Linux):
  • How you installed DGL (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version (if applicable):
  • GPU models and configuration (e.g. V100):
  • Any other relevant information:

zqj2333 commented 2 years ago

> Can you provide env information such as:
>
>   • DGL Version (e.g., 1.0):
>   • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3):
>   • OS (e.g., Linux):
>   • How you installed DGL (conda, pip, source):
>   • Build command you used (if compiling from source):
>   • Python version:
>   • CUDA/cuDNN version (if applicable):
>   • GPU models and configuration (e.g. V100):
>   • Any other relevant information:

Thanks for your reply.

yaox12 commented 2 years ago

After some investigation, I think this issue is caused by the IOMMU being enabled in the OS. See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#iommu-on-linux for more details.
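If you want to check the IOMMU state quickly, here is a rough sketch (Linux-only assumption; on most distributions a populated /sys/kernel/iommu_groups means the IOMMU is enabled):

import os

# On most Linux systems, /sys/kernel/iommu_groups contains one entry per
# IOMMU group when the IOMMU is enabled; empty or missing suggests it is off.
iommu_dir = "/sys/kernel/iommu_groups"
groups = os.listdir(iommu_dir) if os.path.isdir(iommu_dir) else []
print("IOMMU groups found:", len(groups))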

To verify it, you can run the following code to see if the error still happens.

import torch

# Small CPU tensor in shared memory, mirroring how DGL stores features
# for multi-process training.
x = torch.arange(10).reshape(5, 2)
x.share_memory_()

# Pin (page-lock) the tensor directly via the CUDA runtime, the same
# underlying call DGL uses.
cudart = torch.cuda.cudart()
r = cudart.cudaHostRegister(x.data_ptr(), x.numel() * x.element_size(), 0)

assert x.is_shared()
assert x.is_pinned()

zqj2333 commented 2 years ago

> After some investigation, I think this issue is caused by the IOMMU being enabled in the OS. See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#iommu-on-linux for more details.
>
> To verify it, you can run the following code to see if the error still happens.
>
> import torch
> x = torch.arange(10).reshape(5, 2)
> x.share_memory_()
> cudart = torch.cuda.cudart()
> r = cudart.cudaHostRegister(x.data_ptr(), x.numel() * x.element_size(), 0)
> assert x.is_shared()
> assert x.is_pinned()

From this link, it seems that I should disable the IOMMU. I have tested this code and there is no error, so which API in this code disables the IOMMU? By the way, when I train GraphSAGE with a small graph, there is no error, but when I train with a large graph, the above error happens. It seems to be related to size. Is there a size limitation on pinned memory?

yaox12 commented 2 years ago

The code doesn't disable IOMMU. I just want to check if the problem is caused by DGL or not. This PyTorch code calls the same underlying CUDA API as DGL does.

> By the way, when I train GraphSAGE with a small graph, there is no error.

If training with small graphs works well, it shouldn't be the IOMMU issue.

> But when I train with a large graph, the above error happens. It seems to be related to size. Is there a size limitation on pinned memory?

The amount of pinned memory cannot exceed the physical CPU RAM. How big is your data? Since you are using UnifiedTensor, can you try replacing dgl.contrib.UnifiedTensor(x, ...) with cudart.cudaHostRegister(x.data_ptr(), x.numel() * x.element_size(), 0) and see if the error still happens?
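As a rough sanity check (Linux assumption; the tensor below is only a stand-in for your real features), you can compare the tensor's byte size against the physical RAM before pinning:

import os
import torch

# Total physical RAM on a Linux host.
total_ram = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
# Stand-in tensor for the real node features.
train_nfeat = torch.randn(1_000_000, 128)
needed = train_nfeat.numel() * train_nfeat.element_size()
print(f"tensor needs {needed / 1e9:.2f} GB, physical RAM is {total_ram / 1e9:.2f} GB")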

zqj2333 commented 2 years ago

I attempted to use https://github.com/yaox12/dgl/blob/uva_sampling/examples/pytorch/graphsage/train_sampling_multi_gpu.py to train on ogbn-papers100M, and I found that the error happens between lines 263 and 268. The full error is:

Process Process-5:
Traceback (most recent call last):
  File "/root/anaconda3/envs/dgl/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/root/anaconda3/envs/dgl/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/anaconda3/envs/dgl/lib/python3.8/site-packages/dgl/multiprocessing/pytorch.py", line 33, in decorated_function
    raise exception.__class__(trace)
dgl._ffi.base.DGLError: Traceback (most recent call last):
  File "/root/anaconda3/envs/dgl/lib/python3.8/site-packages/dgl/multiprocessing/pytorch.py", line 21, in _queue_result
    res = func(*args, **kwargs)
  File "train_sampling_multi_gpu.py", line 74, in run
    train_nfeat = dgl.contrib.UnifiedTensor(train_nfeat, device=device)
  File "/root/anaconda3/envs/dgl/lib/python3.8/site-packages/dgl/contrib/unified_tensor.py", line 78, in __init__
    self._array.pin_memory_()
  File "/root/anaconda3/envs/dgl/lib/python3.8/site-packages/dgl/_ffi/ndarray.py", line 322, in pin_memory_
    check_call(_LIB.DGLArrayPinData(self.handle))
  File "/root/anaconda3/envs/dgl/lib/python3.8/site-packages/dgl/_ffi/base.py", line 65, in check_call
    raise DGLError(py_str(_LIB.DGLGetLastError()))
dgl._ffi.base.DGLError: [04:20:34] /opt/dgl/src/runtime/cuda/cuda_device_api.cc:183: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: OS call failed or operation not supported on this OS
Stack trace:
  [bt] (0) /root/anaconda3/envs/dgl/lib/python3.8/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7fb88691beaf]
  [bt] (1) /root/anaconda3/envs/dgl/lib/python3.8/site-packages/dgl/libdgl.so(dgl::runtime::CUDADeviceAPI::PinData(void*, unsigned long)+0xb4) [0x7fb886df1814]
  [bt] (2) /root/anaconda3/envs/dgl/lib/python3.8/site-packages/dgl/libdgl.so(dgl::runtime::NDArray::PinData(DLTensor*)+0x16f) [0x7fb886c6658f]
  [bt] (3) /root/anaconda3/envs/dgl/lib/python3.8/site-packages/dgl/libdgl.so(DGLArrayPinData+0x6) [0x7fb886c66606]
  [bt] (4) /root/anaconda3/envs/dgl/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7fba4776e9dd]
  [bt] (5) /root/anaconda3/envs/dgl/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7fba4776e067]
  [bt] (6) /root/anaconda3/envs/dgl/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7fba477871e9]
  [bt] (7) /root/anaconda3/envs/dgl/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x13c95) [0x7fba47787c95]
  [bt] (8) python(_PyObject_MakeTpCall+0x3bf) [0x556b05cfd13f]

for all processes.

yaox12 commented 2 years ago

Cannot reproduce... I ran with python train_sampling_multi_gpu.py --gpu 0,1 --dataset ogbn-papers100M --data-device uva and it works well. Can you change the following two lines

        train_nfeat = dgl.contrib.UnifiedTensor(train_nfeat, device=device)
        train_labels = dgl.contrib.UnifiedTensor(train_labels, device=device)

to

        cudart = th.cuda.cudart()
        cudart.cudaHostRegister(train_nfeat.data_ptr(),
            train_nfeat.numel() * train_nfeat.element_size(), 0)
        cudart.cudaHostRegister(train_labels.data_ptr(),
            train_labels.numel() * train_labels.element_size(), 0)

It will be a bit slower, but it can help us figure out whether the error is caused by DGL or the OS.
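If the registration itself fails, checking the return code explicitly should show which tensor trips it up. A rough sketch (assuming import torch as th as in the example script, and using torch.cuda.check_error to raise on a non-success code):

        cudart = th.cuda.cudart()
        # Raise a readable CudaError if host registration (pinning) fails,
        # so we can see which tensor and which CUDA error are involved.
        th.cuda.check_error(cudart.cudaHostRegister(
            train_nfeat.data_ptr(), train_nfeat.numel() * train_nfeat.element_size(), 0))
        th.cuda.check_error(cudart.cudaHostRegister(
            train_labels.data_ptr(), train_labels.numel() * train_labels.element_size(), 0))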

zqj2333 commented 2 years ago

> Cannot reproduce... I ran with python train_sampling_multi_gpu.py --gpu 0,1 --dataset ogbn-papers100M --data-device uva and it works well. Can you change the following two lines
>
>         train_nfeat = dgl.contrib.UnifiedTensor(train_nfeat, device=device)
>         train_labels = dgl.contrib.UnifiedTensor(train_labels, device=device)
>
> to
>
>         cudart = th.cuda.cudart()
>         cudart.cudaHostRegister(train_nfeat.data_ptr(),
>             train_nfeat.numel() * train_nfeat.element_size(), 0)
>         cudart.cudaHostRegister(train_labels.data_ptr(),
>             train_labels.numel() * train_labels.element_size(), 0)
>
> It will be a bit slower, but it can help us figure out whether the error is caused by DGL or the OS.

Hello, after I replaced the code there is no error, so it seems that something is wrong in DGL. By the way, when I run on a V100 there is no error either, which confuses me.

yaox12 commented 2 years ago

Do you have ideas on this issue? @nv-dlasalle @davidmin7

zqj2333 commented 2 years ago

> Cannot reproduce... I ran with python train_sampling_multi_gpu.py --gpu 0,1 --dataset ogbn-papers100M --data-device uva and it works well. Can you change the following two lines
>
>         train_nfeat = dgl.contrib.UnifiedTensor(train_nfeat, device=device)
>         train_labels = dgl.contrib.UnifiedTensor(train_labels, device=device)
>
> to
>
>         cudart = th.cuda.cudart()
>         cudart.cudaHostRegister(train_nfeat.data_ptr(),
>             train_nfeat.numel() * train_nfeat.element_size(), 0)
>         cudart.cudaHostRegister(train_labels.data_ptr(),
>             train_labels.numel() * train_labels.element_size(), 0)
>
> It will be a bit slower, but it can help us figure out whether the error is caused by DGL or the OS.

Hello~ Could you give me a Docker image that is able to run python train_sampling_multi_gpu.py --gpu 0,1 --dataset ogbn-papers100M --data-device uva on an A100?