dmlc / dgl

Python package built to ease deep learning on graphs, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

[Bug][Dataloader] Device assertion error in multi-gpu runs when num_worker>0 #5526

Open chang-l opened 1 year ago

chang-l commented 1 year ago

🐛 Bug

In the multi-GPU examples, with the current code base, if we set num_workers > 1 in non-UVA mode, DGL crashes with a device-side assertion error.

To Reproduce

  1. Slightly modify the example multigpu/multi_gpu_node_classification.py for non-UVA mode with num_workers=4 (see the sketch below).
  2. Run python multi_gpu_node_classification.py --gpu 0,1,2,3
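
For concreteness, the modification amounts to roughly the following sketch (the names g, train_idx, sampler, and device come from the example's train() function, and the sampler fanouts are illustrative; the exact diff is linked later in this thread):

from dgl.dataloading import DataLoader, NeighborSampler

# Sketch of the non-UVA setup: the graph and the index tensors stay on CPU,
# sampling runs in dataloader worker processes, and minibatches are copied to
# the training GPU given by `device`.
sampler = NeighborSampler([10, 10, 10])  # fanouts are illustrative
train_dataloader = DataLoader(
    g,                # graph on CPU
    train_idx,        # indices on CPU (not moved to the GPU)
    sampler,
    device=device,    # minibatches end up on the training GPU
    batch_size=1024,
    shuffle=True,
    drop_last=False,
    num_workers=4,    # CPU sampling workers; the example defaults to 0
    use_ddp=True,
    use_uva=False,    # non-UVA mode
)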

Error message and stack trace:

/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [31,0,0] Assertion `t >= 0 && t < n_classes` failed.
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f8ac03031bc in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7f8ac02c90ea in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3cc (0x7f8ac03912ac in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x16873 (0x7f8ac0360873 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x249a6 (0x7f8ac036e9a6 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x4fe99a (0x7f8b044d099a in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)

Expected behavior

The example should run to completion without errors.

Environment

Additional context

Checking with the profiler, I think the assertion error might be due to excessive overlap between the compute stream (the default stream) and the prefetching stream. Also, the crash goes away with use_alternate_streams=False.
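
A possible temporary workaround based only on that observation (not a proper fix) is to disable the side stream when constructing the DataLoader; the keyword is use_alternate_streams in recent DGL releases, and the remaining arguments mirror the example:

# Workaround sketch: run feature prefetching on the default stream instead of
# a separate CUDA stream. Other arguments mirror the example's DataLoader call.
train_dataloader = DataLoader(
    g,
    train_idx,
    sampler,
    device=device,
    batch_size=1024,
    shuffle=True,
    drop_last=False,
    num_workers=4,
    use_ddp=True,
    use_uva=False,
    use_alternate_streams=False,  # avoid overlapping prefetch and compute streams
)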

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you

Rhett-Ying commented 1 year ago

@chang-l I cannot reproduce the error you hit. Could you share what exactly you changed in multi_gpu_node_classification.py?

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/ubuntu/workspace/dgl_0/examples/pytorch/multigpu/multi_gpu_node_classification.py", line 201, in run
    train(
  File "/home/ubuntu/workspace/dgl_0/examples/pytorch/multigpu/multi_gpu_node_classification.py", line 127, in train
    train_dataloader = DataLoader(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/dgl/dataloading/dataloader.py", line 940, in __init__
    raise ValueError(
ValueError: Expect graph and indices to be on the same device when use_uva=False.

chang-l commented 1 year ago

Thanks for picking it up @Rhett-Ying. You need to make sure train_idx and val_idx are on CPU, i.e., comment out the following two lines https://github.com/dmlc/dgl/blob/41baa0e4483b493e74c4dee6dc67abf6d120a1cc/examples/pytorch/multigpu/multi_gpu_node_classification.py#L190-L191 while setting num_workers > 0. This is CPU sampling (with num_workers) + multi-GPU training. Please let me know if you can reproduce it with the current version.

Rhett-Ying commented 1 year ago

@chang-l Below is what I changed, and it crashed with a different error from yours.

python3 examples/pytorch/multigpu/multi_gpu_node_classification.py --gpu 0
ValueError: num_workers must be 0 if UVA sampling is enabled.
--- a/examples/pytorch/multigpu/multi_gpu_node_classification.py
+++ b/examples/pytorch/multigpu/multi_gpu_node_classification.py
@@ -132,7 +132,7 @@ def train(
         batch_size=1024,
         shuffle=True,
         drop_last=False,
-        num_workers=0,
+        num_workers=4,
         use_ddp=True,
         use_uva=use_uva,
     )
@@ -187,8 +187,8 @@ def run(proc_id, nprocs, devices, g, data, mode):
         rank=proc_id,
     )
     num_classes, train_idx, val_idx, test_idx = data
-    train_idx = train_idx.to(device)
-    val_idx = val_idx.to(device)
+    #train_idx = train_idx.to(device)
+    #val_idx = val_idx.to(device)
     g = g.to(device if mode == "puregpu" else "cpu")

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you

Rhett-Ying commented 1 year ago

Hi @chang-l, do you have any comments on https://github.com/dmlc/dgl/issues/5526#issuecomment-1552331872? And do we need to prioritize this issue?

chang-l commented 1 year ago

Sorry, this issue fell off my radar... I will give it another try this week with the latest code base and post the results here. Thanks @Rhett-Ying!

chang-l commented 1 year ago

Yes, the issue is still there and we need to fix it. Here is the file diff to run the example in non-UVA mode (CPU sampling and feature fetching): https://gist.github.com/chang-l/63aa5beb79ec94bbccbd1aea07ec37b3. I ran it with the command python multi_gpu_node_classification.py --gpu 0,1,2,3 on 4xA100.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you