chang-l opened this issue 1 year ago
This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you
@chang-l I cannot reproduce the error you hit. Could you share what exactly you changed in multi_gpu_node_classification.py?
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/home/ubuntu/workspace/dgl_0/examples/pytorch/multigpu/multi_gpu_node_classification.py", line 201, in run
train(
File "/home/ubuntu/workspace/dgl_0/examples/pytorch/multigpu/multi_gpu_node_classification.py", line 127, in train
train_dataloader = DataLoader(
File "/home/ubuntu/.local/lib/python3.8/site-packages/dgl/dataloading/dataloader.py", line 940, in __init__
raise ValueError(
ValueError: Expect graph and indices to be on the same device when use_uva=False.
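For context, the ValueError above comes from a device-consistency check in dgl.dataloading.DataLoader.__init__. The sketch below paraphrases that condition rather than quoting the DGL source.

# Paraphrased sketch of the check in dgl.dataloading.DataLoader.__init__:
# without UVA, the graph and the seed-node indices must already live on the
# same device, otherwise the loader refuses to start.
if not use_uva and indices.device != graph.device:
    raise ValueError(
        "Expect graph and indices to be on the same device when use_uva=False."
    )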
Thanks for picking it up @Rhett-Ying. You need to make sure train_idx and val_idx are on CPU, i.e., comment out the following two lines: https://github.com/dmlc/dgl/blob/41baa0e4483b493e74c4dee6dc67abf6d120a1cc/examples/pytorch/multigpu/multi_gpu_node_classification.py#L190-L191 while setting num_workers > 0. This is CPU sampling (with num_workers) + multi-GPU training. Please let me know if you can reproduce it with the current version.
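To make the intended non-UVA setup concrete, here is a minimal sketch of the DataLoader configuration for CPU sampling plus multi-GPU training. The fan-outs and the helper name are illustrative and not the exact contents of the example script.

# Minimal sketch, assuming DGL >= 0.9 style dgl.dataloading APIs.
from dgl.dataloading import DataLoader, NeighborSampler

def make_train_dataloader(g, train_idx, device):
    # Graph and seed indices stay on CPU so worker processes can sample;
    # the loader moves only the sampled blocks to the training GPU.
    sampler = NeighborSampler([10, 10, 10])  # illustrative fan-outs
    return DataLoader(
        g,                  # CPU graph
        train_idx,          # CPU indices: do NOT call .to(device) here
        sampler,
        device=device,      # output blocks land on this GPU
        use_ddp=True,
        use_uva=False,      # CPU sampling and feature fetching
        batch_size=1024,
        shuffle=True,
        drop_last=False,
        num_workers=4,      # >0 is only valid when use_uva=False
    )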
@chang-l Below is what I changed, and it crashed with a different error from yours.
python3 examples/pytorch/multigpu/multi_gpu_node_classification.py --gpu 0
ValueError: num_workers must be 0 if UVA sampling is enabled.
--- a/examples/pytorch/multigpu/multi_gpu_node_classification.py
+++ b/examples/pytorch/multigpu/multi_gpu_node_classification.py
@@ -132,7 +132,7 @@ def train(
batch_size=1024,
shuffle=True,
drop_last=False,
- num_workers=0,
+ num_workers=4,
use_ddp=True,
use_uva=use_uva,
)
@@ -187,8 +187,8 @@ def run(proc_id, nprocs, devices, g, data, mode):
rank=proc_id,
)
num_classes, train_idx, val_idx, test_idx = data
- train_idx = train_idx.to(device)
- val_idx = val_idx.to(device)
+ #train_idx = train_idx.to(device)
+ #val_idx = val_idx.to(device)
g = g.to(device if mode == "puregpu" else "cpu")
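A likely reason this change alone hits a different error: with --gpu 0 the example still runs with UVA enabled (use_uva appears to be derived from the run mode, something like use_uva = mode == "mixed", which seems to be the default), and DGL rejects num_workers > 0 whenever UVA sampling is on. Reproducing the original report needs both knobs flipped together, roughly:

use_uva = False    # force CPU sampling / feature fetching
num_workers = 4    # CPU sampling workers
# and train_idx / val_idx must stay on CPU, i.e. the two .to(device) calls
# commented out in the second hunk above.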
This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you
Hi @chang-l Do you have any comments on https://github.com/dmlc/dgl/issues/5526#issuecomment-1552331872? And do we need to prioritize this issue?
Sorry, this issue fell off my radar... I will give it another try this week with the latest code base and post the results here. Thanks @Rhett-Ying!
Yes, the issue is still there and we need to fix it. Here is the file diff to run the example in non-UVA mode (CPU sampling and feature fetching): https://gist.github.com/chang-l/63aa5beb79ec94bbccbd1aea07ec37b3. I ran it on 4x A100 with the command:
python multi_gpu_node_classification.py --gpu 0,1,2,3
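For reference, this is roughly how the example fans out one process per GPU listed in --gpu. The sketch below is assumed from the torch.multiprocessing.spawn frames in the traceback above and the run(...) signature in the diff, not copied from the script.

import torch.distributed as dist
import torch.multiprocessing as mp

def run(proc_id, nprocs, devices):
    device = devices[proc_id]
    # One DDP process per GPU; the address and port here are placeholders.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:12345",
        world_size=nprocs,
        rank=proc_id,
    )
    # ... build the dataloaders and train on `device` ...

if __name__ == "__main__":
    devices = [0, 1, 2, 3]  # --gpu 0,1,2,3
    mp.spawn(run, args=(len(devices), devices), nprocs=len(devices))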
This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you
🐛 Bug
In the multi-GPU examples, with the current code base, setting num_workers > 1 in non-UVA mode makes DGL crash with an assertion error.
To Reproduce
Run multigpu/multi_gpu_node_classification.py in non-UVA mode with num_workers=4:
python multi_gpu_node_classification.py --gpu 0,1,2,3
Error msg and stack trace:
Expected behavior
The example should run to completion.
Environment
How you installed DGL (conda, pip, source): source
Additional context
Checking with the profiler, I think the assertion error might be due to excessive overlap between the compute stream (the default stream) and the prefetching stream. Also, the crash goes away when use_alternate_stream=False is set.
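If the stream overlap hypothesis holds, the workaround can be expressed directly on the DataLoader. A sketch, assuming the flag is the use_alternate_streams argument of dgl.dataloading.DataLoader (verify the exact name against your DGL version), with g, train_idx, sampler, and device as in the example:

from dgl.dataloading import DataLoader

train_dataloader = DataLoader(
    g,
    train_idx,
    sampler,
    device=device,
    use_ddp=True,
    use_uva=False,
    batch_size=1024,
    shuffle=True,
    drop_last=False,
    num_workers=4,
    use_alternate_streams=False,  # keep prefetching on the default CUDA stream
)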