Closed: melnimr closed this issue 2 years ago
I tried with dgl-cu102==0.6.1 (installed via pip) and torch==1.10.1+cu102 on Ubuntu 18.04, and it works well.
Have you tried training on CPU? Does it work well there?
It is actually the same issue on CPU as well. With other examples such as gcn, both CPU and GPU modes work.
I just found that the same issue occurs in dgl-cu102==0.8.1, but it works well in dgl-cu102==0.6.1. Could you double-check the DGL version you're using via print(dgl.__version__)?
This is what my pip freeze looks like:
dgl==0.6.1
dgl-cu102==0.8.1
dgl-cu113==0.8.1
dglgo==0.0.1
I install the CUDA version this way: pip3 install dgl-cu113 dglgo -f https://data.dgl.ai/wheels/repo.html
How do you specify 0.6.1 for the CUDA version?
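The pip freeze output above lists three distributions (dgl, dgl-cu102, dgl-cu113) at two different versions. Each of them presumably ships the same importable dgl module, so it is not obvious which one the interpreter actually loads. A minimal sketch for spotting that situation is shown here; the function names find_dgl_conflicts and installed_packages are hypothetical helpers, not part of DGL, and the name matching rule (dgl or dgl-cu*) is an assumption based only on the package names seen in this thread.

```python
from importlib import metadata


def find_dgl_conflicts(installed):
    """Return {name: version} for DGL core packages when more than one is installed.

    `installed` is an iterable of (name, version) pairs. Helper packages such as
    dglgo are ignored because they do not match "dgl" or "dgl-cu*".
    """
    core = {
        name: version
        for name, version in installed
        if name == "dgl" or name.startswith("dgl-cu")
    }
    return core if len(core) > 1 else {}


def installed_packages():
    """Yield (name, version) for every distribution visible to this interpreter."""
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken metadata
            yield name.lower(), dist.version


if __name__ == "__main__":
    conflicts = find_dgl_conflicts(installed_packages())
    if conflicts:
        print("Multiple DGL builds installed:", conflicts)
    else:
        print("No conflicting DGL builds found.")
```

If the script reports more than one build, uninstalling all of them and reinstalling a single pinned build is probably the safest way to know which version is in use.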
Found out the problem!
I was installing using the instructions on the website:
pip install dgl-cu113 dglgo -f https://data.dgl.ai/wheels/repo.html
which results in version 0.8.1 of the CUDA build being installed (it grabs the wheel files from the URL above). If I instead install using just pip:
pip install dgl-cu111
that installs the DGL CUDA build at 0.6.1 (in line with DGL 0.6.1).
pip3 install dgl-cu102==0.6.1 -f https://data.dgl.ai/wheels/repo.html
So this issue does not exist in 0.6.1, but it exists in 0.8.1. Let me keep an eye on it.
Yes, that is correct.
@Rhett-Ying I also reproduced the same error in DGL 0.8.0post2 and DGL 0.7.2.
@chang-l Do you plan to work on this?
Sure. I will take a look.
The root cause of the crash is this PR: https://github.com/dmlc/dgl/pull/3351/files
🐛 Bug
Running the InfoGraph example on GPU fails.
All I did is run:
To Reproduce
Steps to reproduce the behavior:
Traceback (most recent call last):
  File "semisupervised.py", line 217, in <module>
    for sup_data, unsup_data in zip(train_loader, unsup_loader):
  File "/home/neo/wellth-wrk/env/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/home/neo/wellth-wrk/env/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 570, in _next_data
    data = self._dataset_fetcher.fetch(index) # may raise StopIteration
  File "/home/neo/wellth-wrk/env/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    return self.collate_fn(data)
  File "semisupervised.py", line 116, in collate
    graph_id = dgl.broadcast_nodes(batched_graph, graph_id)
  File "/home/neo/wellth-wrk/env/lib/python3.8/site-packages/dgl/readout.py", line 418, in broadcast_nodes
    return F.repeat(graph_feat, graph.batch_num_nodes(ntype), dim=0)
  File "/home/neo/wellth-wrk/env/lib/python3.8/site-packages/dgl/backend/pytorch/tensor.py", line 189, in repeat
    return th.repeat_interleave(input, repeats, dim) # PyTorch 1.1
RuntimeError: repeats must have the same size as input along dim
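The last frames of the traceback show that broadcast_nodes repeats each graph-level row once per node of the corresponding graph, so the feature tensor needs one row per graph in the batch. The sketch below illustrates only that size invariant in plain Python. It is a simplified stand-in, not DGL's or PyTorch's implementation, and it does not explain why the two sizes diverge in 0.8.1. The function name repeat_interleave_rows is hypothetical.

```python
def repeat_interleave_rows(rows, repeats):
    """Repeat rows[i] repeats[i] times, preserving order.

    Simplified stand-in for repeating along dim 0: unlike the real PyTorch
    call, it does not accept a single repeat count shared by all rows.
    """
    if len(rows) != len(repeats):
        raise RuntimeError("repeats must have the same size as input along dim")
    out = []
    for row, count in zip(rows, repeats):
        out.extend([row] * count)
    return out


# Three graphs in a batch with 2, 1 and 3 nodes: each node receives its graph's id.
graph_ids = [0, 1, 2]
nodes_per_graph = [2, 1, 3]
node_graph_ids = repeat_interleave_rows(graph_ids, nodes_per_graph)

# A mismatch (three ids, but node counts for only two graphs) triggers the error.
try:
    repeat_interleave_rows(graph_ids, [2, 1])
    mismatch_error = None
except RuntimeError as err:
    mismatch_error = str(err)
```

Printing the length of the feature tensor and of batched_graph.batch_num_nodes() inside the collate function would be one way to check which side has the unexpected size.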
Expected behavior
Code runs and finishes training.
Environment
How DGL was installed (conda, pip, source): pip
Additional context