dmlc / dgl

Python package built to ease deep learning on graphs, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

InfoGraph example fails on GPU #3975

Closed · melnimr closed this issue 2 years ago

melnimr commented 2 years ago

🐛 Bug

Running the InfoGraph example on GPU fails.

   return th.repeat_interleave(input, repeats, dim) # PyTorch 1.1
RuntimeError: repeats must have the same size as input along dim

All I did was run:

 python infograph/semisupervised.py --gpu 0 --target mu

To Reproduce

Steps to reproduce the behavior:

  1. Go to the DGL/examples folder
  2. Run the semisupervised example

Traceback (most recent call last):
  File "semisupervised.py", line 217, in <module>
    for sup_data, unsup_data in zip(train_loader, unsup_loader):
  File "/home/neo/wellth-wrk/env/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/home/neo/wellth-wrk/env/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 570, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/neo/wellth-wrk/env/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    return self.collate_fn(data)
  File "semisupervised.py", line 116, in collate
    graph_id = dgl.broadcast_nodes(batched_graph, graph_id)
  File "/home/neo/wellth-wrk/env/lib/python3.8/site-packages/dgl/readout.py", line 418, in broadcast_nodes
    return F.repeat(graph_feat, graph.batch_num_nodes(ntype), dim=0)
  File "/home/neo/wellth-wrk/env/lib/python3.8/site-packages/dgl/backend/pytorch/tensor.py", line 189, in repeat
    return th.repeat_interleave(input, repeats, dim) # PyTorch 1.1
RuntimeError: repeats must have the same size as input along dim
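For context on the error itself: dgl.broadcast_nodes ultimately calls torch.repeat_interleave with graph.batch_num_nodes() as the repeats, which requires the broadcast tensor to have one entry per graph in the batch. A minimal, self-contained sketch of that mismatch (the shapes below are made up purely for illustration):

    import torch as th

    # torch.repeat_interleave requires `repeats` to have one entry per slice of
    # `input` along `dim`. If the per-graph tensor does not match the number of
    # graphs in the batch, it raises the error seen in the traceback above.
    graph_feat = th.arange(3)                  # values for 3 graphs (hypothetical)
    batch_num_nodes = th.tensor([5, 2, 7, 4])  # node counts for 4 graphs (hypothetical)

    try:
        th.repeat_interleave(graph_feat, batch_num_nodes, dim=0)
    except RuntimeError as e:
        print(e)  # repeats must have the same size as input along dim

    # With matching sizes the broadcast works as intended:
    print(th.repeat_interleave(th.arange(4), batch_num_nodes, dim=0))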

Expected behavior

Code runs and finishes training.

Environment

Additional context

Rhett-Ying commented 2 years ago

I tried with dgl-cu102==0.6.1 installed via pip and torch==1.10.1+cu102 on Ubuntu 18.04, and it works well.

Have you tried training on CPU? Does it work well there?

melnimr commented 2 years ago

It is actually the same issue on CPU as well. When trying other examples such as gcn, both CPU and GPU modes work.

Rhett-Ying commented 2 years ago

I just found that the same issue is hit in dgl-cu102==0.8.1, but it works well in dgl-cu102==0.6.1. Could you double-check the DGL version you're using via print(dgl.__version__)?

melnimr commented 2 years ago

This is what my pip freeze looks like:

dgl==0.6.1
dgl-cu102==0.8.1
dgl-cu113==0.8.1
dglgo==0.0.1

I installed the CUDA version this way: pip3 install dgl-cu113 dglgo -f https://data.dgl.ai/wheels/repo.html

How do you specify 0.6.1 for the CUDA version?
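Worth noting: with dgl, dgl-cu102, and dgl-cu113 all installed side by side, it is not obvious which build "import dgl" actually picks up. A quick check (the printed values below are illustrative, not from this environment):

    import dgl

    # Confirm which of the co-installed dgl builds Python actually imports.
    print(dgl.__version__)  # e.g. '0.8.1'
    print(dgl.__file__)     # path of the package that won the import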

melnimr commented 2 years ago

Found out the problem!

I was installing using the instructions on the website:

pip install dgl-cu113 dglgo -f https://data.dgl.ai/wheels/repo.html

which results in version 0.8.1 of the CUDA build being installed (it grabs the wheel files from the URL above). If I instead install using plain pip:

pip install dgl-cu111

it installs the CUDA build at 0.6.1 instead (in line with the plain dgl 0.6.1 package).

Rhett-Ying commented 2 years ago

pip3 install dgl-cu102==0.6.1 -f https://data.dgl.ai/wheels/repo.html

So this issue does not exist in 0.6.1, but it exists in 0.8.1. Let me keep an eye on it.

melnimr commented 2 years ago

Yes, that is correct.

TristonC commented 2 years ago

@Rhett-Ying I also reproduced the same error with version 0.8.0post2 and with dgl 0.7.2.

jermainewang commented 2 years ago

@chang-l Do you plan to work on this?

chang-l commented 2 years ago

> @chang-l Do you plan to work on this?

Sure. I will take a look.

chang-l commented 2 years ago

The root cause of the crash is this PR: https://github.com/dmlc/dgl/pull/3351/files
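For reference, what the failing collate step relies on: dgl.broadcast_nodes should copy each per-graph value to every node of that graph. Below is a minimal sketch of the equivalent manual broadcast, assuming graph_id has one entry per graph in the batch (the graph sizes are made up; this is an illustration of the intended behavior, not the fix that was eventually merged):

    import dgl
    import torch as th

    # Two small random graphs, batched together (sizes are hypothetical).
    batched_graph = dgl.batch([dgl.rand_graph(4, 8), dgl.rand_graph(3, 6)])

    # One id per graph in the batch -- the shape broadcast_nodes expects.
    graph_id = th.arange(batched_graph.batch_size)

    # Manual equivalent of dgl.broadcast_nodes: repeat each graph id once per node.
    node_graph_id = th.repeat_interleave(graph_id, batched_graph.batch_num_nodes())
    print(node_graph_id)  # tensor([0, 0, 0, 0, 1, 1, 1])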