LongerZrLong opened 2 years ago
Thank you for your issue; I also solved this problem, but then ran into another one. When I run the program on 2 GPUs, it fails with the error below, and I don't know how to fix it. If you have run this program successfully, could you give me some advice?
```
Traceback (most recent call last):
  File "/home/light-dist-gnn/main.py", line 37, in <module>
    torch.multiprocessing.spawn(process_wrapper, process_args, args.nprocs)
  File "/root/miniconda3/envs/gnn/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/miniconda3/envs/gnn/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/root/miniconda3/envs/gnn/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/miniconda3/envs/gnn/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/light-dist-gnn/main.py", line 24, in process_wrapper
    func(env, args)
  File "/home/light-dist-gnn/dist_train.py", line 71, in main
    train(g, env, total_epoch=args.epoch)
  File "/home/light-dist-gnn/dist_train.py", line 39, in train
    outputs = model(g.features)
  File "/root/miniconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/light-dist-gnn/models/cached_gcn.py", line 105, in forward
    hidden_features = F.relu(DistGCNLayer.apply(features, self.weight1, self.g.adj_parts, 'L1'))
  File "/home/light-dist-gnn/models/cached_gcn.py", line 75, in forward
    z_local = cached_broadcast(adj_parts, features, 'Forward'+tag)
  File "/home/light-dist-gnn/models/cached_gcn.py", line 56, in cached_broadcast
    dist.broadcast(feature_bcast, src=src)
  File "/root/miniconda3/envs/gnn/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1159, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: Tensors must be CUDA and dense
```
It has been a while since I last ran the code, and I am not sure whether I ran it on GPUs or only on the CPU. I would recommend first running on the CPU only to check that the code works, since the error in your stack trace is likely related to CUDA handling in torch.
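As a debugging aid, here is a minimal sketch of the idea behind that advice, assuming the usual `torch.distributed` rule that the NCCL backend only handles dense CUDA tensors (a common cause of "Tensors must be CUDA and dense"), while Gloo works on CPU tensors. The helper names below are hypothetical, not part of light-dist-gnn:

```python
# Sketch (assumed names): match the process-group backend to where the
# tensors live before calling dist.broadcast.

def pick_backend(cuda_available: bool) -> str:
    """NCCL broadcasts require dense CUDA tensors; Gloo handles CPU tensors."""
    return "nccl" if cuda_available else "gloo"

def broadcast_features(feature_bcast, src, backend):
    # torch is imported lazily so pick_backend stays usable without it.
    import torch.distributed as dist
    if backend == "nccl":
        # NCCL rejects CPU and sparse tensors: move to GPU and densify first.
        feature_bcast = feature_bcast.cuda()
        if feature_bcast.is_sparse:
            feature_bcast = feature_bcast.to_dense()
    dist.broadcast(feature_bcast, src=src)
    return feature_bcast
```

Running with `pick_backend(False)` (Gloo, CPU tensors) first, as suggested above, separates CUDA placement problems from logic bugs in the training code.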
https://github.com/chenzhao/light-dist-gnn/blob/65495aa8d2e851c875986b344b61b77b12953e29/coo_graph/parted_coo_graph.py#L92
I tried to run `prepare_data.py` and noticed that the positional arguments here are used incorrectly: as written, `self.preprocess_for` ends up being passed to the `device` parameter of `Parted_COO_Graph`. The call should use keyword arguments (or match the declared parameter order) so that each value reaches the intended parameter.
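The pitfall described above can be reproduced with a plain-Python sketch. The constructor signature below is hypothetical, for illustration only, not the actual `Parted_COO_Graph` signature: when a caller passes values positionally in the wrong order, a later value silently lands in the wrong parameter.

```python
# Hypothetical class standing in for a Parted_COO_Graph-like constructor.
class PartedGraph:
    def __init__(self, path, device, num_parts=2):
        self.path = path
        self.device = device
        self.num_parts = num_parts

def preprocess_for(name):
    # Stand-in for a preprocessing step whose result is passed by mistake.
    return f"preprocessed:{name}"

# Buggy call: the second positional value binds to `device`.
bad = PartedGraph("graph.bin", preprocess_for("gcn"))
# bad.device is now "preprocessed:gcn" instead of a real device string.

# Fixed call: keyword arguments pin each value to its parameter.
good = PartedGraph("graph.bin", device="cuda:0", num_parts=4)
```

Because Python binds positional arguments strictly by order, no error is raised at the call site; the mistake only surfaces later when `device` is used, which is why keyword arguments are the safer invocation style here.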