chenzhao / light-dist-gnn


Incorrect positional argument #2

Open LongerZrLong opened 2 years ago

LongerZrLong commented 2 years ago

https://github.com/chenzhao/light-dist-gnn/blob/65495aa8d2e851c875986b344b61b77b12953e29/coo_graph/parted_coo_graph.py#L92

I tried to run prepare_data.py and noticed that the positional-argument usage here is incorrect. The correct way to invoke the function should be:

Parted_COO_Graph(self.name, i, num_parts, preprocess_for=self.preprocess_for)

Otherwise, self.preprocess_for is passed to the device argument of Parted_COO_Graph.
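The mis-binding described above can be illustrated with a minimal sketch. The parameter names below are assumed from the issue, not taken from the actual Parted_COO_Graph signature:

```python
# Hypothetical signature sketch, mirroring the shape of the call in the issue.
def make_graph(name, rank, num_parts, device='cpu', preprocess_for=None):
    return {'name': name, 'rank': rank, 'num_parts': num_parts,
            'device': device, 'preprocess_for': preprocess_for}

# Passing the value positionally binds it to `device`, not `preprocess_for`:
wrong = make_graph('reddit', 0, 4, 'GCN')
print(wrong['device'])          # 'GCN'  (unintended)
print(wrong['preprocess_for'])  # None

# Passing it as a keyword binds it correctly:
right = make_graph('reddit', 0, 4, preprocess_for='GCN')
print(right['device'])          # 'cpu'
print(right['preprocess_for'])  # 'GCN'
```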

BearBiscuit05 commented 1 year ago

Thank you for your issue; I solved this problem as well. However, I ran into another problem: when I run the program on 2 GPUs, it fails with the following error, and I don't know how to solve it. If you have run this program successfully, could you give some advice?

Traceback (most recent call last):
  File "/home/light-dist-gnn/main.py", line 37, in <module>
    torch.multiprocessing.spawn(process_wrapper, process_args, args.nprocs)
  File "/root/miniconda3/envs/gnn/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/miniconda3/envs/gnn/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/root/miniconda3/envs/gnn/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/miniconda3/envs/gnn/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/light-dist-gnn/main.py", line 24, in process_wrapper
    func(env, args)
  File "/home/light-dist-gnn/dist_train.py", line 71, in main
    train(g, env, total_epoch=args.epoch)
  File "/home/light-dist-gnn/dist_train.py", line 39, in train
    outputs = model(g.features)
  File "/root/miniconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/light-dist-gnn/models/cached_gcn.py", line 105, in forward
    hidden_features = F.relu(DistGCNLayer.apply(features, self.weight1, self.g.adj_parts, 'L1'))
  File "/home/light-dist-gnn/models/cached_gcn.py", line 75, in forward
    z_local = cached_broadcast(adj_parts, features, 'Forward'+tag)
  File "/home/light-dist-gnn/models/cached_gcn.py", line 56, in cached_broadcast
    dist.broadcast(feature_bcast, src=src)
  File "/root/miniconda3/envs/gnn/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1159, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: Tensors must be CUDA and dense
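"Tensors must be CUDA and dense" is the error NCCL raises when dist.broadcast receives a tensor that is sparse or still on the CPU. A hedged sketch of a preprocessing helper (not part of light-dist-gnn) that makes a tensor acceptable to the NCCL backend:

```python
import torch

def prepare_for_nccl(tensor, rank):
    # NCCL's dist.broadcast only accepts dense CUDA tensors, which is the
    # likely cause of the error above. This is an illustrative helper,
    # not the project's actual code.
    if tensor.is_sparse:
        tensor = tensor.to_dense()          # NCCL rejects sparse layouts
    if torch.cuda.is_available():
        tensor = tensor.to(f'cuda:{rank}')  # each process owns one GPU
    return tensor
```

In the failing call, checking whether `feature_bcast` (and the graph's features generally) has been moved to the process's GPU before `dist.broadcast` would be a reasonable first debugging step.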
LongerZrLong commented 1 year ago

It has been a while since I last ran the code, and I am not sure whether I ran it with GPUs or with the CPU only. I would recommend first running it on CPU only to see whether the code works, since the error in your stack trace is likely related to CUDA in torch.
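One way to follow this suggestion is to initialize the process group with the gloo backend, which works with CPU tensors and so sidesteps the CUDA requirement. A minimal sketch (the address, port, and function name are illustrative, not from the project):

```python
import torch.distributed as dist

def init_cpu_process_group(rank, world_size):
    # gloo accepts CPU tensors, unlike NCCL, so a CPU-only run can
    # confirm whether the logic works before debugging the GPU path.
    dist.init_process_group(
        backend='gloo',
        init_method='tcp://127.0.0.1:29500',
        rank=rank,
        world_size=world_size,
    )
```

With this in place, a single-process run (`world_size=1`) can exercise the training loop end-to-end on CPU.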