facebookresearch / NARS

Scalable Graph Neural Networks for Heterogeneous Graphs

dgl loading mag dataset device error #9

Open khan-yin opened 1 year ago

khan-yin commented 1 year ago

Dear author, I cloned the code and ran it with dgl-cu113, pytorch+cu113, and Python 3.9 on the ogbn-mag dataset, and I hit a problem loading the embedding.pt file generated by TransE. Do you have any solution that doesn't require changing my conda environment?

Traceback (most recent call last):
  File "/home/yikh/mycode/NARS/train.py", line 139, in <module>
    main(args)
  File "/home/yikh/mycode/NARS/train.py", line 45, in main
    data = load_data(device, args)
  File "/home/yikh/mycode/NARS/data.py", line 107, in load_data
    return load_mag(device, args)
  File "/home/yikh/mycode/NARS/data.py", line 160, in load_mag
    g.nodes["author"].data["feat"] = author_emb.to(device)
  File "/home/yikh/.conda/envs/ykh/lib/python3.9/site-packages/dgl/view.py", line 90, in __setitem__
    self._graph._set_n_repr(self._ntid, self._nodes, {key: val})
  File "/home/yikh/.conda/envs/ykh/lib/python3.9/site-packages/dgl/heterograph.py", line 4122, in _set_n_repr
    raise DGLError('Cannot assign node feature "{}" on device {} to a graph on'
dgl._ffi.base.DGLError: Cannot assign node feature "feat" on device cuda:0 to a graph on device cpu. Call DGLGraph.to() to copy the graph to the same device.

If I add g = g.to(device) before the features are loaded, I get another device error, shown below. I tried to fix the bug myself, but I only produced more errors 😭. Thanks a lot.

Traceback (most recent call last):
  File "/home/yikh/mycode/NARS/train.py", line 139, in <module>
    main(args)
  File "/home/yikh/mycode/NARS/train.py", line 52, in main
    feats = preprocess_features(g, rel_subsets, args, device)
  File "/home/yikh/mycode/NARS/train.py", line 26, in preprocess_features
    feats = gen_rel_subset_feature(g, subset, args, device)
  File "/home/yikh/mycode/NARS/data.py", line 45, in gen_rel_subset_feature
    src = src.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
lingfanyu commented 1 year ago

Hi @khan-yin

This project was developed when DGL only stored graph structure in CPU memory. As a result, it's not compatible with newer features of DGL. Since DGL is a fast-developing library, I can imagine that the latest DGL is very different from the version I used more than 2 years ago, and much of the code is potentially broken. Ideally, the preprocessing pipeline should be reworked, but unfortunately I won't have time to do so.

Essentially, the problem is that the code (when I wrote it) assumes graph-structure-related preprocessing happens on CPU, which violates the current DGL requirement that a graph and its features live on the same device. I think you have two straightforward options to fix the issue:

1) Perform all preprocessing on CPU. Avoid moving the graph or any node/edge feature to GPU during data loading; after preprocessing is done, move the graph and features to GPU and start training. The downside is that you can't use the GPU for preprocessing, so it may take longer. See the first sketch below.

2) Perform all preprocessing on GPU. This is like what you did: move the graph to GPU at the beginning, and I think this is the right way to go. The line src = src.numpy() was only there because DGL 0.4.3 (released 2-3 years ago) couldn't take a framework tensor (namely a PyTorch tensor) directly as input when constructing a graph. If DGL can now take a torch GPU tensor directly, then the line src = src.numpy() is no longer needed. See the second sketch below.
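A minimal sketch of option 1, using the call sites from the traceback above (the unpacking of data is a guess on my part, so adjust it to whatever train.py actually returns):

```python
import torch

# Option 1: run all preprocessing on CPU, move results to GPU afterwards.
cpu = torch.device("cpu")

data = load_data(cpu, args)          # graph and features stay on CPU
g, rel_subsets = data[0], data[-1]   # hypothetical unpacking; adjust as needed

# graph-structure preprocessing happens on CPU, so the numpy-based code works
feats = preprocess_features(g, rel_subsets, args, cpu)

# only now move what the training loop needs to the GPU
feats = [x.to(device) for x in feats]
```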
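And a sketch of option 2, rewriting the edge-extraction loop from data.py's gen_rel_subset_feature so it keeps torch tensors end to end. This assumes a recent DGL where dgl.heterograph accepts torch tensors (including CUDA tensors) directly; the loop body and the reverse-edge naming are simplifications, not the exact code:

```python
import dgl

def build_rel_subset_graph(g, rel_subset):
    # Build the relation-subset graph without the numpy round trip.
    # `g` may live on GPU; src/dst stay torch tensors on the same device.
    new_edges = {}
    for etype in rel_subset:
        stype, rel, dtype = g.to_canonical_etype(etype)
        src, dst = g.all_edges(etype=(stype, rel, dtype))
        # old code did: src = src.numpy(); dst = dst.numpy() -- drop that
        new_edges[(stype, rel, dtype)] = (src, dst)
        # also add reverse edges so features propagate both ways
        # (the "rev-" naming is illustrative, not the project's exact scheme)
        new_edges[(dtype, "rev-" + rel, stype)] = (dst, src)
    return dgl.heterograph(
        new_edges,
        num_nodes_dict={nt: g.num_nodes(nt) for nt in g.ntypes},
    )
```

If dgl.heterograph on your DGL version still rejects CUDA tensors, the cheap fallback is src = src.cpu().numpy() (as your second traceback suggests), at the cost of a device round trip.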