Graph-Learning-Benchmarks / gli

🗂 Graph Learning Indexer: a contributor-friendly and metadata-rich platform for graph learning benchmarks. Dataloading, Benchmarking, Tagging, and more!
https://graph-learning-benchmarks.github.io/gli/
MIT License

[BUG] ogbn-mag_task.npz loading error #171

Closed jiaqima closed 2 years ago

jiaqima commented 2 years ago

Describe the bug

Loading ogbn-mag_task.npz raises an error. Inspecting the file shows that the value stored under the key "train" is a 0-d object array wrapping a dictionary, not a plain index array:

array({'paper': array([     0,      1,      2, ..., 736386, 736387, 736388])}, dtype=object)

This value comes from the following code in ogbn-mag.ipynb:

from ogb.nodeproppred import NodePropPredDataset

# For ogbn-mag (heterogeneous), each split is a dict keyed by node type.
dataset = NodePropPredDataset(name="ogbn-mag")
split_idx = dataset.get_idx_split()
train_idx, valid_idx, test_idx = split_idx["train"], split_idx["valid"], split_idx["test"]
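
A minimal sketch of how such a value ends up in the .npz file, assuming the split was written with np.savez directly from the dictionary (the notebook's save step is not shown in this issue, so that part is an assumption). NumPy wraps a dict in a 0-d object array, which can only be read back with allow_pickle=True and has to be unwrapped with .item(). The index values are toy data.

```python
import io

import numpy as np

# Toy stand-in for split_idx["train"] on a heterogeneous graph:
# a dict keyed by node type rather than a plain index array.
train_idx = {"paper": np.arange(5)}

# Saving a dict through np.savez wraps it in a 0-d object array.
buffer = io.BytesIO()
np.savez(buffer, train=train_idx)
buffer.seek(0)

# Object arrays are pickled, so loading needs allow_pickle=True.
with np.load(buffer, allow_pickle=True) as npz:
    stored = npz["train"]

print(stored.shape, stored.dtype)  # prints: () object

# .item() unwraps the 0-d array and returns the original dict.
recovered = stored.item()
paper_idx = recovered["paper"]
```

Anything downstream that expects a plain index array therefore has to either unwrap the dict per node type or be given a file that stores plain arrays in the first place.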

To Reproduce

First check out the branch in PR #154, then run

import glb.dataloading

glb.dataloading.get_glb_task("ogbn-mag", "task")

Expected behavior

We need to decide whether, for a NodeClassification task on a heterogeneous graph, the train/val/test split has to distinguish between node groups.

It seems to me that we don't have to do this for the purpose of dataset storage. So first we should correct ogbn-mag_task.npz by changing the dictionary to the list of global node ids.
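
One way that conversion could look, sketched under the assumption that global node ids are assigned by concatenating the node types in a fixed order. The helper name to_global_ids, the type order, and the node counts are all hypothetical and not part of glb.

```python
import numpy as np


def to_global_ids(split_by_type, num_nodes_by_type):
    """Map per-node-type local indices to global node ids.

    Assumes global ids are laid out by concatenating node types in the
    iteration order of `num_nodes_by_type` (hypothetical convention).
    """
    offsets = {}
    start = 0
    for ntype, count in num_nodes_by_type.items():
        offsets[ntype] = start
        start += count
    parts = [
        np.asarray(idx, dtype=np.int64) + offsets[ntype]
        for ntype, idx in split_by_type.items()
    ]
    return np.concatenate(parts)


# Toy example: 3 "author" nodes come first, then 4 "paper" nodes.
num_nodes_by_type = {"author": 3, "paper": 4}
train_split = {"paper": np.array([0, 2, 3])}
train_global = to_global_ids(train_split, num_nodes_by_type)
```

The resulting array is a plain integer ndarray, so it can be stored in the .npz file without pickling.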

We also need to investigate how the DGL dataloader/dataset stores the train/val/test masks for heterogeneous graphs.

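In summary, there are two todo items:

  1. Correct ogbn-mag_task.npz by replacing the dictionary with a list of global node ids.
  2. Investigate how DGL stores the train/val/test masks for heterogeneous graphs.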

Screenshots

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jiaqima/thesis/GLB-Repo/glb/dataloading.py", line 97, in get_glb_task
    return read_glb_task(task_path, verbose=verbose)
  File "/home/jiaqima/thesis/GLB-Repo/glb/task.py", line 207, in read_glb_task
    return NodeClassificationTask(task_dict, pwd)
  File "/home/jiaqima/thesis/GLB-Repo/glb/task.py", line 117, in __init__
    super().__init__(task_dict, pwd, device)
  File "/home/jiaqima/thesis/GLB-Repo/glb/task.py", line 50, in __init__
    self._load(task_dict)
  File "/home/jiaqima/thesis/GLB-Repo/glb/task.py", line 122, in _load
    self._load_split(task_dict)
  File "/home/jiaqima/thesis/GLB-Repo/glb/task.py", line 108, in _load_split
    self.split[dataset_] = file_reader.get(path, key, self.device)
  File "/home/jiaqima/thesis/GLB-Repo/glb/utils.py", line 89, in get
    return torch.from_numpy(array).to(device=device)
TypeError: expected np.ndarray (got dict)

jiaqima commented 2 years ago

@xingjian-zhang, could you take a look at the DGL dataloader and see how they store the train/val/test splits for node classification on heterogeneous graphs?

jiaqima commented 2 years ago

See also this test result: https://github.com/Graph-Learning-Benchmarks/GLB-Repo/runs/7463071235?check_suite_focus=true

xingjian-zhang commented 2 years ago

TLDR: dgl.dataloading.DataLoader takes an argument, indices, for the train/val/test node indices. For a heterogeneous graph, this argument should be a Dict[ntype, Tensor] that maps each node type to a tensor of node indices. The actual meaning of the indices is defined by the sample() method of the DataLoader's graph_sampler.

References:

  1. Doc: 5.1 Node Classification/Regression
  2. Doc: 6.1 Training GNN for Node Classification with Neighborhood Sampling
  3. Example: RGCN Implementation
  4. Doc: dgl.dataloading.DataLoader

jiaqima commented 2 years ago

Ah, sorry, I actually meant DGL datasets for heterogeneous graphs. We don't need to worry about dataloader as long as we give the same API at the DGL datasets level.

According to the RGCN example, it seems that they just store the masks under each node type. We can do the same in our implementation. We may also need a re-indexing step.
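
A minimal sketch of that layout including the re-indexing step, assuming the stored split holds global node ids and that each node type occupies a contiguous range of them. Both are assumptions; masks_by_type is a hypothetical helper and the sizes are toy values.

```python
import numpy as np


def masks_by_type(global_idx, num_nodes_by_type):
    """Build one boolean mask per node type from global node ids.

    Assumes each node type owns a contiguous block of global ids, in
    the iteration order of `num_nodes_by_type` (hypothetical layout).
    """
    global_idx = np.asarray(global_idx, dtype=np.int64)
    masks = {}
    start = 0
    for ntype, count in num_nodes_by_type.items():
        # Re-index: keep ids in this type's block, shift to local ids.
        in_block = (global_idx >= start) & (global_idx < start + count)
        local_idx = global_idx[in_block] - start
        mask = np.zeros(count, dtype=bool)
        mask[local_idx] = True
        masks[ntype] = mask
        start += count
    return masks


# Toy example: 3 "author" nodes come first, then 4 "paper" nodes.
num_nodes_by_type = {"author": 3, "paper": 4}
train_masks = masks_by_type([3, 5, 6], num_nodes_by_type)
```

In DGL, a mask like train_masks["paper"] would typically be converted to a tensor and attached as node data on the matching node type, for example under g.nodes["paper"].data["train_mask"].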