Closed jiaqima closed 2 years ago
@xingjian-zhang, could you take a look at the DGL dataloader and see how they store the train/val/test splits for node classification on heterogeneous graphs?
See also this test result: https://github.com/Graph-Learning-Benchmarks/GLB-Repo/runs/7463071235?check_suite_focus=true
TLDR: dgl.dataloading.DataLoader takes an argument, indices, that holds the train/val/test indices of nodes. For a heterogeneous graph, this argument should be a Dict[ntype, Tensor] that maps each node type to a tensor of node indices. The actual meaning of indices is defined by the sample() method of the DataLoader's graph_sampler.
Ah, sorry, I actually meant the DGL datasets for heterogeneous graphs. We don't need to worry about the dataloader as long as we provide the same API at the DGL dataset level.
According to the RGCN example, it seems that they just store the masks under each node type. We can do the same in our implementation, though we may also need a re-index step.
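To make the "masks under each node type" idea and the re-index step concrete, here is a minimal pure-Python sketch. It assumes that global node ids are assigned by laying out the node types one after another in a fixed order, which may not match how the actual dataset numbers its nodes. The function name global_ids_to_type_masks and the toy node counts are hypothetical, and plain lists stand in for the tensors a DGL dataset would hold.

```python
from typing import Dict, List


def global_ids_to_type_masks(
    global_ids: List[int], num_nodes: Dict[str, int]
) -> Dict[str, List[bool]]:
    """Convert a flat split of global node ids into one boolean mask per node type."""
    # Assumed layout: node types own consecutive global id ranges, in dict order.
    offsets: Dict[str, int] = {}
    start = 0
    for ntype, count in num_nodes.items():
        offsets[ntype] = start
        start += count

    masks = {ntype: [False] * count for ntype, count in num_nodes.items()}
    for gid in global_ids:
        if not 0 <= gid < start:
            raise ValueError(f"global node id {gid} is out of range")
        # Re-index: find the type whose range contains gid, then shift to a local id.
        for ntype, count in num_nodes.items():
            local = gid - offsets[ntype]
            if 0 <= local < count:
                masks[ntype][local] = True
                break
    return masks


# Toy graph: 3 "author" nodes (global ids 0-2) and 4 "paper" nodes (global ids 3-6).
num_nodes = {"author": 3, "paper": 4}
train_masks = global_ids_to_type_masks([0, 3, 5], num_nodes)
print(train_masks)
```

Each mask has the length of its own node type, so it could be attached to that type's node data once converted to a tensor.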
Describe the bug
There is an error when loading ogbn-mag_task.npz. I checked ogbn-mag_task.npz, and it turns out that the value corresponding to the key "train" is a dictionary array. This is obtained by the following code in ogbn-mag.ipynb:
To Reproduce
First check out the branch in PR #154, then run
Expected behavior
We need to decide whether, for a NodeClassification task on a heterogeneous graph, the train/val/test split has to distinguish the node group.
It seems to me that we don't have to do this for the purpose of dataset storage. So first we should correct ogbn-mag_task.npz by changing the dictionary to a list of global node ids. We also need to investigate how DGL datasets store the train/val/test masks for heterogeneous graphs.
In summary, there are two todo items:
1. Correct ogbn-mag_task.npz.
2. Make get_glb_task generate suitable train/val/test masks consistent with DGL datasets.
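As an illustration of the first todo item, the sketch below flattens a per-node-type split dictionary into a single sorted list of global node ids. It is only a sketch under an assumption: that global ids are formed by concatenating the node types in a fixed order. The function name typed_split_to_global_ids and the toy sizes are hypothetical and do not come from the repository.

```python
from typing import Dict, List


def typed_split_to_global_ids(
    typed_split: Dict[str, List[int]], num_nodes: Dict[str, int]
) -> List[int]:
    """Flatten {node type: local ids} into a sorted list of global node ids."""
    # Assumed layout: node types own consecutive global id ranges, in dict order.
    offsets: Dict[str, int] = {}
    start = 0
    for ntype, count in num_nodes.items():
        offsets[ntype] = start
        start += count

    global_ids: List[int] = []
    for ntype, local_ids in typed_split.items():
        for local in local_ids:
            if not 0 <= local < num_nodes[ntype]:
                raise ValueError(f"local id {local} is out of range for {ntype!r}")
            global_ids.append(offsets[ntype] + local)
    return sorted(global_ids)


# Toy graph: 3 "author" nodes (global ids 0-2) and 4 "paper" nodes (global ids 3-6).
num_nodes = {"author": 3, "paper": 4}
train_split = {"paper": [0, 2]}
train_ids = typed_split_to_global_ids(train_split, num_nodes)
print(train_ids)
```

A flat list of integers like this can be stored as an ordinary integer array, so the task file no longer has to hold a dictionary under the "train" key.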