dmlc / dgl

Python package built to ease deep learning on graphs, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

【GraphBolt】【Bug】exception is thrown by __iter__ of FeatureFetcher #7192

Open wangguan1995 opened 7 months ago

wangguan1995 commented 7 months ago

🐛 Bug

print graph Graph(num_nodes=104818, num_edges=1150630,
      ndata_schemes={}
      edata_schemes={})
Exception in thread Thread-1 (thread_worker):
Traceback (most recent call last):
  File "/root/anaconda3/envs/GINO/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/root/anaconda3/envs/GINO/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/root/anaconda3/envs/GINO/lib/python3.10/site-packages/torchdata/datapipes/iter/util/prefetcher.py", line 69, in thread_worker
    item = next(itr)
  File "/root/anaconda3/envs/GINO/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 173, in wrap_generator
    response = gen.send(None)
  File "/root/anaconda3/envs/GINO/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 123, in __iter__
    yield self._apply_fn(data)
  File "/root/anaconda3/envs/GINO/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 88, in _apply_fn
    return self.fn(data)
  File "/root/anaconda3/envs/GINO/lib/python3.10/site-packages/dgl/graphbolt/minibatch_transformer.py", line 38, in _transformer
    minibatch = self.transformer(minibatch)
  File "/root/anaconda3/envs/GINO/lib/python3.10/site-packages/dgl/graphbolt/feature_fetcher.py", line 97, in _read
    node_features[feature_name] = self.feature_store.read(
  File "/root/anaconda3/envs/GINO/lib/python3.10/site-packages/dgl/graphbolt/impl/basic_feature_store.py", line 58, in read
    return self._features[(domain, type_name, feature_name)].read(ids)

KeyError: "
This exception is thrown by __iter__ of 

FeatureFetcher(
    datapipe=MultiprocessingWrapper, edge_feature_keys=None, feature_store=
TorchBasedFeatureStore{

(<OnDiskFeatureDataDomain.NODE: 'node'>, 'float32', 'x_in') : 
TorchBasedFeature(feature=tensor([
    [ 0.9825,  0.4900, -0.0148],
    [ 0.9803,  0.4900, -0.0105],
    [ 0.9846,  0.2326, -0.3140],
    ...,
    [ 0.1170,  0.2753,  0.7003],
    [ 0.1107,  0.2824,  0.7003],
    [ 0.1128,  0.2753,  0.7003]]),
    metadata={},
), 

(<OnDiskFeatureDataDomain.NODE: 'node'>, 'float32', 'area'): 
TorchBasedFeature(feature=tensor([
    [0.3695],
    [0.3695],
    [0.2457],
    ...,
    [0.3603],
    [0.3604],
    [0.3604]]),
    metadata={},
)}, 
node_feature_keys=['x_in', 'area'])"

To Reproduce

Steps to reproduce the behavior:

1. Download the data: https://drive.google.com/drive/folders/1esJ-4ThKsaDQQLQMtZVowwkSlY8thJxr?usp=drive_link
2. Run the following script:

import torch
import dgl
import dgl.graphbolt as gb
graph = dgl.load_graphs("./graph.bin")[0][0]
print("print graph", graph)

feat_data = [
    gb.OnDiskFeatureData(domain="node", type="float32", name="x_in",
        format="numpy", path="./x_in.npy", in_memory=False),
    gb.OnDiskFeatureData(domain="node", type="float32", name="area",
        format="numpy", path="./area_in.npy", in_memory=False),
]
graph = gb.from_dglgraph(graph, True)
feature = gb.TorchBasedFeatureStore(feat_data)
item_set = gb.ItemSet(104818, names="seed_nodes")  
datapipe = gb.ItemSampler(item_set, batch_size=1024, shuffle=False)
datapipe = datapipe.sample_neighbor(graph, [10, 10]) # 2 layers.
datapipe = datapipe.fetch_feature(feature, node_feature_keys=["x_in", "area"])
datapipe = datapipe.copy_to(torch.device('cuda'))
dataloader = gb.DataLoader(datapipe)
mini_batch = next(iter(dataloader))
print("\nmini_batch x_in : ", mini_batch.node_features["x_in"])

Expected behavior

The "x_in" features of the 1024 src nodes should be printed, and the "x_in" features of the many dst nodes should be printed as well.

Environment

Additional context

wangguan1995 commented 7 months ago

Setting type to None works around this bug:

type="float32"  # does not work
type=None       # works
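
For reference, a minimal sketch of this workaround applied to the reproduction script above, assuming the same file paths and that type defaults to None when omitted:

import dgl.graphbolt as gb

# Workaround sketch: leave `type` unset instead of passing the dtype string.
feat_data = [
    gb.OnDiskFeatureData(domain="node", name="x_in",
        format="numpy", path="./x_in.npy", in_memory=False),
    gb.OnDiskFeatureData(domain="node", name="area",
        format="numpy", path="./area_in.npy", in_memory=False),
]
feature = gb.TorchBasedFeatureStore(feat_data)
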
wangguan1995 commented 7 months ago

Another severe issue (at least it bothers me a lot): the graph has to be saved to disk and loaded back, even though it already exists in memory:

dgl.save_graphs("./graph.bin", graph)
np.save("./x_in.npy", x_in.cpu().numpy())
np.save("./area.npy", area.cpu().numpy().reshape(-1, 1))
graph = dgl.load_graphs("./graph.bin")[0][0]

Each training epoch (500 graphs) incurs a huge I/O cost for very little improvement to my model.
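
One possible way to avoid the disk round-trip for features is to wrap the in-memory tensors directly. A minimal sketch, assuming gb.BasicFeatureStore accepts a dict keyed by (domain, type, name) tuples and gb.TorchBasedFeature wraps a tensor; the tensor shapes below are placeholders matching the reported graph:

import torch
import dgl.graphbolt as gb

# Placeholder tensors; in practice these are the existing in-memory x_in / area tensors.
x_in = torch.randn(104818, 3)
area = torch.randn(104818, 1)

# Wrap the tensors directly instead of round-tripping through .npy files.
feature = gb.BasicFeatureStore({
    ("node", None, "x_in"): gb.TorchBasedFeature(x_in),
    ("node", None, "area"): gb.TorchBasedFeature(area),
})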

rudongyu commented 7 months ago

As the provided Google Drive link doesn't contain a file named "x_in.npy", I assume it has been replaced by the file "node_feat.npy". I found that this matrix has shape (2, 3), which is inconsistent with the graph size.

import numpy as np

x_in = np.load("node_feat.npy")
print(x_in.shape)  # (2, 3)

Could you check the data you uploaded and also the shape of your local x_in matrix?
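
As a quick consistency check (a sketch, assuming the file names from the original report), the first dimension of a node feature matrix should equal the number of nodes in the graph:

import numpy as np
import dgl

graph = dgl.load_graphs("./graph.bin")[0][0]
x_in = np.load("./x_in.npy")
# Each node needs exactly one feature row.
assert x_in.shape[0] == graph.num_nodes(), (x_in.shape, graph.num_nodes())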

I didn't quite get the second question. What's the purpose of such periodic saving?

TristonNV commented 6 months ago

@mfbalin Could you help to comment here?

mfbalin commented 6 months ago

@mfbalin Could you help to comment here?

I haven't used a custom dataset before, including gb.OnDiskFeatureData. So I don't know what could be going wrong.

Rhett-Ying commented 6 months ago

@wangguan1995 In the code snippet you shared, type is used incorrectly. It should be the node/edge type name, not the data type (type="float32"). Here is the correct way to instantiate OnDiskFeatureData:

# `write_tensor_to_disk` and `test_dir` are assumed helpers (e.g. from a test setup)
# that save a tensor to `<test_dir>/<name>.pt` or `<test_dir>/<name>.npy`.
a = torch.tensor([[1, 2, 4], [2, 5, 3]])
b = torch.tensor([[[1, 2], [3, 4]], [[2, 5], [3, 4]]])
write_tensor_to_disk(test_dir, "a", a, fmt="torch")
write_tensor_to_disk(test_dir, "b", b, fmt="numpy")
feature_data = [
    gb.OnDiskFeatureData(
        domain="node",
        type="paper",  # node type name, not a dtype
        name="a",
        format="torch",
        path=os.path.join(test_dir, "a.pt"),
    ),
    gb.OnDiskFeatureData(
        domain="edge",
        type="paper:cites:paper",  # edge type name
        name="b",
        format="numpy",
        path=os.path.join(test_dir, "b.npy"),
    ),
]
feature_store = gb.TorchBasedFeatureStore(feature_data)
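
A possible follow-up check on the resulting store, assuming read takes (domain, type_name, feature_name, ids) as suggested by the traceback above:

import torch

# Read back a couple of rows of feature "a" for node type "paper".
ids = torch.tensor([0, 1])
print(feature_store.read("node", "paper", "a", ids))
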
Rhett-Ying commented 6 months ago

Corresponding documentation is available here: https://docs.dgl.ai/generated/dgl.graphbolt.TorchBasedFeatureStore.html#dgl.graphbolt.TorchBasedFeatureStore

github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you