Open Ying-1106 opened 3 months ago
This is the code that generates the dataset:
import os
import numpy as np
import dgl
import dgl.graphbolt as gb

base_dir = os.path.join(now_dir, 'HGBl_base_dir')
graph_file_path = '/data/zzh/TEST_DIR/HGBl_dir/HGBl-amazon_DGLGraph.bin'
HGBl_Graph = dgl.load_graphs(filename=graph_file_path)[0][0]

# Save the product node features.
feature = HGBl_Graph.ndata['h']
product_feat_np = feature.numpy()
product_feat_file = os.path.join(base_dir, 'product_feat_file.npy')
np.save(file=product_feat_file, arr=product_feat_np)

# Save the edges of type product-product-0 as a (2, num_edges) array.
src, dst = HGBl_Graph.edges(etype=('product', 'product-product-0', 'product'))
src = src.numpy()
dst = dst.numpy()
P0P_npy = np.stack((src, dst))
P0P_npy_file = os.path.join(base_dir, 'P0P.npy')
np.save(file=P0P_npy_file, arr=P0P_npy)

# Save the edges of type product-product-1 as a (2, num_edges) array.
src, dst = HGBl_Graph.edges(etype=('product', 'product-product-1', 'product'))
src = src.numpy()
dst = dst.numpy()
P1P_npy = np.stack((src, dst))
P1P_npy_file = os.path.join(base_dir, 'P1P.npy')
np.save(file=P1P_npy_file, arr=P1P_npy)

# Paths of the train/validation/test seed files for both edge types.
train_set_POP_path = "/data/zzh/TEST_DIR/HGBl_base_dir/train_set_P0P.npy"
train_set_P1P_path = "/data/zzh/TEST_DIR/HGBl_base_dir/train_set_P1P.npy"
val_set_POP_path = "/data/zzh/TEST_DIR/HGBl_base_dir/val_set_P0P.npy"
val_set_P1P_path = "/data/zzh/TEST_DIR/HGBl_base_dir/val_set_P1P.npy"
test_set_POP_path = "/data/zzh/TEST_DIR/HGBl_base_dir/test_set_P0P.npy"
test_set_P1P_path = "/data/zzh/TEST_DIR/HGBl_base_dir/test_set_P1P.npy"
yaml_content = f"""
dataset_name: HGBl_amazon_GB
graph:
  nodes:
    - type: product
      num: 10099
  edges:
    - type: "product:product-product-0:product"
      format: numpy
      path: {os.path.basename(P0P_npy_file)}
    - type: "product:product-product-1:product"
      format: numpy
      path: {os.path.basename(P1P_npy_file)}
feature_data:
  - domain: node
    type: product
    name: feat
    format: numpy
    in_memory: false
    path: {os.path.basename(product_feat_file)}
tasks:
  - name: link_prediction
    num_classes: 100
    train_set:
      - type: "product:product-product-0:product"
        data:
      - type: "product:product-product-1:product"
        data:
    validation_set:
      - type: "product:product-product-0:product"
        data:
      - type: "product:product-product-1:product"
        data:
    test_set:
      - type: "product:product-product-0:product"
        data:
      - type: "product:product-product-1:product"
        data:
"""
metadata_path = os.path.join(base_dir, "metadata.yaml")
with open(metadata_path, "w") as f:
    f.write(yaml_content)

dataset = gb.OnDiskDataset(base_dir).load()
graph = dataset.graph.to(device)
feature = dataset.feature.to(device)
tasks = dataset.tasks
link_pred_task = tasks[0]

datapipe = gb.ItemSampler(link_pred_task.train_set, batch_size=16, shuffle=True)
datapipe = datapipe.copy_to(device)
datapipe = datapipe.sample_uniform_negative(graph, 1)
datapipe = datapipe.sample_neighbor(graph, [-1, -1, -1])
datapipe = datapipe.fetch_feature(feature, node_feature_keys={"product": ["feat"]})
dataloader = gb.DataLoader(datapipe, num_workers=0)
Hello @Ying-1106, it would be helpful if you could provide the error message. You can also try print(train_set) to examine the training set and check whether the data is correct.
And please share which DGL version you're using.
My DGL version is 2.2.1 + cu118
print(link_pred_task.train_set)
ItemSetDict(
    itemsets={'product:product-product-0:product': ItemSet(
                  items=(tensor([[ 552, 7161],
                                 [8166, 9154],
                                 [2310, 2945],
                                 ...,
                                 [1367, 4038],
                                 [ 728, 7947],
                                 [5994, 5039]], dtype=torch.int32),),
                  names=('seeds',),
              ),
              'product:product-product-1:product': ItemSet(
                  items=(tensor([[ 454, 8906],
                                 [7462, 9232],
                                 [8126,  359],
                                 ...,
                                 [4892,  731],
                                 [6761, 3064],
                                 [8407, 9684]], dtype=torch.int32),),
                  names=('seeds',),
              )},
    names=('seeds',),
)
Whenever execution reaches the line "for step, data in enumerate(dataloader):", the program terminates abruptly and the terminal prints one of three errors: "free(): invalid size", "munmap_chunk(): invalid pointer", or "double free or corruption (out)". Which of the three appears varies from run to run.
RuntimeError (note: full exception trace is shown but execution is paused at: _run_module_as_main)
CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
This exception is thrown by iter of Bufferer(datapipe=FeatureFetcher)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
response = gen.send(None)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in iter
for data in self.datapipe:
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
response = gen.send(None)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in iter
for data in self.datapipe:
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
response = gen.send(None)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 125, in iter
yield self._apply_fn(data)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 90, in _apply_fn
return self.fn(data)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/minibatch_transformer.py", line 38, in _transformer
minibatch = self.transformer(minibatch)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/subgraph_sampler.py", line 65, in _preprocess
) = SubgraphSampler._seeds_preprocess(minibatch)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/subgraph_sampler.py", line 166, in _seeds_preprocess
unique_seeds, compacted = unique_and_compact(nodes)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/internal/sample_utils.py", line 56, in unique_and_compact
unique[ntype], compacted[ntype] = unique_and_compact_per_type(
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/internal/sample_utils.py", line 47, in unique_and_compact_per_type
unique, compacted, = torch.ops.graphbolt.unique_and_compact(
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/ops.py", line 854, in call
return self._op(*args, **(kwargs or {}))
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
This exception is thrown by iter of MiniBatchTransformer(datapipe=UniformNegativeSampler, transformer=_preprocess)
During handling of the above exception, another exception occurred:
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
response = gen.send(None)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in iter
for data in self.datapipe:
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
response = gen.send(None)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in iter
for data in self.datapipe:
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 203, in wrap_generator
full_msg = f"{msg} {datapipe.class.name}({_generate_input_args_string(datapipe)})"
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 43, in _generate_input_args_string
result.append((name, _simplify_obj_name(value)))
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 27, in _simplify_obj_name
return repr(obj)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/impl/fused_csc_sampling_graph.py", line 39, in repr
csc_indptr_str = str(self.csc_indptr)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor.py", line 464, in repr
return torch._tensor_str._str(self, tensor_contents=tensor_contents)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 697, in _str
return _str_intern(self, tensor_contents=tensor_contents)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 617, in _str_intern
tensor_str = _tensor_str(self, indent)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 349, in _tensor_str
formatter = _Formatter(get_summarized_data(self) if summarize else self)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 375, in get_summarized_data
return torch.cat(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
This exception is thrown by iter of CompactPerLayer(datapipe=SamplePerLayer, deduplicate=True)
During handling of the above exception, another exception occurred:
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
response = gen.send(None)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in iter
for data in self.datapipe:
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
response = gen.send(None)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in iter
for data in self.datapipe:
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 203, in wrap_generator
full_msg = f"{msg} {datapipe.class.name}({_generate_input_args_string(datapipe)})"
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 43, in _generate_input_args_string
result.append((name, _simplify_obj_name(value)))
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 27, in _simplify_obj_name
return repr(obj)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/impl/fused_csc_sampling_graph.py", line 39, in repr
csc_indptr_str = str(self.csc_indptr)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor.py", line 464, in repr
return torch._tensor_str._str(self, tensor_contents=tensor_contents)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 697, in _str
return _str_intern(self, tensor_contents=tensor_contents)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 617, in _str_intern
tensor_str = _tensor_str(self, indent)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 349, in _tensor_str
formatter = _Formatter(get_summarized_data(self) if summarize else self)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 375, in get_summarized_data
return torch.cat(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
This exception is thrown by iter of CompactPerLayer(datapipe=SamplePerLayer, deduplicate=True)
During handling of the above exception, another exception occurred:
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
response = gen.send(None)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in iter
for data in self.datapipe:
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
response = gen.send(None)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/dataloader.py", line 68, in iter
yield from self.dataloader
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in next
data = self._next_data()
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 675, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 41, in fetch
data = next(self.dataset_iter)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 152, in next
return self._get_next()
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 140, in _get_next
result = next(self.iterator)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 224, in wrap_next
result = next_func(*args, **kwargs)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/datapipe.py", line 383, in next
return next(self._datapipe_iter)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
response = gen.send(None)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in iter
for data in self.datapipe:
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
response = gen.send(None)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in iter
for data in self.datapipe:
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
response = gen.send(None)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in iter
for data in self.datapipe:
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
response = gen.send(None)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in iter
for data in self.datapipe:
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 203, in wrap_generator
full_msg = f"{msg} {datapipe.class.name}({_generate_input_args_string(datapipe)})"
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 43, in _generate_input_args_string
result.append((name, _simplify_obj_name(value)))
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 27, in _simplify_obj_name
return repr(obj)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/impl/fused_csc_sampling_graph.py", line 39, in repr
csc_indptr_str = str(self.csc_indptr)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor.py", line 464, in repr
return torch._tensor_str._str(self, tensor_contents=tensor_contents)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 697, in _str
return _str_intern(self, tensor_contents=tensor_contents)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 617, in _str_intern
tensor_str = _tensor_str(self, indent)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 349, in _tensor_str
formatter = _Formatter(get_summarized_data(self) if summarize else self)
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 375, in get_summarized_data
return torch.cat(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
This exception is thrown by iter of CompactPerLayer(datapipe=SamplePerLayer, deduplicate=True)
During handling of the above exception, another exception occurred:
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 375, in get_summarized_data
return torch.cat(
File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 385, in TORCH_USE_CUDA_DSA
to enable device-side assertions.
This exception is thrown by iter of Bufferer(datapipe=FeatureFetcher)
How do you generate the train_set? Are the node IDs in each seed edge-type-wise?
tasks:
train_set_POP = np.load(train_set_POP_path)
train_set_P1P = np.load(train_set_P1P_path)

print(train_set_POP):
array([[ 552, 7161], [8166, 9154], [2310, 2945], ..., [1367, 4038], [ 728, 7947], [5994, 5039]])

print(train_set_P1P):
array([[ 454, 8906], [7462, 9232], [8126, 359], ..., [4892, 731], [6761, 3064], [8407, 9684]])
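One way to approach the question about node IDs is to check that every seed ID fits within the node count of its node type, since out-of-range indices are a frequent (though not the only) cause of illegal memory accesses in GPU kernels. The following is a minimal sketch under stated assumptions: find_out_of_range_seeds is a hypothetical helper and not part of DGL, the three sample pairs are copied from the printed array, and 10099 is the product count from the metadata.

```python
def find_out_of_range_seeds(seeds, num_src_nodes, num_dst_nodes):
    """Return (row_index, src, dst) for every seed pair with an ID outside its range.

    Hypothetical helper for illustration; `seeds` is any iterable of (src, dst)
    pairs, for example a list of lists or the rows of an (N, 2) NumPy array.
    """
    bad = []
    for row, (src, dst) in enumerate(seeds):
        src, dst = int(src), int(dst)
        if not (0 <= src < num_src_nodes and 0 <= dst < num_dst_nodes):
            bad.append((row, src, dst))
    return bad


# Pairs copied from the printed training set; 10099 is the product node count.
sample_seeds = [[552, 7161], [8166, 9154], [2310, 2945]]
print(find_out_of_range_seeds(sample_seeds, 10099, 10099))  # prints []

# A deliberately invalid pair, made up for illustration only.
print(find_out_of_range_seeds([[552, 10099]], 10099, 10099))  # prints [(0, 552, 10099)]
```

An empty result only rules out range problems in the rows that were checked; it says nothing about other causes such as tensors living on different devices.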
To get to the root cause, I recommend narrowing down the case with the following suggestions:
- Does it crash on the first iteration?
- Could you try CPU sampling?
- Try a small fanout and a single layer.
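The suggestions above, together with the CUDA_LAUNCH_BLOCKING hint printed in the traceback, could be prepared at the top of the script roughly as follows. This is a sketch, not a DGL feature: DEBUG_ON_CPU is a hypothetical switch introduced for illustration, and the fanout [2] is an arbitrary small value standing in for the original [-1, -1, -1].

```python
import os

# The variable needs to be in the environment before CUDA is initialised,
# so the safe choice is to set it before importing torch or dgl.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Hypothetical switch for the debugging session; not a DGL option.
DEBUG_ON_CPU = True

# Sample on the CPU first; switch back to the GPU once the CPU run is clean.
device = "cpu" if DEBUG_ON_CPU else "cuda"

# One layer with a small fanout instead of three full-neighborhood layers.
fanouts = [2]

print(device, fanouts)
```

The resulting device and fanouts values would then replace the ones passed to copy_to and sample_neighbor in the pipeline from the original post. With synchronous kernel launches, the Python frame that raises is more likely to be the one that actually failed.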
Thank you for your patient response. I have now resolved the issue, and the code for link prediction and node classification on heterogeneous graphs is running correctly. The previous bug might have been due to inconsistent devices.
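Since the author suspects inconsistent devices, a small check run before iterating the dataloader can make such a mismatch visible. The sketch below is an assumption-laden illustration: report_devices and all_on_same_device are hypothetical helpers, they rely only on the objects exposing a device attribute (as torch tensors do), and the stand-in objects exist solely so the example runs without a GPU.

```python
from types import SimpleNamespace


def report_devices(**named_objects):
    """Map each name to str(obj.device); objects without .device show 'unknown'."""
    return {
        name: str(getattr(obj, "device", "unknown"))
        for name, obj in named_objects.items()
    }


def all_on_same_device(**named_objects):
    """Return True when every reported device string is identical."""
    return len(set(report_devices(**named_objects).values())) <= 1


# Stand-ins for illustration; in practice pass the real tensors instead.
graph_indptr = SimpleNamespace(device="cuda:0")
seed_tensor = SimpleNamespace(device="cpu")

print(report_devices(graph=graph_indptr, seeds=seed_tensor))
# prints {'graph': 'cuda:0', 'seeds': 'cpu'}
print(all_on_same_device(graph=graph_indptr, seeds=seed_tensor))
# prints False
```

Whether a given GraphBolt object exposes a device attribute is not confirmed by this thread, which is why the helper falls back to 'unknown' rather than raising.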
This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you
🐛 Bug
When using GraphBolt for a heterogeneous graph link prediction task, I frequently hit errors during batch generation. I created a dataset called HGBl-amazon, which has one node type, product, and two edge types, Product-0-Product and Product-1-Product. I constructed a link prediction task and stored the edge information in the train_set, val_set and test_set, as in the GraphBolt examples. However, I always encounter errors while iterating through the dataloader.
code:
self.model.train()
loss_all = 0.0
for i, data in enumerate(self.train_dataloader):  # this line always raises an error