Ying-1106 commented 3 months ago

🐛 Bug

When using GraphBolt for a link prediction task on a heterogeneous graph, I frequently hit errors during batch generation. I created a dataset called HGBl-amazon with one node type, product, and two edge types, Product-0-Product and Product-1-Product. I set up a link prediction task and stored the edge information in train_set, val_set and test_set, following the GraphBolt examples. However, iterating through the dataloader always raises an error.

Code:

    self.model.train()
    loss_all = 0.0
    for i, data in enumerate(self.train_dataloader):  # this line always raises an error

Ying-1106 commented 3 months ago

This is the code that generates the dataset:

```python
import os

import dgl
import dgl.graphbolt as gb
import numpy as np

# `now_dir` and `device` are defined elsewhere in the script (not shown).
base_dir = os.path.join(now_dir, 'HGBl_base_dir')

# Construct the OnDiskDataset from an existing DGLGraph.
graph_file_path = '/data/zzh/TEST_DIR/HGBl_dir/HGBl-amazon_DGLGraph.bin'
HGBl_Graph = dgl.load_graphs(filename=graph_file_path)[0][0]

# Node features of the single node type, saved as a numpy file.
feature = HGBl_Graph.ndata['h']
product_feat_np = feature.numpy()
product_feat_file = os.path.join(base_dir, 'product_feat_file.npy')
np.save(file=product_feat_file, arr=product_feat_np)

# Edges of type product-product-0, stacked as a (2, N) array.
src, dst = HGBl_Graph.edges(etype=('product', 'product-product-0', 'product'))
src = src.numpy()
dst = dst.numpy()
P0P_npy = np.stack((src, dst))
P0P_npy_file = os.path.join(base_dir, 'P0P.npy')
np.save(file=P0P_npy_file, arr=P0P_npy)

# Edges of type product-product-1, stacked as a (2, N) array.
src, dst = HGBl_Graph.edges(etype=('product', 'product-product-1', 'product'))
src = src.numpy()
dst = dst.numpy()
P1P_npy = np.stack((src, dst))
P1P_npy_file = os.path.join(base_dir, 'P1P.npy')
np.save(file=P1P_npy_file, arr=P1P_npy)

# The edge information numpy files for train_set, val_set and test_set are
# already stored locally. Each set holds the source and target node IDs of
# the two edge types, P-0-P and P-1-P.

# Train set
train_set_POP_path = "/data/zzh/TEST_DIR/HGBl_base_dir/train_set_P0P.npy"
train_set_P1P_path = "/data/zzh/TEST_DIR/HGBl_base_dir/train_set_P1P.npy"

# Val set
val_set_POP_path = "/data/zzh/TEST_DIR/HGBl_base_dir/val_set_P0P.npy"
val_set_P1P_path = "/data/zzh/TEST_DIR/HGBl_base_dir/val_set_P1P.npy"

# Test set
test_set_POP_path = "/data/zzh/TEST_DIR/HGBl_base_dir/test_set_P0P.npy"
test_set_P1P_path = "/data/zzh/TEST_DIR/HGBl_base_dir/test_set_P1P.npy"

yaml_content = f"""
dataset_name: HGBl_amazon_GB
graph:
  nodes:
    - type: product
      num: 10099
  edges:
    - type: "product:product-product-0:product"
      format: numpy
      path: {os.path.basename(P0P_npy_file)}
    - type: "product:product-product-1:product"
      format: numpy
      path: {os.path.basename(P1P_npy_file)}
feature_data:
  - domain: node
    type: product
    name: feat
    format: numpy
    in_memory: false
    path: {os.path.basename(product_feat_file)}
tasks:
  - name: link_prediction
    num_classes: 100
    train_set:
      - type: "product:product-product-0:product"
        data:
          - name: seeds
            format: numpy
            path: {os.path.basename(train_set_POP_path)}
      - type: "product:product-product-1:product"
        data:
          - name: seeds
            format: numpy
            path: {os.path.basename(train_set_P1P_path)}
    validation_set:
      - type: "product:product-product-0:product"
        data:
          - name: seeds
            format: numpy
            path: {os.path.basename(val_set_POP_path)}
      - type: "product:product-product-1:product"
        data:
          - name: seeds
            format: numpy
            path: {os.path.basename(val_set_P1P_path)}
    test_set:
      - type: "product:product-product-0:product"
        data:
          - name: seeds
            format: numpy
            path: {os.path.basename(test_set_POP_path)}
      - type: "product:product-product-1:product"
        data:
          - name: seeds
            format: numpy
            path: {os.path.basename(test_set_P1P_path)}
"""

metadata_path = os.path.join(base_dir, "metadata.yaml")
with open(metadata_path, "w") as f:
    f.write(yaml_content)

dataset = gb.OnDiskDataset(base_dir).load()
graph = dataset.graph.to(device)
feature = dataset.feature.to(device)
tasks = dataset.tasks
link_pred_task = tasks[0]

datapipe = gb.ItemSampler(link_pred_task.train_set, batch_size=16, shuffle=True)
datapipe = datapipe.copy_to(device)
datapipe = datapipe.sample_uniform_negative(graph, 1)
datapipe = datapipe.sample_neighbor(graph, [-1, -1, -1])
datapipe = datapipe.fetch_feature(
    feature, node_feature_keys={"product": ["feat"]}
)

dataloader = gb.DataLoader(datapipe, num_workers=0)
```
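The script above saves graph edges with `np.stack((src, dst))`, while the seeds printed later in this thread have one `(src, dst)` pair per row. The two layouts are easy to mix up, so here is a small sketch of the difference using made-up node IDs. Whether GraphBolt requires exactly these layouts is not confirmed anywhere in this thread, so treat it only as a shape check to run on your own files.

```python
import numpy as np

# Made-up node IDs, only to illustrate the two array layouts.
src = np.array([0, 1, 2, 3])
dst = np.array([1, 2, 3, 0])

# Layout used for P0P.npy / P1P.npy in the script: row 0 is src, row 1 is dst.
edges = np.stack((src, dst))

# Layout of the printed seeds: one (src, dst) pair per row.
seeds = edges.T

print(edges.shape, seeds.shape)
```

Loading each `.npy` file and printing its `.shape` is a quick way to see which layout a given file actually uses.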

Skeleton003 commented 3 months ago

Hello @Ying-1106, it'd be helpful if you could provide the error message. You can also try `print(train_set)` to examine the training set and check whether the data is correct.

Rhett-Ying commented 3 months ago

And please share which DGL version you're using.

Ying-1106 commented 3 months ago

> And please share which DGL version you're using.

My DGL version is 2.2.1 + cu118
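For reports like this it helps to paste the exact installed versions. A minimal sketch using only the standard library follows. The distribution names `dgl` and `torch` are assumptions about how the packages were installed, and `installed_version` is a hypothetical helper, not part of DGL.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names are assumptions; adjust them to match your environment.
for name in ("dgl", "torch"):
    print(name, installed_version(name))
```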

Ying-1106 commented 3 months ago

> Hello @Ying-1106, it'd be helpful if you can provide the error message. And you can try print(train_set) to examine the training set and check if data is correct.

When I print train_set:

    print(link_pred_task.train_set)
    ItemSetDict(
        itemsets={'product:product-product-0:product': ItemSet(
                     items=(tensor([[ 552, 7161],
                                    [8166, 9154],
                                    [2310, 2945],
                                    ...,
                                    [1367, 4038],
                                    [ 728, 7947],
                                    [5994, 5039]], dtype=torch.int32),),
                     names=('seeds',),
                 ),
                 'product:product-product-1:product': ItemSet(
                     items=(tensor([[ 454, 8906],
                                    [7462, 9232],
                                    [8126,  359],
                                    ...,
                                    [4892,  731],
                                    [6761, 3064],
                                    [8407, 9684]], dtype=torch.int32),),
                     names=('seeds',),
                 )},
        names=('seeds',),
    )

The error:

Whenever I step through the line `for step, data in enumerate(dataloader):`, the program terminates abruptly and the terminal prints one of three messages: "free(): invalid size", "munmap_chunk(): invalid pointer", or "double free or corruption (out)".

Ying-1106 commented 3 months ago

> Hello @Ying-1106, it'd be helpful if you can provide the error message. And you can try print(train_set) to examine the training set and check if data is correct.

This is the error message:

    RuntimeError (note: full exception trace is shown but execution is paused at: _run_module_as_main)
    CUDA error: an illegal memory access was encountered
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

    This exception is thrown by __iter__ of Bufferer(datapipe=FeatureFetcher)
      File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
        response = gen.send(None)
      File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in __iter__
        for data in self.datapipe:
      File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
        response = gen.send(None)
      File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in __iter__
        for data in self.datapipe:
      File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
        response = gen.send(None)
      File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 125, in __iter__
        yield self._apply_fn(data)
      File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 90, in _apply_fn
        return self.fn(data)
      File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/minibatch_transformer.py", line 38, in _transformer
        minibatch = self.transformer(minibatch)
      File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/subgraph_sampler.py", line 65, in _preprocess
        ) = SubgraphSampler._seeds_preprocess(minibatch)
      File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/subgraph_sampler.py", line 166, in _seeds_preprocess
        unique_seeds, compacted = unique_and_compact(nodes)
      File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/internal/sample_utils.py", line 56, in unique_and_compact
        unique[ntype], compacted[ntype] = unique_and_compact_per_type(
      File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/internal/sample_utils.py", line 47, in unique_and_compact_per_type
        unique, compacted, _ = torch.ops.graphbolt.unique_and_compact(
      File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_ops.py", line 854, in __call__
        return self._op(*args, **(kwargs or {}))
    RuntimeError: CUDA error: an illegal memory access was encountered
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

    This exception is thrown by __iter__ of MiniBatchTransformer(datapipe=UniformNegativeSampler, transformer=_preprocess)

During handling of the above exception, another exception occurred:

File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator response = gen.send(None) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in iter for data in self.datapipe: File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator response = gen.send(None) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in iter for data in self.datapipe: File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 203, in wrap_generator full_msg = f"{msg} {datapipe.class.name}({_generate_input_args_string(datapipe)})" File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 43, in _generate_input_args_string result.append((name, _simplify_obj_name(value))) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 27, in _simplify_obj_name return repr(obj) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/impl/fused_csc_sampling_graph.py", line 39, in repr csc_indptr_str = str(self.csc_indptr) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor.py", line 464, in repr return torch._tensor_str._str(self, tensor_contents=tensor_contents) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 697, in _str return _str_intern(self, tensor_contents=tensor_contents) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 617, in _str_intern tensor_str = _tensor_str(self, indent) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 349, in _tensor_str 
formatter = _Formatter(get_summarized_data(self) if summarize else self) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 375, in get_summarized_data return torch.cat( RuntimeError: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

This exception is thrown by iter of CompactPerLayer(datapipe=SamplePerLayer, deduplicate=True)

During handling of the above exception, another exception occurred:

File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator response = gen.send(None) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in iter for data in self.datapipe: File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator response = gen.send(None) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in iter for data in self.datapipe: File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 203, in wrap_generator full_msg = f"{msg} {datapipe.class.name}({_generate_input_args_string(datapipe)})" File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 43, in _generate_input_args_string result.append((name, _simplify_obj_name(value))) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 27, in _simplify_obj_name return repr(obj) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/impl/fused_csc_sampling_graph.py", line 39, in repr csc_indptr_str = str(self.csc_indptr) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor.py", line 464, in repr return torch._tensor_str._str(self, tensor_contents=tensor_contents) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 697, in _str return _str_intern(self, tensor_contents=tensor_contents) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 617, in _str_intern tensor_str = _tensor_str(self, indent) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 349, in _tensor_str 
formatter = _Formatter(get_summarized_data(self) if summarize else self) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 375, in get_summarized_data return torch.cat( RuntimeError: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

This exception is thrown by iter of CompactPerLayer(datapipe=SamplePerLayer, deduplicate=True)

During handling of the above exception, another exception occurred:

File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator response = gen.send(None) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in iter for data in self.datapipe: File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator response = gen.send(None) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/dataloader.py", line 68, in iter yield from self.dataloader File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in next data = self._next_data() File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 675, in _next_data data = self._dataset_fetcher.fetch(index) # may raise StopIteration File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 41, in fetch data = next(self.dataset_iter) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 152, in next return self._get_next() File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 140, in _get_next result = next(self.iterator) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 224, in wrap_next result = next_func(*args, **kwargs) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/datapipe.py", line 383, in next return next(self._datapipe_iter) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator response = gen.send(None) File 
"/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in iter for data in self.datapipe: File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator response = gen.send(None) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in iter for data in self.datapipe: File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator response = gen.send(None) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in iter for data in self.datapipe: File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator response = gen.send(None) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in iter for data in self.datapipe: File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 203, in wrap_generator full_msg = f"{msg} {datapipe.class.name}({_generate_input_args_string(datapipe)})" File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 43, in _generate_input_args_string result.append((name, _simplify_obj_name(value))) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 27, in _simplify_obj_name return repr(obj) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/impl/fused_csc_sampling_graph.py", line 39, in repr csc_indptr_str = str(self.csc_indptr) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor.py", line 464, in repr return 
torch._tensor_str._str(self, tensor_contents=tensor_contents) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 697, in _str return _str_intern(self, tensor_contents=tensor_contents) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 617, in _str_intern tensor_str = _tensor_str(self, indent) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 349, in _tensor_str formatter = _Formatter(get_summarized_data(self) if summarize else self) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 375, in get_summarized_data return torch.cat( RuntimeError: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

This exception is thrown by iter of CompactPerLayer(datapipe=SamplePerLayer, deduplicate=True)

During handling of the above exception, another exception occurred:

File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 375, in get_summarized_data return torch.cat( File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 385, in return torch.stack([get_summarized_data(x) for x in (start + end)]) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 385, in get_summarized_data return torch.stack([get_summarized_data(x) for x in (start + end)]) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 349, in _tensor_str formatter = _Formatter(get_summarized_data(self) if summarize else self) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 617, in _str_intern tensor_str = _tensor_str(self, indent) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor_str.py", line 697, in _str return _str_intern(self, tensor_contents=tensor_contents) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/_tensor.py", line 464, in repr return torch._tensor_str._str(self, tensor_contents=tensor_contents) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/impl/torch_based_feature_store.py", line 225, in repr str(self._tensor), " " len(" feature=") File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/impl/torch_based_feature_store.py", line 432, in repr features_str = textwrap.indent(str(self._features), " ").strip() File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 27, in _simplify_obj_name return repr(obj) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 43, in _generate_input_args_string result.append((name, _simplify_obj_name(value))) File 
"/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 203, in wrap_generator full_msg = f"{msg} {datapipe.class.name}({_generate_input_args_string(datapipe)})" File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/base.py", line 306, in iter for data in self.datapipe: File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator response = gen.send(None) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/base.py", line 325, in iter for data in self.datapipe: File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator response = gen.send(None) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/dgl/graphbolt/base.py", line 280, in iter yield from self.datapipe File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator response = gen.send(None) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/datapipe.py", line 383, in next return next(self._datapipe_iter) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 224, in wrap_next result = next_func(args, **kwargs) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 140, in _get_next result = next(self.iterator) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 152, in next return self._get_next() File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 41, in fetch data = next(self.dataset_iter) File 
"/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 675, in _next_data data = self._dataset_fetcher.fetch(index) # may raise StopIteration File "/data/zzh/anaconda3/envs/YING/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in next data = self._next_data() File "/data/zzh/TEST_DIR/GraphBolt_异质图（链接预测有BUG）.py", line 696, in get_HGBl_amazon_GB for step, data in enumerate(dataloader): File "/data/zzh/TEST_DIR/GraphBolt_异质图（链接预测有BUG）.py", line 750, in get_HGBl_amazon_GB() File "/data/zzh/anaconda3/envs/YING/lib/python3.10/runpy.py", line 86, in _run_code exec(code, run_globals) File "/data/zzh/anaconda3/envs/YING/lib/python3.10/runpy.py", line 196, in _run_module_as_main (Current frame) return _run_code(code, main_globals, None, RuntimeError: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

This exception is thrown by iter of Bufferer(datapipe=FeatureFetcher)
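The error text itself suggests passing `CUDA_LAUNCH_BLOCKING=1`, which makes kernel launches synchronous so the reported stack frame is more likely to be the one that actually failed. A minimal sketch of setting it from inside the script is below. It assumes the variable is set before `torch` or `dgl` is imported, since it is read when CUDA is initialised.

```python
import os

# Set this before importing torch/dgl so it is in place when CUDA initialises.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Only import the GPU libraries after the variable is set, for example:
# import torch
# import dgl.graphbolt as gb

print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

Equivalently, the variable can be exported in the shell before launching the script. Expect the run to be slower while it is enabled.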

Rhett-Ying commented 3 months ago

How do you generate the train_set? Are the node IDs in each seed edge-type-wise?

Ying-1106 commented 3 months ago

> how do you generate the train_set? are the Node IDs in each seed is edge type wised?

I generate train_set from two numpy files, one for edge type P0P and one for edge type P1P, as below:

    tasks:
      - name: link_prediction
        num_classes: 2
        train_set:
          - type: "product:product-product-0:product"
            data:
              - name: seeds
                format: numpy
                path: {os.path.basename(train_set_POP_path)}
          - type: "product:product-product-1:product"
            data:
              - name: seeds
                format: numpy
                path: {os.path.basename(train_set_P1P_path)}

These are the numpy arrays in train_set:

    train_set_POP = np.load(train_set_POP_path)
    train_set_P1P = np.load(train_set_P1P_path)

    print(train_set_POP):

    array([[ 552, 7161],
           [8166, 9154],
           [2310, 2945],
           ...,
           [1367, 4038],
           [ 728, 7947],
           [5994, 5039]])

    print(train_set_P1P):

    array([[ 454, 8906],
           [7462, 9232],
           [8126,  359],
           ...,
           [4892,  731],
           [6761, 3064],
           [8407, 9684]])
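The question about node IDs can be checked mechanically. The sketch below validates that an array of seeds has shape `(N, 2)`, an integer dtype, and IDs inside `[0, num_nodes)` for the source and destination node types. `check_seeds` is a hypothetical helper written for this thread, not a GraphBolt function. The value 10099 is the `product` node count from the metadata above, and the sample rows are copied from the printed array.

```python
import numpy as np


def check_seeds(seeds, num_src, num_dst):
    """Return a list of problems found in an (N, 2) array of (src, dst) seeds."""
    seeds = np.asarray(seeds)
    problems = []
    if seeds.ndim != 2 or seeds.shape[1] != 2:
        return [f"expected shape (N, 2), got {seeds.shape}"]
    if not np.issubdtype(seeds.dtype, np.integer):
        return [f"expected an integer dtype, got {seeds.dtype}"]
    if seeds.size == 0:
        return problems
    src, dst = seeds[:, 0], seeds[:, 1]
    if src.min() < 0 or src.max() >= num_src:
        problems.append(f"source IDs outside [0, {num_src})")
    if dst.min() < 0 or dst.max() >= num_dst:
        problems.append(f"destination IDs outside [0, {num_dst})")
    return problems


NUM_PRODUCT = 10099  # node count of type `product` in the metadata above
sample = np.array([[552, 7161], [8166, 9154], [2310, 2945]])
print(check_seeds(sample, NUM_PRODUCT, NUM_PRODUCT))
```

Running it on the full arrays, for example `check_seeds(np.load(train_set_POP_path), NUM_PRODUCT, NUM_PRODUCT)`, would rule out out-of-range IDs as a cause of an illegal memory access.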

Rhett-Ying commented 3 months ago

To get at the root cause, I recommend narrowing the case down with the following checks:

- Does it crash on the first iteration?
- Could you try CPU sampling?
- Try a small fanout and a single layer.
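The first question can be answered with a small wrapper that reports the index of the first batch that fails. This is a generic sketch: `first_failure` and `flaky_batches` are hypothetical names, and the generator merely stands in for the real dataloader. It only catches Python exceptions such as the `RuntimeError` above. Aborts like "free(): invalid size" kill the process and cannot be caught this way.

```python
def first_failure(batches, max_steps=None):
    """Iterate over batches and return (step, exception) for the first failure.

    Returns None if iteration finishes, or max_steps is reached, without error.
    """
    iterator = iter(batches)
    step = 0
    while max_steps is None or step < max_steps:
        try:
            next(iterator)
        except StopIteration:
            return None
        except Exception as exc:  # e.g. the CUDA RuntimeError from the trace
            return step, exc
        step += 1
    return None


def flaky_batches():
    """Stand-in for a dataloader: two good batches, then a failure."""
    yield "batch 0"
    yield "batch 1"
    raise RuntimeError("simulated failure")


print(first_failure(flaky_batches()))
```

With the real pipeline, `first_failure(dataloader)` would replace the stand-in. For the other two checks, one plausible variation (an assumption, not confirmed in this thread) is to leave the graph and features on the CPU, drop the `copy_to(device)` step, and use a single-layer fanout such as `[2]` in `sample_neighbor`.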

Ying-1106 commented 3 months ago

> In order to dive deep into the root cause, I recommend to narrow down the case with following suggestions.
>
> - does it crash on first iteration?
> - could you try with CPU sampling?
> - try with small fanout, single layer.

Thank you for your patient response. I have now resolved the issue, and the code for link prediction and node classification on heterogeneous graphs is running correctly. The previous bug might have been due to inconsistent devices.
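Since the suspected cause was inconsistent devices, a quick report of where each object lives can catch this early. The sketch below is duck-typed: it works on anything exposing a `.device` attribute, such as PyTorch tensors. `device_report` and `all_same_device` are hypothetical helpers, and the `SimpleNamespace` objects are stand-ins for real tensors.

```python
from types import SimpleNamespace


def device_report(**named_objects):
    """Map each name to str(obj.device), or None if the object has no .device."""
    return {
        name: str(obj.device) if hasattr(obj, "device") else None
        for name, obj in named_objects.items()
    }


def all_same_device(report):
    """True if every object that reported a device reported the same one."""
    devices = {dev for dev in report.values() if dev is not None}
    return len(devices) <= 1


# Stand-ins for real tensors; with PyTorch you would pass the tensors directly.
indptr = SimpleNamespace(device="cuda:0")
features = SimpleNamespace(device="cpu")

report = device_report(indptr=indptr, features=features)
print(report, all_same_device(report))
```

In the pipeline from this thread, candidates to pass in would be tensors such as `graph.csc_indptr` (the attribute that appears in the traceback) and a batch of seeds taken after `copy_to(device)`.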

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you

dmlc / dgl

【GraphBolt】【HeteroGraph】HeteroGraph can not generate batch #7456
