dmlc / dgl

Python package built to ease deep learning on graph, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0
13.23k stars 2.99k forks

preprocess_ondisk_dataset fails on absolute paths #7392

Closed: robert-dsl closed 2 months ago

robert-dsl commented 2 months ago

🐛 Bug

I created an OnDiskDataset for my graph in order to use GraphBolt. Loading it fails during preprocessing. DGL tries to copy some files into a "preprocessed" subfolder, and to do so it builds the output paths like this:

`out_feature["path"] = os.path.join(processed_dir_prefix, feature["path"].replace("pt", "npy"))`

This would only work well if the given path does not contain a directory. In my case the second argument of `os.path.join` is an absolute path starting with "/", so the first argument is simply discarded. The output path ends up identical to the input path, and because of that the copy raises a SameFileError.
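The join behaviour described above can be checked in isolation. This is a minimal sketch, not DGL code: the variable names and file paths are hypothetical stand-ins for the values in the report, and `posixpath` is used so the result is the same on any operating system.

```python
import posixpath

# Hypothetical values standing in for the ones in the report.
dataset_dir = "/Volumes/myvolume/mydataset"
processed_dir_prefix = "preprocessed"
absolute_feature = "/Volumes/myvolume/mydataset/feat.npy"
relative_feature = "data/feat.npy"

# When a later component is absolute, join() discards everything before it.
joined_absolute = posixpath.join(processed_dir_prefix, absolute_feature)
joined_relative = posixpath.join(processed_dir_prefix, relative_feature)

print(joined_absolute)  # /Volumes/myvolume/mydataset/feat.npy
print(joined_relative)  # preprocessed/data/feat.npy

# Joining with the dataset directory afterwards does not help either:
# the absolute component wins again, so input and output paths coincide.
final_output = posixpath.join(dataset_dir, joined_absolute)
print(final_output == absolute_feature)  # True
```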

To Reproduce

Steps to reproduce the behavior:

  1. Save an OnDiskDataset on any path that starts with "/" (e.g. /Volumes/myvolume/mydataset)
  2. Try to load it using `dataset = gb.OnDiskDataset("/Volumes/myvolume/mydataset").load(tasks="link_prediction")`. This throws a SameFileError:

    Start to preprocess the on-disk dataset.
    SameFileError: '/Volumes/workspace_bucket/interim_50k/graph/train_collection-reverse-cclick-user.npy' and '/Volumes/workspace_bucket/interim_50k/graph/train_collection-reverse-cclick-user.npy' are the same file

    File , line 1
    ----> 1 dataset = gb.OnDiskDataset(BASE_DIR, force_preprocess=True).load(tasks="link_prediction")

    File /local_disk0/.ephemeral_nfs/envs/pythonEnv-51245f64-0010-4b5a-a9a7-2c77f71c4401/lib/python3.11/site-packages/dgl/graphbolt/impl/ondisk_dataset.py:685, in OnDiskDataset.__init__(self, path, include_original_edge_id, force_preprocess, auto_cast_to_optimal_dtype)
        675 def __init__(
        676     self,
        677     path: str,
       (...)
        682     # Always call the preprocess function first. If already preprocessed,
        683     # the function will return the original path directly.
        684     self._dataset_dir = path
    --> 685     yaml_path = preprocess_ondisk_dataset(
        686         path,
        687         include_original_edge_id,
        688         force_preprocess,
        689         auto_cast_to_optimal_dtype,
        690     )
        691     with open(yaml_path) as f:
        692         self._yaml_data = yaml.load(f, Loader=yaml.loader.SafeLoader)

    File /local_disk0/.ephemeral_nfs/envs/pythonEnv-51245f64-0010-4b5a-a9a7-2c77f71c4401/lib/python3.11/site-packages/dgl/graphbolt/impl/ondisk_dataset.py:485, in preprocess_ondisk_dataset(dataset_dir, include_original_edge_id, force_preprocess, auto_cast_to_optimal_dtype)
        478 output_data["path"] = os.path.join(
        479     processed_dir_prefix,
        480     input_data["path"].replace("pt", "npy"),
        481 )
        482 name = (
        483     input_data["name"] if "name" in input_data else None
        484 )
    --> 485 copy_or_convert_data(
        486     os.path.join(dataset_dir, input_data["path"]),
        487     os.path.join(dataset_dir, output_data["path"]),
        488     input_data["format"],
        489     output_data["format"],
        490     within_int32=node_ids_within_int32
        491     and name in NAMES_INDICATING_NODE_IDS,
        492 )
        494 # 7. Save the output_config.
        495 output_config_path = os.path.join(dataset_dir, preprocess_metadata_path)

    File /local_disk0/.ephemeral_nfs/envs/pythonEnv-51245f64-0010-4b5a-a9a7-2c77f71c4401/lib/python3.11/site-packages/dgl/graphbolt/internal/utils.py:123, in copy_or_convert_data(input_path, output_path, input_format, output_format, in_memory, is_feature, within_int32)
        121 # If the data does not need to be modified, just copy the file.
        122 elif not within_int32:
    --> 123     shutil.copyfile(input_path, output_path)
        124     return
        125 else:
        126     # If dim of the data is 1, reshape it to n * 1 and save it to output_path.

    File /usr/lib/python3.11/shutil.py:236, in copyfile(src, dst, follow_symlinks)
        233 sys.audit("shutil.copyfile", src, dst)
        235 if _samefile(src, dst):
    --> 236     raise SameFileError("{!r} and {!r} are the same file".format(src, dst))
        238 file_size = 0
        239 for i, fn in enumerate([src, dst]):
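The last frames of the traceback can be reproduced without DGL. The sketch below is only an illustration under assumptions: it uses a temporary directory and a placeholder file name (`feat.npy`) in place of the real dataset, and it imitates the two `os.path.join` calls seen in the traceback rather than calling `preprocess_ondisk_dataset` itself.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as dataset_dir:
    # Placeholder input file; its absolute path plays the role of the
    # absolute "path" entry in the dataset metadata.
    src = os.path.join(dataset_dir, "feat.npy")
    with open(src, "wb") as f:
        f.write(b"placeholder")

    # Mimic the two joins from the traceback: the absolute path wins both
    # times, so the "preprocessed" prefix never makes it into the result.
    output_rel = os.path.join("preprocessed", src)
    dst = os.path.join(dataset_dir, output_rel)

    try:
        shutil.copyfile(src, dst)
        raised = False
    except shutil.SameFileError as err:
        raised = True
        print(f"SameFileError: {err}")
```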

Expected behavior

The preprocessing should start and create files in the "preprocessed" subfolder.

Environment

Additional context

N.B. I think the code would also fail if I used a path that doesn't start with "/", but with a different error: it would result in an invalid path.

robert-dsl commented 2 months ago

I just found a docstring saying that the file paths in the metadata YAML should be relative to the base dir. That should fix it. It might be nice to mention that in the tutorial/example.
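For anyone with absolute paths already written into their metadata, one way to convert them is sketched below. The helper `to_dataset_relative` is hypothetical and not part of DGL, and the example paths are made up. It uses `posixpath` so the sketch behaves the same on any OS; `os.path` would be the usual choice in real code.

```python
import posixpath


def to_dataset_relative(path: str, dataset_dir: str) -> str:
    """Return `path` relative to `dataset_dir` (hypothetical helper).

    Relative paths are returned unchanged. Absolute paths outside
    `dataset_dir` raise ValueError instead of producing a "../" path.
    """
    if not posixpath.isabs(path):
        return path
    rel = posixpath.relpath(path, start=dataset_dir)
    if rel == posixpath.pardir or rel.startswith(posixpath.pardir + "/"):
        raise ValueError(f"{path!r} is not inside {dataset_dir!r}")
    return rel


base_dir = "/Volumes/myvolume/mydataset"
print(to_dataset_relative("/Volumes/myvolume/mydataset/graph/feat.npy", base_dir))
# graph/feat.npy
print(to_dataset_relative("graph/feat.npy", base_dir))
# graph/feat.npy
```

The converted values would then go into the `path` entries of the metadata file before loading the dataset again.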