h5py error: OSError: Can't read data (inflate() failed)

Cyuanwen commented 7 months ago

Hi Mr Zihan, Sorry to bother you.

Thank you for your great work. I'm very interested in your GridMM project. However, I encountered a bug when attempting to run the Pretraining code "bash run_r2r.sh" (in the pretrain_src directory) as follows:

    return _run_code(code, main_globals, None,
  File "/root/anaconda3/envs/GridMM/lib/python3.8/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
    cli.main()
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
    run()
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 284, in run_file
    runpy.run_path(target, run_name="__main__")
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
    exec(code, run_globals)
  File "train_r2r.py", line 469, in <module>
    main(args)
  File "train_r2r.py", line 234, in main
    for step, (name, batch) in enumerate(meta_loader):
  File "/root/nav_alg/GridMM-main/pretrain_src/data/loader.py", line 100, in __iter__
    self.preload(loader_it)
  File "/root/nav_alg/GridMM-main/pretrain_src/data/loader.py", line 111, in preload
    self.batch = next(it)
  File "/root/nav_alg/GridMM-main/pretrain_src/data/loader.py", line 64, in __iter__
    batch = next(iter_)
  File "/root/anaconda3/envs/GridMM/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/root/anaconda3/envs/GridMM/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
    return self._process_data(data)
  File "/root/anaconda3/envs/GridMM/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
    data.reraise()
  File "/root/anaconda3/envs/GridMM/lib/python3.8/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
OSError: Caught OSError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/root/anaconda3/envs/GridMM/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/root/anaconda3/envs/GridMM/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/anaconda3/envs/GridMM/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/nav_alg/GridMM-main/pretrain_src/data/tasks.py", line 186, in __getitem__
    inputs = self.nav_db.get_input(idx, end_vp_type, return_img_probs=True)
  File "/root/nav_alg/GridMM-main/pretrain_src/data/dataset.py", line 717, in get_input
    last_vp_angles,grid_fts,grid_map,gridmap_pos_fts, target_patch_id = self.get_traj_pano_fts(scan, gt_path)
  File "/root/nav_alg/GridMM-main/pretrain_src/data/dataset.py", line 793, in get_traj_pano_fts
    self.global_semantic,self.global_position_x,self.global_position_y,self.global_mask,self.global_map,self.max_x,self.min_x,self.max_y,self.min_y,gridmap_pos_fts, target_patch_id = self.getGlobalMap(scan, vp)
  File "/root/nav_alg/GridMM-main/pretrain_src/data/dataset.py", line 357, in getGlobalMap
    depth = self.DepthDB.get_image_feature(scan_id,viewpoint_id)
  File "/root/nav_alg/GridMM-main/pretrain_src/data/dataset.py", line 66, in get_image_feature
    ft = self._feature_store[key][...][:].astype(np.uint16)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/root/anaconda3/envs/GridMM/lib/python3.8/site-packages/h5py/_hl/dataset.py", line 573, in __getitem__
    self.id.read(mspace, fspace, arr, mtype, dxpl=self._dxpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 182, in h5py.h5d.DatasetID.read
  File "h5py/_proxy.pyx", line 130, in h5py._proxy.dset_rw
  File "h5py/_proxy.pyx", line 84, in h5py._proxy.H5PY_H5Dread
OSError: Can't read data (inflate() failed)

The bug seems to be related to h5py's inability to read the depth data from "depth.hdf5" (self.DepthDB = DepthFeaturesDB(os.path.join(semantic_map_dir,"depth.hdf5"))). But there doesn't seem to be anything wrong with the data, as when I try read the data one by one as follows:

semantic_map_dir = "../datasets/R2R/features"
feature_store = h5py.File(os.path.join(semantic_map_dir,"depth.hdf5"), 'r')
for key in tqdm.tqdm(feature_store.keys()):
    ft = feature_store[key][...][:].astype(np.uint16)

there's nothing wrong, which is very confusing to me. Could you kindly assist me in resolving this bug? Thank you very much!

MrZihan commented 7 months ago

Hi, I think the issue is that the working process is using much more memory than your machine can handle (h5py cache).

Due to the large number of pre-extracted features, it is recommended to solve this issue following: 1) Use a machine with a larger capacity memory for training. 2) Reduce the number of GPUs used for training and reduce the batch size, as h5py's cache in different processes seems independent. Too many processes result in too much memory overhead. 3) Reduce the size of the feature file, the depth feature file seems to still have room for compression, e.g. it can be downsized to a 7x7 size for each depth image. 4) Might have some way to clear the h5py cache during training? I haven't looked into it carefully and suggest giving it a try.

If you have any other questions, please feel free to ask.

Cyuanwen commented 7 months ago

Thank you for your reply!

I have set the batch size to one, as follows:

  "train_batch_size": 1,
  "val_batch_size": 1,

I have reduced the number of GPUs to one also, as follows:

NUM_GPUS=1

But it isn't working. It raises the same error. I run the code in the docker and the memory usage is as follows :

It seems not a memory issue. My Cuda version is 9.2, which is different from your cuda11.1. Is it possible that this is the cause of the problem? But cuda9.2 is the requirement of Matterport simulator , how did you install this simulator?

MrZihan commented 7 months ago

You can try changing the --shm-size setting of the docker container to make sure the working processes have enough cache space.

gqsmmz commented 1 week ago

Thank you for your reply!

I have set the batch size to one, as follows:
  "train_batch_size": 1,
  "val_batch_size": 1,
I have reduced the number of GPUs to one also, as follows:
NUM_GPUS=1
But it isn't working. It raises the same error. I run the code in the docker and the memory usage is as follows :

It seems not a memory issue. My Cuda version is 9.2, which is different from your cuda11.1. Is it possible that this is the cause of the problem? But cuda9.2 is the requirement of Matterport simulator , how did you install this simulator?

hi~Have you solved it yet? I have the same problem too。

MrZihan / GridMM

h5py error: OSError: Can't read data (inflate() failed) #9