Hi, I think the issue is that the worker processes are using more memory than your machine can handle (h5py cache).
Due to the large number of pre-extracted features, I recommend trying the following:
1) Use a machine with more memory for training.
2) Reduce the number of GPUs used for training and reduce the batch size. h5py's cache appears to be independent in each process, so too many processes result in too much memory overhead.
3) Reduce the size of the feature file. The depth feature file still seems to have room for compression, e.g. each depth feature map could be downsampled to 7x7 (see the sketch after this list).
4) There might be a way to limit or clear the h5py cache during training. I haven't looked into it carefully, but I suggest giving it a try (also touched on in the sketch below).
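For what it's worth, here is a minimal sketch of points 3) and 4). The file names, the flat key-to-array layout, and the (C, H, W) feature shape are assumptions for illustration, not the repo's actual format:

```python
# Sketch: cap h5py's per-process chunk cache (point 4) and write a
# downsampled 7x7 copy of the depth features (point 3).
# Assumes depth.hdf5 maps top-level keys directly to (C, H, W) float arrays.
import h5py
import torch
import torch.nn.functional as F

# h5py's raw-data chunk cache defaults to 1 MiB per open file, and every
# worker process opens its own handle; rdcc_nbytes/rdcc_nslots let you
# set an explicit bound on how much each handle may cache.
src = h5py.File("depth.hdf5", "r", rdcc_nbytes=1024 * 1024, rdcc_nslots=521)

with h5py.File("depth_7x7.hdf5", "w") as dst:
    for key in src:
        feat = torch.from_numpy(src[key][...]).float()   # (C, H, W)
        small = F.adaptive_avg_pool2d(feat, (7, 7))      # (C, 7, 7)
        dst.create_dataset(key, data=small.numpy(), compression="gzip")
src.close()
```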
If you have any other questions, please feel free to ask.
Thank you for your reply!
I have set the batch size to one, as follows:
"train_batch_size": 1,
"val_batch_size": 1,
I have also reduced the number of GPUs to one, as follows:
NUM_GPUS=1
But it isn't working; it raises the same error. I run the code in Docker, and the memory usage is as follows:
So it doesn't seem to be a memory issue. My CUDA version is 9.2, which is different from your CUDA 11.1. Is it possible that this is the cause of the problem? But CUDA 9.2 is a requirement of the Matterport simulator, so how did you install it?
You can try increasing the --shm-size setting of the Docker container to make sure the worker processes have enough shared-memory space, e.g. something like docker run --shm-size=16g ... (pick a size that fits your machine).
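If it helps, here is a quick way to check the shared-memory allocation from inside the container; on Linux, Docker mounts it at /dev/shm, and the default is only 64 MiB:

```python
# Check the shared-memory allocation that --shm-size controls.
# PyTorch DataLoader workers pass tensors through /dev/shm, so a small
# allocation can crash them even when overall RAM looks fine.
import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total: {total / 2**30:.1f} GiB, free: {free / 2**30:.1f} GiB")
```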
Hi! Have you solved it yet? I have the same problem too.
Hi Zihan, sorry to bother you.
Thank you for your great work. I'm very interested in your GridMM project. However, I encountered a bug when attempting to run the pretraining code with "bash run_r2r.sh" (in the pretrain_src directory), as follows:
The bug seems to be related to h5py failing to read the depth data from "depth.hdf5" (self.DepthDB = DepthFeaturesDB(os.path.join(semantic_map_dir, "depth.hdf5"))). But there doesn't seem to be anything wrong with the data itself, because when I try to read the entries one by one, as follows:
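(What follows is a representative sketch rather than the exact snippet, which wasn't preserved in this thread; the path and the flat key-to-dataset layout are placeholders.)

```python
# Hypothetical check: force a full read of every dataset in depth.hdf5.
import os
import h5py

path = os.path.join("semantic_map", "depth.hdf5")  # substitute your semantic_map_dir
with h5py.File(path, "r") as f:
    for key in f:
        data = f[key][...]  # read the whole dataset into memory
        print(key, data.shape, data.dtype)
```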
nothing goes wrong, which is very confusing to me. Could you kindly help me resolve this bug? Thank you very much!