Mintinson opened this issue 1 month ago
It looks like the annotation file you downloaded is broken; please try downloading it again.
Thanks for your answer!
I re-downloaded the dataset you placed on Google Drive and re-ran the script extract_occupancy_ann.py, and everything appeared to be fine. However, training still fails with the same error.
I noticed that the README under the data folder lists json files starting with embodiedscan_infos, while the data extracted from Google Drive starts with embodiedscan. Does this matter? Do I have to rename these files?
By the way, I would also like to know whether this warning is normal. If it is not, what should I do to get rid of it?
09/06 03:16:31 - mmengine - WARNING - Failed to search registry with scope "embodiedscan" in the "loop" registry tree. As a workaround, the current "loop" registry in "mmengine" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "embodiedscan" is a correct scope, or whether the registry is initialized.
09/06 03:16:31 - mmengine - WARNING - euler-depth is not a meta file, simply parsed as meta information
@Mintinson
Could you please provide the sample_idx of this scene?
Just replace

```python
occ_masks = mmengine.load(mask_filename)
```

with

```python
try:
    occ_masks = mmengine.load(mask_filename)
except Exception:
    # print which sample failed to load before raising
    print(info['sample_idx'])
    raise ValueError
```

This will help us localize the problem.
Here is the output:
scannet/scene0031_00
Traceback (most recent call last):
...
and here is the structure of the corresponding scene:
location: data/scannet/scans/scene0031_00/
scene0031_00
├── occupancy
│ ├── occupancy.npy
│ └── visible_occupancy.pkl
├── scene0031_00_2d-instance-filt.zip
├── scene0031_00_2d-instance.zip
├── scene0031_00_2d-label-filt.zip
├── scene0031_00_2d-label.zip
├── scene0031_00.aggregation.json
├── scene0031_00.sens
├── scene0031_00.txt
├── scene0031_00_vh_clean_2.0.010000.segs.json
├── scene0031_00_vh_clean_2.labels.ply
├── scene0031_00_vh_clean_2.ply
├── scene0031_00_vh_clean.aggregation.json
├── scene0031_00_vh_clean.ply
└── scene0031_00_vh_clean.segs.json
1 directory, 15 files
location: data/scannet/scans/posed_images/scene0031_00/
scene0031_00
├── 00000.jpg
├── 00000.png
├── 00000.txt
├── 00010.jpg
├── ...
├── 02750.txt
├── depth_intrinsic.txt
├── intrinsic.txt
location: data/embodiedscan_occupancy/scannet/scene0031_00/
scene0031_00
├── occupancy.npy
├── visible_occupancy.pkl
@Mintinson
Could you please check the sha256 hash values of visible_occupancy.pkl and occupancy.npy?
The hash of visible_occupancy.pkl is 405f14770ab2126e24282977d5f897d1b35569bfea3f60431d63351def49ef3a and the hash of occupancy.npy is da1b32fd3753626401446669f6df3edd3530783e784a5edee01e56c78eb6b5d1.
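For reference, here is a minimal sketch of how to compute those hashes with Python; the file paths are taken from the directory listing above and the helper name is just for illustration.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Compute the sha256 hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

# paths assumed from the scene layout shown earlier in this thread
for name in ('visible_occupancy.pkl', 'occupancy.npy'):
    print(name, sha256_of(f'data/scannet/scans/scene0031_00/occupancy/{name}'))
```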
Thank you so much for your help! I checked the hash value of visible_occupancy.pkl and found that it was indeed different from the hash of the visible_occupancy.pkl inside embodiedscan_occupancy. I deleted the occupancy folder in the raw data and ran the script again:

```bash
python embodiedscan/converter/extract_occupancy_ann.py --src data/embodiedscan_occupancy --dst data
```

This time the file has the correct hash value! I'm not sure what went wrong the first time I extracted these annotations, but now train.py runs without reporting errors!
I would like to ask how much memory this project needs to run. When I run train.py, it gets killed because it runs out of memory.
The memory problem is caused by the design of the mmengine dataloader, which copies the annotation files num_gpu * num_workers times. We are trying to fix this problem.
For a quick solution, please see #29 for details.
I tried the above solution, but it didn't work. I am wondering whether 125 GB of RAM is enough. If I need more, I would like to know soon so that I can arrange to replace my server earlier.
It usually takes ~140 GB of RAM on my server. Maybe you can try setting fewer dataloader workers in the config, for example as sketched below?
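A minimal sketch of such an override, assuming you train with one of the provided detection configs (the base file name and the worker count here are only illustrative):

```python
# Derived config: mmengine merges this dict into the base config,
# so only num_workers is changed. Fewer workers means fewer in-memory
# copies of the annotation file (one copy per worker process).
_base_ = ['./mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py']

train_dataloader = dict(num_workers=2)
```

Alternatively, you can edit num_workers of train_dataloader directly in the base config.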
I will try that. Thank you for your timely help~
I would like to ask why this project takes up so much RAM. All the projects I have worked with before used less than 30 GB of memory when loading data, so why does this one reach several hundred? Also, what are the GPU memory requirements for this project, so that I can allocate hardware resources in time?
I apologize for the RAM problem. We are working on fixing it.
For GPU memory, the default setting of the EmbodiedScan detection task, e.g. mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py, requires ~20 GB of GPU memory. It can be further reduced by decreasing the batch size (see the sketch after this comment).
PS: The default setting uses ~600 GB of RAM in total. I'm sorry for the previous incorrect response.
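As a rough sketch of reducing the batch size (same override pattern as above; the value 2 is only an example):

```python
# Illustrative override: the default config name encodes 8 GPUs with a
# per-GPU batch size of 4 ("8xb4"), so batch_size=2 roughly halves the
# per-GPU activation memory.
_base_ = ['./mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py']

train_dataloader = dict(batch_size=2)
```

If you change the total batch size, the learning rate may also need to be adjusted to keep results comparable.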
Prerequisite
Task
I'm using the official example scripts/configs for the officially supported tasks/models/datasets.
Branch
main branch https://github.com/open-mmlab/mmdetection3d
Environment
```
System environment:
    sys.platform: linux
    Python: 3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0]
    CUDA available: True
    MUSA available: False
    numpy_random_seed: 793778121
    GPU 0: NVIDIA A100-PCIE-40GB
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 11.3, V11.3.58
    GCC: gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0
    PyTorch: 1.11.0
    PyTorch compiling details: PyTorch built with:
      Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
    TorchVision: 0.12.0
    OpenCV: 4.10.0
    MMEngine: 0.10.4

Runtime environment:
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: 793778121
    Distributed launcher: none
    Distributed training: False
    GPU number: 1
```
Reproduces the problem - code sample
Reproduces the problem - command or script
Reproduces the problem - error message
Additional information
No response