OpenRobotLab / EmbodiedScan

[CVPR 2024 & NeurIPS 2024] EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI
https://tai-wang.github.io/embodiedscan/
Apache License 2.0

[Bug] Error occurs while running the train.py in the tools: _pickle.UnpicklingError: pickle data was truncated #71

Open Mintinson opened 1 month ago

Mintinson commented 1 month ago

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmdetection3d

Environment


System environment: sys.platform: linux Python: 3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0] CUDA available: True MUSA available: False numpy_random_seed: 793778121 GPU 0: NVIDIA A100-PCIE-40GB CUDA_HOME: /usr/local/cuda NVCC: Cuda compilation tools, release 11.3, V11.3.58 GCC: gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0 PyTorch: 1.11.0 PyTorch compiling details: PyTorch built with:

Runtime environment: cudnn_benchmark: False mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0} dist_cfg: {'backend': 'nccl'} seed: 793778121 Distributed launcher: none Distributed training: False GPU number: 1

Reproduces the problem - code sample

python tools/train.py configs/detection/mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py --work-dir=work_dirs/mv-3ddet

Reproduces the problem - command or script

python tools/train.py configs/detection/mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py --work-dir=work_dirs/mv-3ddet

Reproduces the problem - error message

09/06 03:16:31 - mmengine - WARNING - Failed to search registry with scope "embodiedscan" in the "loop" registry tree. As a workaround, the current "loop" registry in "mmengine" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "embodiedscan" is a correct scope, or whether the registry is initialized.
09/06 03:16:31 - mmengine - WARNING - euler-depth is not a meta file, simply parsed as meta information
Traceback (most recent call last):
  File "tools/train.py", line 133, in <module>
    main()
  File "tools/train.py", line 129, in main
    runner.train()
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1728, in train
    self._train_loop = self.build_train_loop(
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1520, in build_train_loop
    loop = LOOPS.build(
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/runner/loops.py", line 44, in __init__
    super().__init__(runner, dataloader)
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/runner/base_loop.py", line 26, in __init__
    self.dataloader = runner.build_dataloader(
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1370, in build_dataloader
    dataset = DATASETS.build(dataset_cfg)
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/dataset/dataset_wrapper.py", line 223, in __init__
    self.dataset = DATASETS.build(dataset)
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/root/wwf/EmbodiedScan/embodiedscan/datasets/embodiedscan_dataset.py", line 59, in __init__
    super().__init__(ann_file=ann_file,
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/dataset/base_dataset.py", line 247, in __init__
    self.full_init()
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/dataset/base_dataset.py", line 298, in full_init
    self.data_list = self.load_data_list()
  File "/root/wwf/EmbodiedScan/embodiedscan/datasets/embodiedscan_dataset.py", line 342, in load_data_list
    data_info = self.parse_data_info(raw_data_info)
  File "/root/wwf/EmbodiedScan/embodiedscan/datasets/embodiedscan_dataset.py", line 147, in parse_data_info
    info['ann_info'] = self.parse_ann_info(info)
  File "/root/wwf/EmbodiedScan/embodiedscan/datasets/embodiedscan_dataset.py", line 238, in parse_ann_info
    occ_masks = mmengine.load(mask_filename)
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/fileio/io.py", line 856, in load
    obj = handler.load_from_fileobj(f, **kwargs)
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/fileio/handlers/pickle_handler.py", line 12, in load_from_fileobj
    return pickle.load(file, **kwargs)
_pickle.UnpicklingError: pickle data was truncated

Additional information

No response

mxh1999 commented 1 month ago

It looks like the annotation file you downloaded is broken; try downloading it again.
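One way to find broken downloads up front is to try unpickling every `.pkl` file under the data directory before training. This is a hypothetical helper, not part of the EmbodiedScan codebase; the `data/embodiedscan_occupancy` path is the one used later in this thread:

```python
# Hypothetical check: walk a directory and try to unpickle every .pkl file,
# collecting the ones that fail to load (e.g. truncated downloads).
import pickle
from pathlib import Path


def find_broken_pickles(root):
    """Return (path, error) pairs for .pkl files under `root` that fail to unpickle."""
    broken = []
    for pkl in Path(root).rglob("*.pkl"):
        try:
            with open(pkl, "rb") as f:
                pickle.load(f)
        except Exception as err:  # truncated files raise UnpicklingError/EOFError
            broken.append((str(pkl), repr(err)))
    return broken


if __name__ == "__main__":
    for path, err in find_broken_pickles("data/embodiedscan_occupancy"):
        print(path, err)
```

Running this once after extraction would have pointed directly at the truncated `visible_occupancy.pkl`.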

Mintinson commented 1 month ago

Thanks for your answer!

I re-downloaded the dataset you placed on Google Drive and re-ran the script extract_occupancy_ann.py, which reported that everything was fine. But training still fails with the same error.

I noticed that the README under the data folder lists json files whose names start with embodiedscan_infos, while the files extracted from Google Drive start with embodiedscan. Does this matter? Do I have to rename these files?

By the way, I would also like to know whether these warnings are normal. If not, what should I do to get rid of them?

09/06 03:16:31 - mmengine - WARNING - Failed to search registry with scope "embodiedscan" in the "loop" registry tree. As a workaround, the current "loop" registry in "mmengine" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "embodiedscan" is a correct scope, or whether the registry is initialized.
09/06 03:16:31 - mmengine - WARNING - euler-depth is not a meta file, simply parsed as meta information
mxh1999 commented 1 month ago

@Mintinson Could you please provide the sample_idx of this scene? Just replace

occ_masks = mmengine.load(mask_filename)

with

try:
    occ_masks = mmengine.load(mask_filename)
except Exception:
    print(info['sample_idx'])
    raise

This will help us localize the problem.

Mintinson commented 1 month ago

Here is the output:

scannet/scene0031_00
Traceback (most recent call last):
 ...

and here is the structure of the corresponding scene:

location: data/scannet/scans/scene0031_00/

scene0031_00
├── occupancy
│   ├── occupancy.npy
│   └── visible_occupancy.pkl
├── scene0031_00_2d-instance-filt.zip
├── scene0031_00_2d-instance.zip
├── scene0031_00_2d-label-filt.zip
├── scene0031_00_2d-label.zip
├── scene0031_00.aggregation.json
├── scene0031_00.sens
├── scene0031_00.txt
├── scene0031_00_vh_clean_2.0.010000.segs.json
├── scene0031_00_vh_clean_2.labels.ply
├── scene0031_00_vh_clean_2.ply
├── scene0031_00_vh_clean.aggregation.json
├── scene0031_00_vh_clean.ply
└── scene0031_00_vh_clean.segs.json

1 directory, 15 files

location: data/scannet/scans/posed_images/scene0031_00/

scene0031_00
├── 00000.jpg
├── 00000.png
├── 00000.txt
├── 00010.jpg
├── ...
├── 02750.txt
├── depth_intrinsic.txt
├── intrinsic.txt

location: data/embodiedscan_occupancy/scannet/scene0031_00/

scene0031_00
├── occupancy.npy
├── visible_occupancy.pkl
mxh1999 commented 1 month ago

@Mintinson Could you please check the sha256 hash values of visible_occupancy.pkl and occupancy.npy? The hash of visible_occupancy.pkl should be 405f14770ab2126e24282977d5f897d1b35569bfea3f60431d63351def49ef3a and the hash of occupancy.npy should be da1b32fd3753626401446669f6df3edd3530783e784a5edee01e56c78eb6b5d1.
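For reference, the hashes can be computed with Python's standard library; this is a generic sketch (any path passed in is up to the caller), reading the file in chunks so large occupancy annotations don't have to fit in memory:

```python
# Compute the SHA-256 hex digest of a file, reading in 1 MiB chunks.
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


# Example, using the path from this thread:
# sha256_of("data/scannet/scans/scene0031_00/occupancy/visible_occupancy.pkl")
```

Equivalently, `sha256sum <file>` on the command line gives the same digest.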

Mintinson commented 1 month ago

Thank you so much for your help! I checked the hash of visible_occupancy.pkl and found that it indeed differed from the hash of the visible_occupancy.pkl inside embodiedscan_occupancy, so I deleted the occupancy folder in the raw data and ran the script again:

python embodiedscan/converter/extract_occupancy_ann.py --src data/embodiedscan_occupancy --dst data

This time the file has the correct hash value! I'm not sure what went wrong the first time I extracted these annotations, but now train.py runs without reporting errors!

I would also like to ask how much memory this project needs to run; when I run train.py it gets killed because it runs out of memory.

mxh1999 commented 1 month ago

The memory problem is caused by the design of the mmengine dataloader, which copies the annotation files num_gpu * num_workers times. We are trying to fix this problem.

For a quick workaround, see #29 for details.

Mintinson commented 1 month ago

I tried the above solution but it didn't work. I am wondering whether 125 GB of RAM is enough? If I need more, I'd like to know so that I can arrange to switch servers early.

mxh1999 commented 1 month ago

It usually costs ~140 GB of RAM on my server. Maybe you can try setting fewer dataloader workers in the config?
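Since each worker process holds its own copy of the annotations, reducing the worker count directly reduces peak RAM. A sketch of what that override might look like in the config; the field names follow the usual mmengine dataloader config layout, and the exact values in your config file may differ:

```python
# Hypothetical config override: fewer workers means fewer in-memory
# copies of the annotation files, at the cost of slower data loading.
train_dataloader = dict(
    batch_size=4,
    num_workers=1,  # reduced from the default to cut RAM usage
    persistent_workers=True,
)
```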

Mintinson commented 1 month ago

I will try that. Thank you for your timely help~

Mintinson commented 1 month ago

I would like to ask why this project uses so much RAM; every project I have worked on before used less than 30 GB when loading data, so why does this one reach hundreds? Also, what are the GPU memory requirements for this project, so that I can allocate hardware resources in time?

mxh1999 commented 1 month ago

I apologize for the RAM memory problem; we are working on fixing it. For GPU memory, the default setting of the EmbodiedScan detection task, e.g. mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py, requires ~20 GB of GPU memory. It can be further reduced by decreasing the batch size.

PS: The default setting uses ~600 GB of RAM in total. I'm sorry for the previous incorrect response.