OpenRobotLab / EmbodiedScan

[CVPR 2024] EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI
https://tai-wang.github.io/embodiedscan/
Apache License 2.0

The testing process is frozen #46

Closed by SeaBird-Go 2 months ago

SeaBird-Go commented 3 months ago

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmdetection3d

Environment

------------------------------------------------------------                                                                                                               
System environment:                                                                                                                                                        
    sys.platform: linux                                                                                                                                                    
    Python: 3.8.16 (default, Mar  2 2023, 03:21:46) [GCC 11.2.0]                                                                                                           
    CUDA available: True                                                                                                                                                   
    MUSA available: False                                                                                                                                                  
    numpy_random_seed: 1731274824                                                                                                                                          
    GPU 0,1,2,3: Tesla V100S-PCIE-32GB                                                                                                                                     
    CUDA_HOME: /usr/local/cuda-11.0                                                                                                                                        
    NVCC: Cuda compilation tools, release 11.0, V11.0.221                                                                                                                  
    GCC: gcc (Ubuntu 7.5.0-6ubuntu2) 7.5.0                                                                                                                                 
    PyTorch: 1.11.0                                                                                                                                                        
    PyTorch compiling details: PyTorch built with:                                                                                                                         
  - GCC 7.3                                                                                                                                                                
  - C++ Version: 201402                                                                                                                                                    
  - Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
    TorchVision: 0.12.0
    OpenCV: 4.7.0
    MMEngine: 0.10.4

Runtime environment:
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: 1731274824
    Distributed launcher: pytorch
    Distributed training: True
    GPU number: 2

Reproduces the problem - code sample

bash tools/dist_test.sh configs/grounding/mv-grounding_8xb12_embodiedscan-vg-9dof.py ckpts/mv-grounding.pth 2

Reproduces the problem - command or script

Running the provided command python tools/test.py configs/detection/mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py work_dirs/mv-3ddet/epoch_12.pth --launcher="pytorch" raises an error.

So I use the following command to run testing:

bash tools/dist_test.sh configs/grounding/mv-grounding_8xb12_embodiedscan-vg-9dof.py ckpts/mv-grounding.pth 2

Reproduces the problem - error message

It always appears to be stuck, with the following output:

[screenshot of the stalled testing output]

Additional information

I want to test the official provided baseline results.

Tai-Wang commented 3 months ago

That's strange. Do you use slurm for distributed training and testing? That script is intended for slurm-based testing. We may also need to wait for feedback from other community members.

SeaBird-Go commented 3 months ago

Thanks for your reply. I think I have found the cause: the run is simply so slow that it appears to be stuck. When I changed the logging interval to 1, I noticed that data_time is almost 40 seconds. Why is it so slow to load the dataset?

I have also identified the main time-consuming part: the MultiViewPipeline, which alone takes about 30 s.

The batch size is 12, and the num_workers is 8.
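For anyone reproducing this diagnosis: the logging interval can be lowered in the MMEngine config so that per-iteration data_time is printed immediately. A minimal sketch using MMEngine's default LoggerHook (the rest of the config is unchanged):

```python
# Fragment of an MMEngine config: log every iteration instead of the
# default interval, so the per-iteration data_time becomes visible.
default_hooks = dict(
    logger=dict(type='LoggerHook', interval=1),
)
```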

mxh1999 commented 3 months ago

@SeaBird-Go In MultiViewPipeline, depth images will be converted to a 3D point cloud in global coordinates, which consumes a lot of time. To speed up this process, you can calculate the point cloud corresponding to each depth image in advance and store it in a file.
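A minimal sketch of that pre-computation, assuming per-frame depth maps, camera intrinsics, and camera-to-global poses are available (the function and file names here are illustrative, not part of the EmbodiedScan API):

```python
import numpy as np

def depth_to_global_points(depth, intrinsic, cam2global, depth_scale=1000.0):
    """Back-project a depth image into a point cloud in global coordinates.

    depth:      (H, W) depth image, e.g. uint16 millimeters.
    intrinsic:  (3, 3) camera intrinsic matrix.
    cam2global: (4, 4) camera-to-global transform.
    """
    h, w = depth.shape
    fx, fy = intrinsic[0, 0], intrinsic[1, 1]
    cx, cy = intrinsic[0, 2], intrinsic[1, 2]
    # Pixel grid: v is the row index, u the column index.
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    z = depth.astype(np.float64) / depth_scale
    valid = z > 0                      # drop pixels with no depth reading
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts_cam = np.stack(
        [x[valid], y[valid], z[valid], np.ones(valid.sum())], axis=-1)
    # Homogeneous transform into global coordinates.
    pts_global = (cam2global @ pts_cam.T).T[:, :3]
    return pts_global

# Example with synthetic data: a 4x4 depth frame at 1 m, identity pose.
depth = np.full((4, 4), 1000, dtype=np.uint16)
K = np.array([[500.0, 0.0, 2.0], [0.0, 500.0, 2.0], [0.0, 0.0, 1.0]])
pose = np.eye(4)
pts = depth_to_global_points(depth, K, pose)
np.save('frame_000_points.npy', pts)   # cache to disk for reuse at test time
```

The cached .npy files can then be loaded directly in the data pipeline instead of re-projecting every depth image on each pass.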

SeaBird-Go commented 3 months ago

Got it, thanks a lot. By the way, how long did it take you to run evaluation on the test set and submit a baseline result to the challenge benchmark?

On my side, evaluation on the test set would take about 4 days, which is much longer than the time in your provided log files. I also want to know whether I can use fewer images to speed up testing; the default is 50.

Tai-Wang commented 2 months ago

It typically takes several hours, so your case is abnormal. I am not sure whether it is related to your machine (e.g., the CPU and GPU configuration, or other programs occupying the related resources). For the data_time, you can refer to issue #39; your case is obviously much slower.

For the number of images used for inference, you can refer to Fig. 7-(b) in the appendix of our paper. Detection performance increases significantly until more than 40 images are used, after which it saturates. I am unsure about the curve for the multi-view grounding experiments, but you can run a similar ablation. I would also recommend reducing the number of points to shorten the data time for the Matterport3D part.
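As a sketch of such an ablation, the per-scene image count could be overridden in the test config. The key name `n_images` is an assumption here and should be checked against the released grounding configs before use:

```python
# Fragment of a test config: sample fewer views per scene at inference.
# `n_images` is assumed to be MultiViewPipeline's image-count argument;
# verify the exact key against the released config before relying on it.
test_pipeline = [
    dict(
        type='MultiViewPipeline',
        n_images=20,        # down from the default of 50
        transforms=[...]),  # keep the per-view transforms from the base config
    # ... remaining test-time transforms unchanged ...
]
```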