facebookresearch / detectron2

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
https://detectron2.readthedocs.io/en/latest/
Apache License 2.0

Got LockException error during inference on large-scale datasets #4089

Open coldmanck opened 2 years ago

coldmanck commented 2 years ago

Main Issues

I've modified plain_train_net.py to run inference on my larger-scale dataset (around 2.6 million images) with pre-trained object detection models. When I run it on a smaller amount of data (say, 1,000 images), the code works well. However, when I ran it on my full dataset, it failed with portalocker.exceptions.LockException: [Errno 11] Resource temporarily unavailable. The full error message is as follows.

Traceback (most recent call last):
  File "plain_train_net.py", line 507, in <module>
    args=(params,),
  File "/home/tiger/anaconda/lib/python3.7/site-packages/detectron2/engine/launch.py", line 79, in launch
    daemon=False,
  File "/home/tiger/anaconda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/tiger/anaconda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/tiger/anaconda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 4 terminated with the following error:
Traceback (most recent call last):
  File "/home/tiger/anaconda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/tiger/anaconda/lib/python3.7/site-packages/detectron2/engine/launch.py", line 126, in _distributed_worker
    main_func(*args)
  File "/opt/tiger/my_project/detectron2/projects/STAD/plain_train_net.py", line 468, in main
    return do_test(cfg, model)
  File "/opt/tiger/my_project/detectron2/projects/STAD/plain_train_net.py", line 257, in do_test
    cfg, dataset_name, os.path.join(cfg.OUTPUT_DIR, "pred_person_proposals", f'cls_{cfg.TTLIVE_DATA.CLS}_{cfg.TTLIVE_DATA.PART}')
  File "/opt/tiger/my_project/detectron2/projects/STAD/plain_train_net.py", line 230, in get_evaluator
    evaluator_list.append(COCOPersonEvaluator(dataset_name, cfg, output_dir=output_folder))
  File "/opt/tiger/my_project/detectron2/projects/STAD/evaluation/evaluator.py", line 46, in __init__
    super().__init__(dataset_name, output_dir=output_dir, distributed=distributed)
  File "/home/tiger/anaconda/lib/python3.7/site-packages/detectron2/evaluation/coco_evaluation.py", line 130, in __init__
    convert_to_coco_json(dataset_name, cache_path)
  File "/home/tiger/anaconda/lib/python3.7/site-packages/detectron2/data/datasets/coco.py", line 462, in convert_to_coco_json
    with file_lock(output_file):
  File "/home/tiger/anaconda/lib/python3.7/site-packages/portalocker/utils.py", line 157, in __enter__
    return self.acquire()
  File "/home/tiger/anaconda/lib/python3.7/site-packages/portalocker/utils.py", line 272, in acquire
    raise exceptions.LockException(exception)
portalocker.exceptions.LockException: [Errno 11] Resource temporarily unavailable

This happened during the Trying to convert 'my_dataset' to COCO format ... step.

Basically, I do not understand what this error message means, especially why it shows up only for my (larger-scale) dataset. The error occurred around 60-70 minutes after the prompt Trying to convert 'my_dataset' to COCO format ... was logged on my screen. After some quick research, I suspect it is caused by the timeout=3600 set for portalocker.Lock() in Facebook's iopath: one of my processes held the lock on the JSON file (the .json.lock file) for more than one hour, so the others timed out. If that's the case, isn't this timeout an unreasonable choice? How should I fix my issue? Thank you very much!

P.S. My error looks almost the same as the one in this FAQ; however, I don't think that solution helps me, since I run the command from scratch, i.e., with no other experiments running and no leftover json.lock file.
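
For context, the lock in question comes from iopath; below is a simplified sketch of what its file_lock() helper does (paraphrased from memory, not the exact source), showing the hard-coded one-hour timeout that appears to be the culprit:

import os
import portalocker

def file_lock(path: str):
    # Make sure the parent directory exists; creation can race across workers.
    dirname = os.path.dirname(path)
    if dirname:
        os.makedirs(dirname, exist_ok=True)
    # The lock on "<path>.lock" is only waited on for 3600 seconds (1 hour);
    # if another process holds it longer, acquire() raises
    # portalocker.exceptions.LockException ("Resource temporarily unavailable").
    return portalocker.Lock(path + ".lock", timeout=3600)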

Instructions To Reproduce the Issue and Full Logs

All I did was 1) write a custom function get_my_dataset_dicts() in plain_train_net.py and 2) run it on 8 V100 GPUs to collect the inference results (bounding boxes & scores).
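
For completeness, here is a minimal sketch of how such a custom loader is typically registered with detectron2 before running plain_train_net.py; only get_my_dataset_dicts() comes from the report above, while the dataset name, class list, and the loader body are placeholders:

from detectron2.data import DatasetCatalog, MetadataCatalog

def get_my_dataset_dicts():
    # Return a list of dicts in detectron2's standard dataset format:
    # each dict has "file_name", "image_id", "height", "width", and an
    # "annotations" list (which can be empty for inference-only data).
    return []

DatasetCatalog.register("my_dataset", get_my_dataset_dicts)
MetadataCatalog.get("my_dataset").set(thing_classes=["person"])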

Your Environment

----------------------  -------------------------------------------------------------------
sys.platform            linux
Python                  3.7.6 (default, Jan  8 2020, 19:59:22) [GCC 7.3.0]
numpy                   1.21.5
detectron2              0.6 @/home/tiger/.local/lib/python3.7/site-packages/detectron2
Compiler                GCC 7.3
CUDA compiler           CUDA 10.2
detectron2 arch flags   3.7, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5
DETECTRON2_ENV_MODULE   <not set>
PyTorch                 1.8.0 @/home/tiger/anaconda/lib/python3.7/site-packages/torch
PyTorch debug build     False
GPU available           Yes
GPU 0,1,2,3,4,5,6,7     Tesla V100-SXM2-32GB (arch=7.0)
Driver version          418.116.00
CUDA_HOME               /usr/local/cuda
Pillow                  9.0.1
torchvision             0.9.0 @/home/tiger/anaconda/lib/python3.7/site-packages/torchvision
torchvision arch flags  3.5, 5.0, 6.0, 7.0, 7.5
fvcore                  0.1.5.post20220305
iopath                  0.1.9
cv2                     4.5.5
----------------------  -------------------------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70
  - CuDNN 7.6.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

Testing NCCL connectivity ... this should not hang.
...
...
...
NCCL succeeded.
github-actions[bot] commented 2 years ago

You've chosen to report an unexpected problem or bug. Unless you already know the root cause of it, please include details about it by filling the issue template. The following information is missing: "Instructions To Reproduce the Issue and Full Logs"; "Your Environment";

coldmanck commented 2 years ago

@ppwwyyxx I sincerely hope that you and the developers can have a look at my issue. Thank you :)

coldmanck commented 2 years ago

Follow-up: I solved my issue by implementing my own file_lock() function, which modifies this line by increasing the timeout to 60*60*48, i.e. 48 hours, since my dataset takes around 24 hours to process; see the sketch below. I think maybe we shouldn't set a timeout at all (or, for example, should set timeout=float('inf')).

But I'm still wondering whether this is the right way to do it.
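
A minimal sketch of such a workaround, assuming portalocker is available; the helper name file_lock_long and the monkey-patch target are illustrative (the target module name is taken from the traceback above), not part of detectron2's public API:

import os
import portalocker
import detectron2.data.datasets.coco as d2_coco

def file_lock_long(path: str, timeout: int = 60 * 60 * 48):
    # Same behaviour as iopath's file_lock, but with a much larger timeout
    # so that a long-running COCO-format conversion does not time out.
    dirname = os.path.dirname(path)
    if dirname:
        os.makedirs(dirname, exist_ok=True)
    return portalocker.Lock(path + ".lock", timeout=timeout)

# Patch the name used inside convert_to_coco_json() before evaluation starts.
d2_coco.file_lock = file_lock_long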

ppwwyyxx commented 2 years ago

You're right that the issue is related to the timeout. We didn't anticipate that dataset generation could take more than an hour.

It sounds like a good solution is to make the timeout of file_lock an argument, and call it with a larger timeout value (or infinite, if possible) from detectron2.
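
A compact sketch of that proposal on the iopath side (the signature and default shown here are purely illustrative, not an actual patch):

import os
import portalocker

def file_lock(path: str, timeout: int = 3600):
    # Same as today's helper, but the timeout is a caller-controlled argument
    # instead of a hard-coded constant.
    dirname = os.path.dirname(path)
    if dirname:
        os.makedirs(dirname, exist_ok=True)
    return portalocker.Lock(path + ".lock", timeout=timeout)

# detectron2 could then call, e.g.: file_lock(output_file, timeout=24 * 60 * 60)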

coldmanck commented 2 years ago

You're right that the issue is related to the timeout. We didn't anticipate that dataset generation could take more than an hour.

It sounds like a good solution is to make the timeout of file_lock an argument, and call it with a larger timeout value (or infinite, if possible) from detectron2.

Thanks for the reply. Yes, I do think the timeout=3600 value should be configurable.

Shreyz-max commented 2 years ago

If no one is working on it, can I make a PR @ppwwyyxx @coldmanck ?