facebookresearch/video-long-term-feature-banks

Long-Term Feature Banks for Detailed Video Understanding

multi crop testing error #5

Closed zhujiagang closed 5 years ago

zhujiagang commented 5 years ago

Thanks for sharing your excellent work and code! I've run the code with the provided trained model (ava_r101_lfb_nl_3l) to evaluate on the AVA 2.1 validation set. For single-crop testing I get 25.1 mAP vs. your 26.9, perhaps because I extract frames from downsampled videos (short side 240). When I tried multi-crop testing by setting AVA.TEST_MULTI_CROP to True, as you suggested, I got the following errors.
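For context, here is a hedged sketch of what multi-crop testing appears to do, pieced together from the traceback and CSV names in this thread: tools/test_net.py evaluates one (flip, scale, spatial-shift) combination at a time through test_one_crop(), writing a per-crop CSV such as detections_final_224_shift0_0.850.csv. test_one_crop and its shift argument are real (both appear in the traceback below); the flip/scale keywords, the scale set, and the driver loop itself are assumptions for illustration only.

def test_one_crop(suffix, shift, flip, scale):
    # Stub standing in for the real function in tools/test_net.py.
    print('testing crop%s (flip=%s)' % (suffix, flip))

def test_multi_crop():
    for flip in (False, True):           # horizontal flip: 2 options
        for scale in (224, 256, 320):    # assumed short-side test scales: 3
            for shift in (0, 1, 2):      # spatial shift positions: 3
                suffix = '_%d_shift%d' % (scale, shift)
                test_one_crop(suffix=suffix, shift=shift, flip=flip, scale=scale)

test_multi_crop()  # 2 x 3 x 3 = 18 crops in total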

[INFO: test_net.py:  110]: Done ResetWorkspace...
[WARNING: test_net.py:  114]: Testing started...
[WARNING: cnn.py:   25]: [====DEPRECATE WARNING====]: you are creating an object from CNNModelHelper class which will be deprecated soon. Please use ModelHelper object with brew module. For more information, please refer to caffe2.ai and python/brew.py, python/brew_test.py for more information.
[INFO: ava.py:   98]: Finished loading annotations from
[INFO: ava.py:  100]:   data/ava/annotations/ava_val_predicted_boxes.csv
[INFO: ava.py:  101]: Number of unique boxes: 97148
[INFO: ava.py:  102]: Number of annotations: 0
[INFO: ava.py:  123]: 52391 keyframes used.
[INFO: ava.py:  285]: === AVA dataset summary ===
[INFO: ava.py:  286]: Split: val
[INFO: ava.py:  287]: Use LFB? True
[INFO: ava.py:  288]: Detection threshold: 0.85
[INFO: ava.py:  290]: Full evaluation? True
[INFO: ava.py:  291]: Spatial shift position: 1
[INFO: ava.py:  292]: Number of videos: 64
[INFO: ava.py:  295]: Number of frames: 1729931
[INFO: ava.py:  296]: Number of key frames: 52391
[INFO: ava.py:  297]: Number of boxes: 97148.
[INFO: ava_data_input.py:   51]: Creating the execution context for worker_ids: [100, 101, 102, 103], batch size: 2
[INFO: data_input_helper.py:  157]: CREATING EXECUTION CONTEXT
[INFO: data_input_helper.py:  164]: POOLS: {100: <multiprocessing.pool.Pool object at 0x7f402f54eb10>, 101: <multiprocessing.pool.Pool object at 0x7f40301ff850>, 102: <multiprocessing.pool.Pool object at 0x7f4023b8fa10>, 103: <multiprocessing.pool.Pool object at 0x7f402807f6d0>}
[INFO: data_input_helper.py:  165]: SHARED DATA LISTS: 4
[INFO: data_input_helper.py:  177]: worker_id: 100 list: 4
[INFO: data_input_helper.py:  179]: worker_id: 100 list keys: [100, 101, 102, 103]
[INFO: data_input_helper.py:  177]: worker_id: 101 list: 4
[INFO: data_input_helper.py:  179]: worker_id: 101 list keys: [100, 101, 102, 103]
[INFO: data_input_helper.py:  177]: worker_id: 102 list: 4
[INFO: data_input_helper.py:  179]: worker_id: 102 list keys: [100, 101, 102, 103]
Traceback (most recent call last):
  File "tools/test_net.py", line 204, in <module>
    main()
  File "tools/test_net.py", line 200, in main
    test_net()
  File "tools/test_net.py", line 86, in test_net
    shift=shift)
  File "tools/test_net.py", line 125, in test_one_crop
    test_model.build_model(lfb=lfb, suffix=suffix, shift=shift)
  File "/running_package/video-long-term-feature-banks/lib/models/model_builder_video.py", line 116, in build_model
    suffix=suffix,
  File "/running_package/video-long-term-feature-banks/lib/datasets/dataloader.py", line 131, in __init__
    self._create_data_input()
  File "/running_package/video-long-term-feature-banks/lib/datasets/dataloader.py", line 151, in _create_data_input
    self._context_execution(worker_ids)
  File "/running_package/video-long-term-feature-banks/lib/datasets/ava_data_input.py", line 55, in init
    num_processes, batch_size)
  File "/running_package/video-long-term-feature-banks/lib/datasets/data_input_helper.py", line 216, in _create_execution_context
    initargs=(shared_data_list,)
  File "/running_package/video-long-term-feature-banks/anaconda2/lib/python2.7/multiprocessing/__init__.py", line 232, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild)
  File "/running_package/video-long-term-feature-banks/anaconda2/lib/python2.7/multiprocessing/pool.py", line 161, in __init__
    self._repopulate_pool()
  File "/running_package/video-long-term-feature-banks/anaconda2/lib/python2.7/multiprocessing/pool.py", line 225, in _repopulate_pool
    w.start()
  File "/running_package/video-long-term-feature-banks/anaconda2/lib/python2.7/multiprocessing/process.py", line 130, in start
    self._popen = Popen(self)
  File "/running_package/video-long-term-feature-banks/anaconda2/lib/python2.7/multiprocessing/forking.py", line 121, in __init__
    self.pid = os.fork()
OSError: [Errno 11] Resource temporarily unavailable
[INFO: data_input_helper.py:  232]: Shutting down multiprocessing pools..
[INFO: data_input_helper.py:  234]: Shutting down pool 0
[INFO: data_input_helper.py:  234]: Shutting down pool 1
[INFO: data_input_helper.py:  234]: Shutting down pool 2
[INFO: data_input_helper.py:  234]: Shutting down pool 3
[INFO: data_input_helper.py:  241]: Pools closed
[INFO: data_input_helper.py:  232]: Shutting down multiprocessing pools..
[INFO: data_input_helper.py:  234]: Shutting down pool 0
[INFO: data_input_helper.py:  234]: Shutting down pool 1
[INFO: data_input_helper.py:  234]: Shutting down pool 2
[INFO: data_input_helper.py:  234]: Shutting down pool 3
[INFO: data_input_helper.py:  241]: Pools closed

The trained model reaches 23.6 mAP on the first crop (AVA results written to detections_final_224_shift0_0.850.csv). The error seems to occur once testing moves on to the second crop.
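For reference: [Errno 11] from os.fork() usually means the per-user process/thread limit (RLIMIT_NPROC, i.e. ulimit -u) has been exhausted, not that memory ran out. A stdlib-only diagnostic sketch for Linux, assuming that limit is the culprit:

import getpass
import resource
import subprocess

# Soft/hard per-user process limits as seen by this job.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print('RLIMIT_NPROC: soft=%d hard=%d' % (soft, hard))

# Count threads currently owned by this user; each one counts against the limit.
user = getpass.getuser()
out = subprocess.check_output(
    ['ps', '--no-headers', '-eL', '-o', 'user:32']).decode()
nthreads = sum(1 for line in out.splitlines() if line.strip() == user)
print('threads owned by %s: %d' % (user, nthreads))

If the thread count is already near the soft limit when the second crop's pools spawn, the fork failure is explained.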

chaoyuaw commented 5 years ago

Hi, thanks for your questions! Yes, I think using downsampled frames could affect performance. We train with the short side scaled up to 320 pixels and test with the short side at 256 pixels.

Does this always happen when testing the 2nd crop? It might be worth rerunning to see whether it's stochastic or transient. It might also be worth reducing the number of processes used (https://github.com/facebookresearch/video-long-term-feature-banks/blob/master/lib/datasets/dataloader.py#L75). Let me know how it goes.
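To make that suggestion concrete: the loader builds one multiprocessing.Pool per worker id (see the POOLS line in the log above), so the number of forked children scales with pools x processes. A minimal sketch of the idea, assuming the value at dataloader.py#L75 is what feeds the processes argument:

from multiprocessing import Pool

NUM_PROCESSES = 4  # lower this until os.fork() stops failing

# Worker ids copied from the log above; each gets its own pool, so the
# total child count is roughly len(worker_ids) * NUM_PROCESSES.
pools = {wid: Pool(processes=NUM_PROCESSES) for wid in [100, 101, 102, 103]}

for pool in pools.values():
    pool.close()
    pool.join()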

zhujiagang commented 5 years ago

Thank you for your suggestions. I am sure it is not stochastic. Following your advice, I reduced the number of processes to 1. The code now runs with no loss of speed until the 18th crop (2x3x3), where it fails again.

[INFO: checkpoints.py:  401]: Broadcasting gpu_0/lfb_nl2_out_b to
[INFO: checkpoints.py:  406]:  |-> gpu_1/lfb_nl2_out_b
[INFO: checkpoints.py:  406]:  |-> gpu_2/lfb_nl2_out_b
[INFO: checkpoints.py:  406]:  |-> gpu_3/lfb_nl2_out_b
[INFO: checkpoints.py:  406]:  |-> gpu_4/lfb_nl2_out_b
[INFO: checkpoints.py:  406]:  |-> gpu_5/lfb_nl2_out_b
[INFO: checkpoints.py:  406]:  |-> gpu_6/lfb_nl2_out_b
[INFO: checkpoints.py:  406]:  |-> gpu_7/lfb_nl2_out_b
[INFO: checkpoints.py:  401]: Broadcasting gpu_0/pred_w to
[INFO: checkpoints.py:  406]:  |-> gpu_1/pred_w
[INFO: checkpoints.py:  406]:  |-> gpu_2/pred_w
[INFO: checkpoints.py:  406]:  |-> gpu_3/pred_w
[INFO: checkpoints.py:  406]:  |-> gpu_4/pred_w
[INFO: checkpoints.py:  406]:  |-> gpu_5/pred_w
[INFO: checkpoints.py:  406]:  |-> gpu_6/pred_w
[INFO: checkpoints.py:  406]:  |-> gpu_7/pred_w
[INFO: checkpoints.py:  401]: Broadcasting gpu_0/pred_b to
[INFO: checkpoints.py:  406]:  |-> gpu_1/pred_b
[INFO: checkpoints.py:  406]:  |-> gpu_2/pred_b
[INFO: checkpoints.py:  406]:  |-> gpu_3/pred_b
[INFO: checkpoints.py:  406]:  |-> gpu_4/pred_b
[INFO: checkpoints.py:  406]:  |-> gpu_5/pred_b
[INFO: checkpoints.py:  406]:  |-> gpu_6/pred_b
[INFO: checkpoints.py:  406]:  |-> gpu_7/pred_b
[I net_async_base.h:205] Using specified CPU pool size: 32; device id: -1
[I net_async_base.h:210] Created new CPU pool, size: 32; device id: -1
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
/running_package/video-long-term-feature-banks/job.sh: line 46:  1250 Aborted                 (core dumped) ${CMD}

I can see the mAP for each of the first 17 crops. Perhaps the difference is because our workstations differ. I have now decided to use 2 scales (2x2x3 = 12 crops) to get the final result.
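For anyone hitting the same wall at crop 18: the std::system_error with "Resource temporarily unavailable" is likely the C++ side of the same exhaustion (thread creation failing while Caffe2 builds the 32-thread CPU pool shown in the log). Besides using fewer crops or processes, raising the soft per-user limit before launching can help. A stdlib sketch, assuming RLIMIT_NPROC is the binding limit:

import resource

# Raise the soft per-user process/thread limit to the hard ceiling; this is
# the Python equivalent of putting `ulimit -u` in job.sh, and it needs no
# root as long as we stay at or below the hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
resource.setrlimit(resource.RLIMIT_NPROC, (hard, hard))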

chaoyuaw commented 5 years ago

Glad that you found a workaround! Let me know if you have further questions.