facebookresearch / Detectron

FAIR's research platform for object detection research, implementing popular algorithms like Mask R-CNN and RetinaNet.
Apache License 2.0
26.27k stars 5.46k forks

Testing fails with Inception_Resnetv2 CONV_BODY after successful training #457

Open srikanth-kilaru opened 6 years ago

srikanth-kilaru commented 6 years ago

After training the model, I get the following errors during testing.

Command:
python2 tools/test_net.py --cfg configs/getting_started/ml349_2gpu_e2e_faster_rcnn_Inception_ResNetv2.yaml --multi-gpu-testing TEST.WEIGHTS /tmp/detectron-output/train/coco_2014_train/generalized_rcnn/model_final.pkl NUM_GPUS 2

Errors

==================
INFO net.py: 88: conv1-7x7_s2_w not found
INFO net.py: 88: conv1-7x7_s2_b not found
INFO net.py: 88: conv2-3x3_reduce_w not found
INFO net.py: 88: conv2-3x3_reduce_b not found
INFO net.py: 88: conv2-3x3_w not found
INFO net.py: 88: conv2-3x3_b not found
INFO net.py: 88: inception_3a-1x1_w not found
INFO net.py: 88: inception_3a-1x1_b not found
INFO net.py: 88: inception_3a-3x3_reduce_w not found
INFO net.py: 88: inception_3a-3x3_reduce_b not found
INFO net.py: 88: inception_3a-3x3_w not found
INFO net.py: 88: inception_3a-3x3_b not found
INFO net.py: 88: inception_3a-5x5_reduce_w not found
INFO net.py: 88: inception_3a-5x5_reduce_b not found
INFO net.py: 88: inception_3a-5x5_w not found
INFO net.py: 88: inception_3a-5x5_b not found
INFO net.py: 88: inception_3a-pool_proj_w not found
INFO net.py: 88: inception_3a-pool_proj_b not found
INFO net.py: 88: inception_3b-1x1_w not found
INFO net.py: 88: inception_3b-1x1_b not found
INFO net.py: 88: inception_3b-3x3_reduce_w not found
INFO net.py: 88: inception_3b-3x3_reduce_b not found
INFO net.py: 88: inception_3b-3x3_w not found
INFO net.py: 88: inception_3b-3x3_b not found
INFO net.py: 88: inception_3b-5x5_reduce_w not found
INFO net.py: 88: inception_3b-5x5_reduce_b not found
INFO net.py: 88: inception_3b-5x5_w not found
INFO net.py: 88: inception_3b-5x5_b not found
INFO net.py: 88: inception_3b-pool_proj_w not found
INFO net.py: 88: inception_3b-pool_proj_b not found
INFO net.py: 88: inception_4a-1x1_w not found
INFO net.py: 88: inception_4a-1x1_b not found
INFO net.py: 88: inception_4a-3x3_reduce_w not found
INFO net.py: 88: inception_4a-3x3_reduce_b not found
INFO net.py: 88: inception_4a-3x3_w not found
INFO net.py: 88: inception_4a-3x3_b not found
INFO net.py: 88: inception_4a-5x5_reduce_w not found
INFO net.py: 88: inception_4a-5x5_reduce_b not found
INFO net.py: 88: inception_4a-5x5_w not found
INFO net.py: 88: inception_4a-5x5_b not found
INFO net.py: 88: inception_4a-pool_proj_w not found
INFO net.py: 88: inception_4a-pool_proj_b not found
INFO net.py: 88: conv_rpn_w not found
INFO net.py: 88: conv_rpn_b not found
INFO net.py: 88: rpn_cls_logits_w not found
INFO net.py: 88: rpn_cls_logits_b not found
INFO net.py: 88: rpn_bbox_pred_w not found
INFO net.py: 88: rpn_bbox_pred_b not found
INFO net.py: 88: head_conv1_w not found
INFO net.py: 88: head_conv1_gn_s not found
INFO net.py: 88: head_conv1_gn_b not found
INFO net.py: 88: head_conv2_w not found
INFO net.py: 88: head_conv2_gn_s not found
INFO net.py: 88: head_conv2_gn_b not found
INFO net.py: 88: head_conv3_w not found
INFO net.py: 88: head_conv3_gn_s not found
INFO net.py: 88: head_conv3_gn_b not found
INFO net.py: 88: head_conv4_w not found
INFO net.py: 88: head_conv4_gn_s not found
INFO net.py: 88: head_conv4_gn_b not found
INFO net.py: 88: _mask_fcn1_w not found
INFO net.py: 88: _mask_fcn1_gn_s not found
INFO net.py: 88: _mask_fcn1_gn_b not found
INFO net.py: 88: _mask_fcn2_w not found
INFO net.py: 88: _mask_fcn2_gn_s not found
INFO net.py: 88: _mask_fcn2_gn_b not found
INFO net.py: 88: _mask_fcn3_w not found
INFO net.py: 88: _mask_fcn3_gn_s not found
INFO net.py: 88: _mask_fcn3_gn_b not found
INFO net.py: 88: _mask_fcn4_w not found
INFO net.py: 88: _mask_fcn4_gn_s not found
INFO net.py: 88: _mask_fcn4_gn_b not found
INFO net.py: 88: conv5_mask_w not found
INFO net.py: 88: conv5_mask_b not found
INFO net.py: 88: mask_fcn_logits_w not found
INFO net.py: 88: mask_fcn_logits_b not found
I0529 11:23:29.991972 25591 net_dag_utils.cc:102] Operator graph pruning prior to chain compute took: 5.1028e-05 secs
I0529 11:23:29.992153 25591 net_dag.cc:46] Number of parallel execution chains 25 Number of operators = 78
I0529 11:23:29.995748 25591 net_dag_utils.cc:102] Operator graph pruning prior to chain compute took: 3.4177e-05 secs
I0529 11:23:29.995859 25591 net_dag.cc:46] Number of parallel execution chains 18 Number of operators = 53
I0529 11:23:29.997187 25591 net_dag_utils.cc:102] Operator graph pruning prior to chain compute took: 1.3739e-05 secs
I0529 11:23:29.997264 25591 net_dag.cc:46] Number of parallel execution chains 1 Number of operators = 17
E0529 11:23:54.638962 26563 net_dag.cc:195] Exception from operator chain starting at '' (type 'Conv'): caffe2::EnforceNotMet: [enforce fail at blob.h:84] IsType<T>(). wrong type for the Blob instance. Blob contains nullptr (uninitialized) while caller expects caffe2::Tensor<caffe2::CUDAContext> . Offending Blob name: gpu_0/conv1-7x7_s2_w. Error from operator: input: "gpu_0/data" input: "gpu_0/conv1-7x7_s2_w" input: "gpu_0/conv1-7x7_s2_b" output: "gpu_0/conv1-7x7_s2" name: "" type: "Conv" arg { name: "kernel" i: 7 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 3 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 2 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "CUDNN"
WARNING workspace.py: 185: Original python traceback for operator 0 in network generalized_rcnn in exception above (most recent call last):
WARNING workspace.py: 190: File "/home/srikilaru/detectron/tools/test_net.py", line 116, in <module>
WARNING workspace.py: 190: File "/home/srikilaru/detectron/detectron/core/test_engine.py", line 128, in run_inference
WARNING workspace.py: 190: File "/home/srikilaru/detectron/detectron/core/test_engine.py", line 125, in result_getter
WARNING workspace.py: 190: File "/home/srikilaru/detectron/detectron/core/test_engine.py", line 235, in test_net
WARNING workspace.py: 190: File "/home/srikilaru/detectron/detectron/core/test_engine.py", line 328, in initialize_model_from_cfg
WARNING workspace.py: 190: File "/home/srikilaru/detectron/detectron/modeling/model_builder.py", line 124, in create
WARNING workspace.py: 190: File "/home/srikilaru/detectron/detectron/modeling/model_builder.py", line 89, in generalized_rcnn
WARNING workspace.py: 190: File "/home/srikilaru/detectron/detectron/modeling/model_builder.py", line 230, in build_generic_detection_model
WARNING workspace.py: 190: File "/home/srikilaru/detectron/detectron/modeling/optimizer.py", line 54, in build_data_parallel_model
WARNING workspace.py: 190: File "/home/srikilaru/detectron/detectron/modeling/model_builder.py", line 170, in _single_gpu_build_func
WARNING workspace.py: 190: File "/home/srikilaru/detectron/detectron/modeling/Inception_ResNetv2.py", line 27, in add_inception_resnetv2_xxs_conv5_body
WARNING workspace.py: 190: File "/home/srikilaru/pytorch/build/caffe2/python/cnn.py", line 97, in Conv
WARNING workspace.py: 190: File "/home/srikilaru/pytorch/build/caffe2/python/brew.py", line 107, in scope_wrapper
WARNING workspace.py: 190: File "/home/srikilaru/pytorch/build/caffe2/python/helpers/conv.py", line 186, in conv
WARNING workspace.py: 190: File "/home/srikilaru/pytorch/build/caffe2/python/helpers/conv.py", line 139, in _Conv
INFO net.py: 88: inception_3b-pool_proj_b not found
Base Traceback (most recent call last):
  File "/home/srikilaru/detectron/tools/test_net.py", line 116, in <module>
    check_expected_results=True,
  File "/home/srikilaru/detectron/detectron/core/test_engine.py", line 128, in run_inference
    all_results = result_getter()
  File "/home/srikilaru/detectron/detectron/core/test_engine.py", line 125, in result_getter
    gpu_id=gpu_id
  File "/home/srikilaru/detectron/detectron/core/test_engine.py", line 258, in test_net
    model, im, box_proposals, timers
  File "/home/srikilaru/detectron/detectron/core/test.py", line 66, in im_detect_all
    model, im, cfg.TEST.SCALE, cfg.TEST.MAX_SIZE, boxes=box_proposals
  File "/home/srikilaru/detectron/detectron/core/test.py", line 158, in im_detect_bbox
    workspace.RunNet(model.net.Proto().name)
  File "/home/srikilaru/pytorch/build/caffe2/python/workspace.py", line 217, in RunNet
    StringifyNetName(name), num_iter, allow_fail,
  File "/home/srikilaru/pytorch/build/caffe2/python/workspace.py", line 178, in CallWithExceptionIntercept
    return func(*args, **kwargs)
RuntimeError: [enforce fail at blob.h:84] IsType<T>(). wrong type for the Blob instance. Blob contains nullptr (uninitialized) while caller expects caffe2::Tensor<caffe2::CUDAContext> . Offending Blob name: gpu_0/conv1-7x7_s2_w. Error from operator: input: "gpu_0/data" input: "gpu_0/conv1-7x7_s2_w" input: "gpu_0/conv1-7x7_s2_b" output: "gpu_0/conv1-7x7_s2" name: "" type: "Conv" arg { name: "kernel" i: 7 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 3 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 2 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "CUDNN"
Traceback (most recent call last):
  File "tools/test_net.py", line 116, in <module>
    check_expected_results=True,
  File "/home/srikilaru/detectron/detectron/core/test_engine.py", line 128, in run_inference
    all_results = result_getter()
  File "/home/srikilaru/detectron/detectron/core/test_engine.py", line 108, in result_getter
    multi_gpu=multi_gpu_testing
  File "/home/srikilaru/detectron/detectron/core/test_engine.py", line 155, in test_net_on_dataset
    weights_file, dataset_name, proposal_file, num_images, output_dir
  File "/home/srikilaru/detectron/detectron/core/test_engine.py", line 188, in multi_gpu_test_net_on_dataset
    'detection', num_images, binary, output_dir, opts
  File "/home/srikilaru/detectron/detectron/utils/subprocess.py", line 95, in process_in_parallel
    log_subprocess_output(i, p, output_dir, tag, start, end)
  File "/home/srikilaru/detectron/detectron/utils/subprocess.py", line 133, in log_subprocess_output
    assert ret == 0, 'Range subprocess failed (exit code: {})'.format(ret)
AssertionError: Range subprocess failed (exit code: 1)
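
Each `not found` line in the log appears to be emitted while weights are loaded: a parameter that the test-time graph expects is not present in the weights file, so its blob stays uninitialized and the first `Conv` fails. One way to check this offline is sketched below. It assumes the `.pkl` file is a pickled dict that stores parameters either at the top level or under a `blobs` key; that layout is an assumption, not something shown in this report, and the helper names `load_weights`, `stored_blob_names`, and `report_missing` are hypothetical.

```python
import pickle
import sys


def load_weights(path):
    """Load a pickled weights file (hypothetical helper).

    Pickles written by Python 2 usually need encoding='latin1'
    when they are read from Python 3.
    """
    with open(path, 'rb') as f:
        if sys.version_info[0] >= 3:
            return pickle.load(f, encoding='latin1')
        return pickle.load(f)


def stored_blob_names(data):
    """Return the set of parameter names stored in the loaded dict.

    Assumption: parameters live under a 'blobs' key if that key
    exists, otherwise at the top level of the dict.
    """
    blobs = data['blobs'] if 'blobs' in data else data
    return set(blobs.keys())


def report_missing(expected_names, data):
    """Return the expected names that are absent from the weights, sorted."""
    stored = stored_blob_names(data)
    return sorted(name for name in expected_names if name not in stored)


# Example usage with the path from the command above and two names
# taken from the log:
#   data = load_weights('/tmp/detectron-output/train/coco_2014_train/'
#                       'generalized_rcnn/model_final.pkl')
#   print(report_missing(['conv1-7x7_s2_w', 'conv_rpn_w'], data))
```

If names such as `conv1-7x7_s2_w` are reported as missing, comparing them against the names that are actually stored can show whether the test-time config builds a different graph from the one that was trained.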

System information
Operating system: Ubuntu 16.04 LTS
Compiler version: gcc version 6.4.0 20180424 (Ubuntu 6.4.0-17ubuntu1~16.04)
CUDA version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Wed_Apr_11_23:16:29_CDT_2018
Cuda compilation tools, release 9.2, V9.2.88
cuDNN version: 6.0.21
NVIDIA driver version:
Mon May 28 00:31:39 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   72C    P8    31W / 149W |      6MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:00:05.0 Off |                    0 |
| N/A   36C    P8    28W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1992      G   /usr/lib/xorg/Xorg                             4MiB |
+-----------------------------------------------------------------------------+

GPU models (for all devices if they are not all the same): 2 K80 on GCP
PYTHONPATH environment variable: /home/srikilaru/pytorch/build
python --version output: Python 2.7.12
Anything else that seems relevant: Caffe2 installed as per instructions on Detectron and Caffe2 site Ubuntu 16.04 and

youngwanLEE commented 6 years ago

@srikanth-kilaru, could you let me know how you handle the batchnorm layers of the IRv2 network?

Did you convert the batchnorm parameters to AffineChannel parameters and freeze them?

I'm also adapting other backbone networks to Detectron, but I'm confused about how to handle the batchnorm parameters. The Detectron team said that batchnorm layers are replaced with AffineChannel, but did not mention how to freeze the parameters.
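
For reference, the conversion asked about here is usually done by folding the batchnorm statistics into a per-channel scale and bias, so that `scale * x + bias` reproduces the inference-time batchnorm output. The sketch below shows only that arithmetic. The function name `bn_to_affine` and the default `eps` are assumptions for illustration, and it does not cover how Detectron names, loads, or freezes the resulting parameters.

```python
import math


def bn_to_affine(gamma, beta, mean, var, eps=1e-5):
    """Fold per-channel batchnorm parameters into (scale, bias) lists.

    Inference-time batchnorm computes, for each channel:
        y = gamma * (x - mean) / sqrt(var + eps) + beta
    which is the same as y = scale * x + bias with:
        scale = gamma / sqrt(var + eps)
        bias  = beta - mean * scale
    """
    scales, biases = [], []
    for g, b, m, v in zip(gamma, beta, mean, var):
        s = g / math.sqrt(v + eps)
        scales.append(s)
        biases.append(b - m * s)
    return scales, biases


# Example: one channel with gamma=1.5, beta=0.1, mean=0.5, var=4.0
scale, bias = bn_to_affine([1.5], [0.1], [0.5], [4.0], eps=0.0)
```

Because the folded values are constants computed from the pretrained statistics, they only stay equivalent to the original batchnorm as long as they are not updated during training.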