EdjeElectronics / TensorFlow-Object-Detection-API-Tutorial-Train-Multiple-Objects-Windows-10

How to train a TensorFlow Object Detection Classifier for multiple object detection on Windows
Apache License 2.0

InternalError: #283

Open LavenderMP opened 5 years ago

LavenderMP commented 5 years ago

When I follow the tutorial, it works fine for me, even after I switched to a custom dataset; it performs perfectly. But while browsing the model zoo I found a better model, the NAS model, so I decided to train with the new NASNet architecture. I followed the same steps the tutorial suggests, but this time it gives a ton of errors:

`2019-05-20 15:33:58.688073: W tensorflow/core/common_runtime/bfc_allocator.cc:271] **** 2019-05-20 15:33:58.690676: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at concat_op.cc:153 : Resource exhausted: OOM when allocating tensor with shape[64,4032,9,9] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc Traceback (most recent call last): File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1334, in _do_call return fn(*args) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1319, in _run_fn options, feed_dict, fetch_list, target_list, run_metadata) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1407, in _call_tf_sessionrun run_metadata) tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[64,9,9,672] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[{{node SecondStageFeatureExtractor/cell_14/comb_iter_1/right/separable_3x3_2/separable_conv2d-1-1-TransposeNCHWToNHWC-LayoutOptimizer}}]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[{{node gradients/FirstStageFeatureExtractor/cell_11/beginning_bn/FusedBatchNorm_grad/FusedBatchNormGrad}}]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\training\supervisor.py", line 994, in managed_session yield sess File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\contrib\slim\python\slim\learning.py", line 770, in train sess, train_op, global_step, train_step_kwargs) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\contrib\slim\python\slim\learning.py", line 487, in train_step run_metadata=run_metadata) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 929, in run run_metadata_ptr) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _run feed_dict_tensor, options, run_metadata) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1328, in _do_run run_metadata) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1348, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[64,9,9,672] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[{{node SecondStageFeatureExtractor/cell_14/comb_iter_1/right/separable_3x3_2/separable_conv2d-1-1-TransposeNCHWToNHWC-LayoutOptimizer}}]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[node gradients/FirstStageFeatureExtractor/cell_11/beginning_bn/FusedBatchNorm_grad/FusedBatchNormGrad (defined at C:\tensorflow1\models\research\slim\deployment\model_deploy.py:263) ]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "train.py", line 184, in tf.app.run() File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run _sys.exit(main(argv)) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\util\deprecation.py", line 324, in new_func return func(*args, *kwargs) File "train.py", line 180, in main graph_hook_fn=graph_rewriter_fn) File "C:\tensorflow1\models\research\object_detection\legacy\trainer.py", line 416, in train saver=saver) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\contrib\slim\python\slim\learning.py", line 785, in train ignore_live_threads=ignore_live_threads) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\contextlib.py", line 130, in exit self.gen.throw(type, value, traceback) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\training\supervisor.py", line 1004, in managed_session self.stop(close_summary_writer=close_summary_writer) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\training\supervisor.py", line 832, in stop ignore_live_threads=ignore_live_threads) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\training\coordinator.py", line 389, in join six.reraise(self._exc_info_to_raise) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\six.py", line 693, in reraise raise value File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\training\coordinator.py", line 297, in stop_on_exception yield File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\training\coordinator.py", line 495, in run self.run_loop() File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\training\supervisor.py", line 1034, in run_loop self._sv.global_step]) File 
"C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 929, in run run_metadata_ptr) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _run feed_dict_tensor, options, run_metadata) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1328, in _do_run run_metadata) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1348, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized. [[node FirstStageFeatureExtractor/cell_10/comb_iter_1/left/separable_5x5_1/pointwise_weights/read (defined at C:\tensorflow1\models\research\slim\nets\nasnet\nasnet_utils.py:191) ]] [[{{node ConstantFoldingCtrl/Loss/RPNLoss/Loss/huber_loss/assert_broadcastable/AssertGuard/Switch_0}}]]

Caused by op 'FirstStageFeatureExtractor/cell_10/comb_iter_1/left/separable_5x5_1/pointwise_weights/read', defined at: File "train.py", line 184, in tf.app.run() File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run _sys.exit(main(argv)) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\util\deprecation.py", line 324, in new_func return func(*args, kwargs) File "train.py", line 180, in main graph_hook_fn=graph_rewriter_fn) File "C:\tensorflow1\models\research\object_detection\legacy\trainer.py", line 291, in train clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue]) File "C:\tensorflow1\models\research\slim\deployment\model_deploy.py", line 193, in create_clones outputs = model_fn(*args, *kwargs) File "C:\tensorflow1\models\research\object_detection\legacy\trainer.py", line 204, in _create_losses prediction_dict = detection_model.predict(images, true_image_shapes) File "C:\tensorflow1\models\research\object_detection\meta_architectures\faster_rcnn_meta_arch.py", line 647, in predict image_shape) = self._extract_rpn_feature_maps(preprocessed_inputs) File "C:\tensorflow1\models\research\object_detection\meta_architectures\faster_rcnn_meta_arch.py", line 978, in _extract_rpn_feature_maps scope=self.first_stage_feature_extractor_scope)) File "C:\tensorflow1\models\research\object_detection\meta_architectures\faster_rcnn_meta_arch.py", line 163, in extract_proposal_features return self._extract_proposal_features(preprocessed_inputs, scope) File "C:\tensorflow1\models\research\object_detection\models\faster_rcnn_nas_feature_extractor.py", line 194, in _extract_proposal_features final_endpoint='Cell_11') File "C:\tensorflow1\models\research\slim\nets\nasnet\nasnet.py", line 448, in build_nasnet_large current_step=current_step) File "C:\tensorflow1\models\research\slim\nets\nasnet\nasnet.py", line 520, in _build_nasnet_base current_step=current_step) 
File "C:\tensorflow1\models\research\slim\nets\nasnet\nasnet_utils.py", line 337, in call current_step) File "C:\tensorflow1\models\research\slim\nets\nasnet\nasnet_utils.py", line 367, in _apply_conv_operation self._use_bounded_activation) File "C:\tensorflow1\models\research\slim\nets\nasnet\nasnet_utils.py", line 191, in _stacked_separable_conv stride=stride) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\contrib\framework\python\ops\arg_scope.py", line 182, in func_with_args return func(args, current_args) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\contrib\layers\python\layers\layers.py", line 2778, in separable_convolution2d outputs = layer.apply(inputs) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 1227, in apply return self.call(inputs, *args, kwargs) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\layers\base.py", line 530, in call outputs = super(Layer, self).call(inputs, *args, kwargs) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 538, in call self._maybe_build(inputs) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 1603, in _maybe_build self.build(input_shapes) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\keras\layers\convolutional.py", line 1342, in build dtype=self.dtype) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\layers\base.py", line 435, in add_weight getter=vs.get_variable) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 349, in add_weight aggregation=aggregation) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\training\checkpointable\base.py", line 607, in 
_add_variable_with_custom_getter kwargs_for_getter) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\variable_scope.py", line 1479, in get_variable aggregation=aggregation) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\variable_scope.py", line 1220, in get_variable aggregation=aggregation) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\variable_scope.py", line 530, in get_variable return custom_getter(*custom_getter_kwargs) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\contrib\layers\python\layers\layers.py", line 1750, in layer_variable_getter return _model_variable_getter(getter, args, kwargs) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\contrib\layers\python\layers\layers.py", line 1741, in _model_variable_getter aggregation=aggregation) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\contrib\framework\python\ops\arg_scope.py", line 182, in func_with_args return func(*args, current_args) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\contrib\framework\python\ops\variables.py", line 350, in model_variable aggregation=aggregation) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\contrib\framework\python\ops\arg_scope.py", line 182, in func_with_args return func(*args, *current_args) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\contrib\framework\python\ops\variables.py", line 277, in variable aggregation=aggregation) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\variable_scope.py", line 499, in _true_getter aggregation=aggregation) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\variable_scope.py", line 911, in _get_single_variable aggregation=aggregation) File 
"C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\variables.py", line 213, in call return cls._variable_v1_call(args, kwargs) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\variables.py", line 176, in _variable_v1_call aggregation=aggregation) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\variables.py", line 155, in previous_getter = lambda kwargs: default_variable_creator(None, kwargs) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\variable_scope.py", line 2495, in default_variable_creator expected_shape=expected_shape, import_scope=import_scope) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\variables.py", line 217, in call return super(VariableMetaclass, cls).call(*args, kwargs) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\variables.py", line 1395, in init constraint=constraint) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\variables.py", line 1557, in _init_from_args self._snapshot = array_ops.identity(self._variable, name="read") File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\util\dispatch.py", line 180, in wrapper return target(*args, *kwargs) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\array_ops.py", line 81, in identity ret = gen_array_ops.identity(input, name=name) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\gen_array_ops.py", line 3890, in identity "Identity", input=input, name=name) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper op_def=op_def) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, 
in new_func return func(args, kwargs) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\ops.py", line 3300, in create_op op_def=op_def) File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\ops.py", line 1801, in init self._traceback = tf_stack.extract_stack()

InternalError (see above for traceback): Dst tensor is not initialized. [[node FirstStageFeatureExtractor/cell_10/comb_iter_1/left/separable_5x5_1/pointwise_weights/read (defined at C:\tensorflow1\models\research\slim\nets\nasnet\nasnet_utils.py:191) ]] [[{{node ConstantFoldingCtrl/Loss/RPNLoss/Loss/huber_loss/assert_broadcastable/AssertGuard/Switch_0}}]]`

LavenderMP commented 5 years ago

Additional information: when I try to run train.py, it consumes a lot of GPU RAM and prints out a lot of chunk information.
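
For a rough sense of scale, the two tensors named in the OOM messages in the log above can be sized by hand. This is only a back-of-the-envelope sketch: the shapes are copied from the log, 4 bytes per element is assumed because the log reports type float, and the helper name `tensor_bytes` is made up for illustration. Each tensor comes out well under 100 MiB, which suggests (though the log alone does not prove it) that the GPU was already nearly full when these allocations were requested, rather than any single tensor being too large.

```python
from math import prod  # available in Python 3.8+

BYTES_PER_FLOAT32 = 4  # assumption: "type float" in the log means 32-bit floats


def tensor_bytes(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Return the memory needed for one dense tensor of the given shape."""
    return prod(shape) * bytes_per_element


# Shapes copied from the two OOM messages in the log.
oom_shapes = [(64, 4032, 9, 9), (64, 9, 9, 672)]

for shape in oom_shapes:
    size = tensor_bytes(shape)
    print(f"{shape}: {size} bytes = {size / 2**20:.1f} MiB")
```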

LavenderMP commented 5 years ago

More information: while digging into faster_rcnn_nas_coco.config, I found this:

  # TODO(shlens): Only fixed_shape_resizer is currently supported for NASNet
  # featurization. The reason for this is that nasnet.py only supports
  # inputs with fully known shapes. We need to update nasnet.py to handle
  # shapes not known at compile time.

Does this mean that we can only train NASNet on images of a fixed 1200x1200 size?

Theriyadh commented 5 years ago

I am also having the same issue, any luck?

LavenderMP commented 5 years ago

> I am also having the same issue, any luck?

As I said earlier: you have to feed the network fixed-size images, in this case 1200x1200 if I remember correctly, which means that you have to do all the annotations.
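
If the images are resized offline to a fixed size, boxes that are already stored in pixel coordinates could probably be rescaled instead of drawn again. Below is a minimal sketch of that idea, assuming each box is an (xmin, ymin, xmax, ymax) tuple in pixels of the original image. The function name `rescale_box` and the example values are hypothetical and are not part of the tutorial or the Object Detection API; the 1200x1200 default is simply the size mentioned in this thread.

```python
def rescale_box(box, old_size, new_size=(1200, 1200)):
    """Map a pixel-coordinate box from the original image to the resized one.

    box      -- (xmin, ymin, xmax, ymax) in pixels of the original image
    old_size -- (width, height) of the original image
    new_size -- (width, height) after resizing (1200x1200 as mentioned in the thread)
    """
    old_w, old_h = old_size
    new_w, new_h = new_size
    # A fixed-shape resize may stretch the image, so x and y scale independently.
    sx, sy = new_w / old_w, new_h / old_h
    xmin, ymin, xmax, ymax = box
    return (round(xmin * sx), round(ymin * sy), round(xmax * sx), round(ymax * sy))


# Hypothetical example: one box on a 600x400 image resized to 1200x1200.
print(rescale_box((100, 50, 300, 200), old_size=(600, 400)))
```

Because the horizontal and vertical factors are computed separately, the sketch still works when the resize changes the aspect ratio.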