Open LavenderMP opened 5 years ago
Additional information: when I try to run train.py, it consumes a lot of GPU RAM and prints out a lot of chunks.
More additional information: when I dug into faster_rcnn_nas_coco.config, I found this:
# TODO(shlens): Only fixed_shape_resizer is currently supported for NASNet
# featurization. The reason for this is that nasnet.py only supports
# inputs with fully known shapes. We need to update nasnet.py to handle
# shapes not known at compile time.
Does this mean that we can only use fixed-size 1200x1200 images to train on NASNet?
I am also having the same issue, any luck?
As I said earlier, you have to feed the network fixed-size images, in this case 1200x1200 if I remember correctly, which means that you have to do all the annotations for images of that size.
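If the images end up being resized to 1200x1200 offline, existing boxes could probably be rescaled arithmetically instead of being drawn again. Here is a minimal sketch of that arithmetic, assuming the annotations are absolute pixel boxes in (xmin, ymin, xmax, ymax) order. The helper name `scale_box` is made up for illustration and is not part of the Object Detection API.

```python
def scale_box(box, old_size, new_size=(1200, 1200)):
    """Rescale an absolute-pixel box after its image has been resized.

    box:      (xmin, ymin, xmax, ymax) in pixels of the original image.
    old_size: (width, height) of the original image.
    new_size: (width, height) after resizing; 1200x1200 is the size
              mentioned in this thread, not a value read from the config.
    """
    xmin, ymin, xmax, ymax = box
    old_w, old_h = old_size
    new_w, new_h = new_size
    # Independent scale factors per axis, since a fixed-shape resize
    # does not have to preserve the aspect ratio.
    sx = new_w / old_w
    sy = new_h / old_h
    return (xmin * sx, ymin * sy, xmax * sx, ymax * sy)


if __name__ == "__main__":
    # A 600x400 image stretched to 1200x1200: x doubles, y triples.
    print(scale_box((100, 50, 300, 200), old_size=(600, 400)))
```

This only covers the coordinate math; whether the stretched images still train well is a separate question.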
When I followed the tutorial, it worked fine for me, even after I switched to a custom dataset, and it performed perfectly. But I was looking through the model zoo and saw that there is a better model, the NAS model, so I decided to train with the new NASNet architecture. I followed the same steps the tutorial suggests, but this time it gave a ton of errors:
`2019-05-20 15:33:58.688073: W tensorflow/core/common_runtime/bfc_allocator.cc:271] ****
2019-05-20 15:33:58.690676: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at concat_op.cc:153 : Resource exhausted: OOM when allocating tensor with shape[64,4032,9,9] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1334, in _do_call
return fn(*args)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[64,9,9,672] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node SecondStageFeatureExtractor/cell_14/comb_iter_1/right/separable_3x3_2/separable_conv2d-1-1-TransposeNCHWToNHWC-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\training\supervisor.py", line 994, in managed_session
yield sess
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\contrib\slim\python\slim\learning.py", line 770, in train
sess, train_op, global_step, train_step_kwargs)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\contrib\slim\python\slim\learning.py", line 487, in train_step
run_metadata=run_metadata)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 929, in run
run_metadata_ptr)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1328, in _do_run
run_metadata)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[64,9,9,672] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node SecondStageFeatureExtractor/cell_14/comb_iter_1/right/separable_3x3_2/separable_conv2d-1-1-TransposeNCHWToNHWC-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 184, in <module>
tf.app.run()
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
_sys.exit(main(argv))
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\util\deprecation.py", line 324, in new_func
return func(*args, **kwargs)
File "train.py", line 180, in main
graph_hook_fn=graph_rewriter_fn)
File "C:\tensorflow1\models\research\object_detection\legacy\trainer.py", line 416, in train
saver=saver)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\contrib\slim\python\slim\learning.py", line 785, in train
ignore_live_threads=ignore_live_threads)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\contextlib.py", line 130, in __exit__
self.gen.throw(type, value, traceback)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\training\supervisor.py", line 1004, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\training\supervisor.py", line 832, in stop
ignore_live_threads=ignore_live_threads)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\training\coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\six.py", line 693, in reraise
raise value
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\training\coordinator.py", line 297, in stop_on_exception
yield
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\training\coordinator.py", line 495, in run
self.run_loop()
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\training\supervisor.py", line 1034, in run_loop
self._sv.global_step])
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 929, in run
run_metadata_ptr)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1328, in _do_run
run_metadata)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized.
[[node FirstStageFeatureExtractor/cell_10/comb_iter_1/left/separable_5x5_1/pointwise_weights/read (defined at C:\tensorflow1\models\research\slim\nets\nasnet\nasnet_utils.py:191) ]]
[[{{node ConstantFoldingCtrl/Loss/RPNLoss/Loss/huber_loss/assert_broadcastable/AssertGuard/Switch_0}}]]
Caused by op 'FirstStageFeatureExtractor/cell_10/comb_iter_1/left/separable_5x5_1/pointwise_weights/read', defined at:
File "train.py", line 184, in <module>
tf.app.run()
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
_sys.exit(main(argv))
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\util\deprecation.py", line 324, in new_func
return func(*args, **kwargs)
File "train.py", line 180, in main
graph_hook_fn=graph_rewriter_fn)
File "C:\tensorflow1\models\research\object_detection\legacy\trainer.py", line 291, in train
clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
File "C:\tensorflow1\models\research\slim\deployment\model_deploy.py", line 193, in create_clones
outputs = model_fn(*args, **kwargs)
File "C:\tensorflow1\models\research\object_detection\legacy\trainer.py", line 204, in _create_losses
prediction_dict = detection_model.predict(images, true_image_shapes)
File "C:\tensorflow1\models\research\object_detection\meta_architectures\faster_rcnn_meta_arch.py", line 647, in predict
image_shape) = self._extract_rpn_feature_maps(preprocessed_inputs)
File "C:\tensorflow1\models\research\object_detection\meta_architectures\faster_rcnn_meta_arch.py", line 978, in _extract_rpn_feature_maps
scope=self.first_stage_feature_extractor_scope))
File "C:\tensorflow1\models\research\object_detection\meta_architectures\faster_rcnn_meta_arch.py", line 163, in extract_proposal_features
return self._extract_proposal_features(preprocessed_inputs, scope)
File "C:\tensorflow1\models\research\object_detection\models\faster_rcnn_nas_feature_extractor.py", line 194, in _extract_proposal_features
final_endpoint='Cell_11')
File "C:\tensorflow1\models\research\slim\nets\nasnet\nasnet.py", line 448, in build_nasnet_large
current_step=current_step)
File "C:\tensorflow1\models\research\slim\nets\nasnet\nasnet.py", line 520, in _build_nasnet_base
current_step=current_step)
File "C:\tensorflow1\models\research\slim\nets\nasnet\nasnet_utils.py", line 337, in __call__
current_step)
File "C:\tensorflow1\models\research\slim\nets\nasnet\nasnet_utils.py", line 367, in _apply_conv_operation
self._use_bounded_activation)
File "C:\tensorflow1\models\research\slim\nets\nasnet\nasnet_utils.py", line 191, in _stacked_separable_conv
stride=stride)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\contrib\framework\python\ops\arg_scope.py", line 182, in func_with_args
return func(*args, **current_args)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\contrib\layers\python\layers\layers.py", line 2778, in separable_convolution2d
outputs = layer.apply(inputs)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 1227, in apply
return self.__call__(inputs, *args, **kwargs)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\layers\base.py", line 530, in __call__
outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 538, in __call__
self._maybe_build(inputs)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 1603, in _maybe_build
self.build(input_shapes)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\keras\layers\convolutional.py", line 1342, in build
dtype=self.dtype)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\layers\base.py", line 435, in add_weight
getter=vs.get_variable)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 349, in add_weight
aggregation=aggregation)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\training\checkpointable\base.py", line 607, in _add_variable_with_custom_getter
kwargs_for_getter)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\variable_scope.py", line 1479, in get_variable
aggregation=aggregation)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\variable_scope.py", line 1220, in get_variable
aggregation=aggregation)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\variable_scope.py", line 530, in get_variable
return custom_getter(**custom_getter_kwargs)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\contrib\layers\python\layers\layers.py", line 1750, in layer_variable_getter
return _model_variable_getter(getter, *args, **kwargs)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\contrib\layers\python\layers\layers.py", line 1741, in _model_variable_getter
aggregation=aggregation)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\contrib\framework\python\ops\arg_scope.py", line 182, in func_with_args
return func(*args, **current_args)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\contrib\framework\python\ops\variables.py", line 350, in model_variable
aggregation=aggregation)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\contrib\framework\python\ops\arg_scope.py", line 182, in func_with_args
return func(*args, **current_args)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\contrib\framework\python\ops\variables.py", line 277, in variable
aggregation=aggregation)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\variable_scope.py", line 499, in _true_getter
aggregation=aggregation)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\variable_scope.py", line 911, in _get_single_variable
aggregation=aggregation)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\variables.py", line 213, in __call__
return cls._variable_v1_call(*args, **kwargs)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\variables.py", line 176, in _variable_v1_call
aggregation=aggregation)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\variables.py", line 155, in <lambda>
previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\variable_scope.py", line 2495, in default_variable_creator
expected_shape=expected_shape, import_scope=import_scope)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\variables.py", line 217, in __call__
return super(VariableMetaclass, cls).__call__(*args, **kwargs)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\variables.py", line 1395, in __init__
constraint=constraint)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\variables.py", line 1557, in _init_from_args
self._snapshot = array_ops.identity(self._variable, name="read")
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\util\dispatch.py", line 180, in wrapper
return target(*args, **kwargs)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\array_ops.py", line 81, in identity
ret = gen_array_ops.identity(input, name=name)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\gen_array_ops.py", line 3890, in identity
"Identity", input=input, name=name)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\ops.py", line 3300, in create_op
op_def=op_def)
File "C:\Users\hvhnk\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()
InternalError (see above for traceback): Dst tensor is not initialized.
[[node FirstStageFeatureExtractor/cell_10/comb_iter_1/left/separable_5x5_1/pointwise_weights/read (defined at C:\tensorflow1\models\research\slim\nets\nasnet\nasnet_utils.py:191) ]]
[[{{node ConstantFoldingCtrl/Loss/RPNLoss/Loss/huber_loss/assert_broadcastable/AssertGuard/Switch_0}}]]`
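To put the two OOM messages in the log above in perspective, here is a rough sketch that works out how large the tensors that failed to allocate are. It assumes "type float" in the log means 4-byte float32; the helper names `tensor_bytes` and `to_mib` are only for illustration.

```python
from math import prod  # Python 3.8+

BYTES_PER_FLOAT32 = 4  # assumption: "type float" in the log is float32


def tensor_bytes(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Size in bytes of a dense tensor with the given shape."""
    return prod(shape) * bytes_per_element


def to_mib(num_bytes):
    """Convert a byte count to mebibytes."""
    return num_bytes / (1024 ** 2)


if __name__ == "__main__":
    # Shapes copied from the two OOM messages in the log.
    for shape in [(64, 4032, 9, 9), (64, 9, 9, 672)]:
        size = tensor_bytes(shape)
        print(f"shape{list(shape)}: {size} bytes, about {to_mib(size):.1f} MiB")
```

Both tensors come out well under 100 MiB, which suggests the GPU was already almost full by the time these allocations were attempted, rather than any single tensor being too large on its own.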