Describe the bug
After building and running the Docker image, I receive a warning that no GPUs were detected, and the model_main_tf2.py script fails to run.
There were several different suggestions and solutions for similar issues between your repo and TensorFlow, so I attempted a few of them:
changing the NVIDIA GPU apt-key (this appeared to be an issue at one point, but reverting it recently did not seem to cause any change)
disabling gcloud and gsutil commands
adding a gpu_device_name check to model_main_XX.py
I also tried to install nvidia-docker directly with sudo, but a password prompt appeared, and my attempts to set a password in the docker run section or to find an existing password were unsuccessful.
I have run my modified Dockerfiles and the originals side by side and get the same result with both.
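For reference, a minimal sketch of the kind of GPU check mentioned above, assuming TensorFlow 2.1 or later (where `tf.config.list_physical_devices` is available). The helper name `list_visible_gpus` is hypothetical and only used for illustration; it is not part of model_main_tf2.py.

```python
def list_visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None if TensorFlow is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment.
        return None
    # An empty list means TensorFlow started but found no usable GPU.
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = list_visible_gpus()
    if gpus is None:
        print("TensorFlow is not installed in this environment.")
    elif gpus:
        print("Visible GPUs:", gpus)
    else:
        print("No GPUs visible; training will fall back to CPU.")
```

Running this inside the container, before the training script, should show whether the container can see a GPU at all, independent of the Object Detection API.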
To Reproduce
Steps to reproduce the behavior:
Build the Docker image following the instructions on Git and/or the blog page (ex: docker build -f research/object_detection/dockerfiles/tf2/Dockerfile -t od .) (Docker - Linux Containers)
Run the container (ex: docker run -it od)
In the container, after creating train.record and test.record successfully, attempt to run the training script (ex: python object_detection/model_main_tf2.py --pipeline_config_path=object_detection/training/ssd_efficientdet_d0_512x512_coco17_tpu-8.config --model_dir=object_detection/training/ --alsologtostderr)
See error listed below.
WARNING:tensorflow:There are non-GPU devices in tf.distribute.Strategy, not using nccl allreduce.
W0622 03:08:07.678852 140015840864064 cross_device_ops.py:1386] There are non-GPU devices in tf.distribute.Strategy, not using nccl allreduce.
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
I0622 03:08:07.691202 140015840864064 mirrored_strategy.py:374] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
INFO:tensorflow:Maybe overwriting train_steps: None
I0622 03:08:07.694155 140015840864064 config_util.py:552] Maybe overwriting train_steps: None
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I0622 03:08:07.694274 140015840864064 config_util.py:552] Maybe overwriting use_bfloat16: False
I0622 03:08:07.699931 140015840864064 ssd_efficientnet_bifpn_feature_extractor.py:145] EfficientDet EfficientNet backbone version: efficientnet-b0
I0622 03:08:07.700029 140015840864064 ssd_efficientnet_bifpn_feature_extractor.py:147] EfficientDet BiFPN num filters: 64
I0622 03:08:07.700090 140015840864064 ssd_efficientnet_bifpn_feature_extractor.py:148] EfficientDet BiFPN num iterations: 3
I0622 03:08:07.702906 140015840864064 efficientnet_model.py:143] round_filter input=32 output=32
I0622 03:08:07.740919 140015840864064 efficientnet_model.py:143] round_filter input=32 output=32
I0622 03:08:07.741045 140015840864064 efficientnet_model.py:143] round_filter input=16 output=16
I0622 03:08:07.798309 140015840864064 efficientnet_model.py:143] round_filter input=16 output=16
I0622 03:08:07.798432 140015840864064 efficientnet_model.py:143] round_filter input=24 output=24
I0622 03:08:07.944522 140015840864064 efficientnet_model.py:143] round_filter input=24 output=24
I0622 03:08:07.944638 140015840864064 efficientnet_model.py:143] round_filter input=40 output=40
I0622 03:08:08.091527 140015840864064 efficientnet_model.py:143] round_filter input=40 output=40
I0622 03:08:08.091642 140015840864064 efficientnet_model.py:143] round_filter input=80 output=80
I0622 03:08:08.317637 140015840864064 efficientnet_model.py:143] round_filter input=80 output=80
I0622 03:08:08.317753 140015840864064 efficientnet_model.py:143] round_filter input=112 output=112
I0622 03:08:08.537171 140015840864064 efficientnet_model.py:143] round_filter input=112 output=112
I0622 03:08:08.537288 140015840864064 efficientnet_model.py:143] round_filter input=192 output=192
I0622 03:08:08.839897 140015840864064 efficientnet_model.py:143] round_filter input=192 output=192
I0622 03:08:08.840018 140015840864064 efficientnet_model.py:143] round_filter input=320 output=320
I0622 03:08:08.912957 140015840864064 efficientnet_model.py:143] round_filter input=1280 output=1280
I0622 03:08:08.947752 140015840864064 efficientnet_model.py:453] Building model efficientnet with params ModelConfig(width_coefficient=1.0, depth_coefficient=1.0, resolution=224, dropout_rate=0.2, blocks=(BlockConfig(input_filters=32, output_filters=16, kernel_size=3, num_repeat=1, expand_ratio=1, strides=(1, 1), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=16, output_filters=24, kernel_size=3, num_repeat=2, expand_ratio=6, strides=(2, 2), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=24, output_filters=40, kernel_size=5, num_repeat=2, expand_ratio=6, strides=(2, 2), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=40, output_filters=80, kernel_size=3, num_repeat=3, expand_ratio=6, strides=(2, 2), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=80, output_filters=112, kernel_size=5, num_repeat=3, expand_ratio=6, strides=(1, 1), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=112, output_filters=192, kernel_size=5, num_repeat=4, expand_ratio=6, strides=(2, 2), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=192, output_filters=320, kernel_size=3, num_repeat=1, expand_ratio=6, strides=(1, 1), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise')), stem_base_filters=32, top_base_filters=1280, activation='simple_swish', batch_norm='default', bn_momentum=0.99, bn_epsilon=0.001, weight_decay=5e-06, drop_connect_rate=0.2, depth_divisor=8, min_depth=None, use_se=True, input_channels=3, num_classes=1000, model_name='efficientnet', rescale_input=False, data_format='channels_last', dtype='float32')
WARNING:tensorflow:From /home/tensorflow/.local/lib/python3.8/site-packages/object_detection/model_lib_v2.py:563: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
W0622 03:08:08.973673 140015840864064 deprecation.py:350] From /home/tensorflow/.local/lib/python3.8/site-packages/object_detection/model_lib_v2.py:563: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
INFO:tensorflow:Reading unweighted datasets: ['object_detection/training/train.record']
I0622 03:08:08.980458 140015840864064 dataset_builder.py:162] Reading unweighted datasets: ['object_detection/training/train.record']
INFO:tensorflow:Reading record datasets for input file: ['object_detection/training/train.record']
I0622 03:08:08.980628 140015840864064 dataset_builder.py:79] Reading record datasets for input file: ['object_detection/training/train.record']
INFO:tensorflow:Number of filenames to read: 0
I0622 03:08:08.980718 140015840864064 dataset_builder.py:80] Number of filenames to read: 0
Traceback (most recent call last):
File "object_detection/model_main_tf2.py", line 120, in <module>
tf.compat.v1.app.run()
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/platform/app.py", line 36, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "object_detection/model_main_tf2.py", line 111, in main
model_lib_v2.train_loop(
File "/home/tensorflow/.local/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 563, in train_loop
train_input = strategy.experimental_distribute_datasets_from_function(
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/deprecation.py", line 357, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1195, in experimental_distribute_datasets_from_function
return self.distribute_datasets_from_function(dataset_fn, options)
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1186, in distribute_datasets_from_function
return self._extended._distribute_datasets_from_function( # pylint: disable=protected-access
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 593, in _distribute_datasets_from_function
return input_util.get_distributed_datasets_from_function(
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/input_util.py", line 132, in get_distributed_datasets_from_function
return input_lib.DistributedDatasetsFromFunction(
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/input_lib.py", line 1372, in __init__
self.build()
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/input_lib.py", line 1393, in build
_create_datasets_from_function_with_input_context(
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/input_lib.py", line 1875, in _create_datasets_from_function_with_input_context
dataset = dataset_fn(ctx)
File "/home/tensorflow/.local/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 554, in train_dataset_fn
train_input = inputs.train_input(
File "/home/tensorflow/.local/lib/python3.8/site-packages/object_detection/inputs.py", line 908, in train_input
dataset = INPUT_BUILDER_UTIL_MAP['dataset_build'](
File "/home/tensorflow/.local/lib/python3.8/site-packages/object_detection/builders/dataset_builder.py", line 243, in build
dataset = read_dataset(
File "/home/tensorflow/.local/lib/python3.8/site-packages/object_detection/builders/dataset_builder.py", line 163, in read_dataset
return _read_dataset_internal(file_read_func, input_files,
File "/home/tensorflow/.local/lib/python3.8/site-packages/object_detection/builders/dataset_builder.py", line 82, in _read_dataset_internal
raise RuntimeError('Did not find any input files matching the glob pattern '
RuntimeError: Did not find any input files matching the glob pattern ['object_detection/training/train.record']
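The RuntimeError above reports a relative path, so whether it matches depends on the directory the script is started from. A minimal sketch for checking that, assuming Python's standard `glob` module as a stand-in for whatever matching the Object Detection API does internally; the helper name `check_input_pattern` is hypothetical:

```python
import glob
import os


def check_input_pattern(pattern, base_dir=None):
    """Return the files matching `pattern`, resolved relative to `base_dir`."""
    # Default to the current working directory, as a relative path would.
    base_dir = base_dir or os.getcwd()
    full_pattern = os.path.join(base_dir, pattern)
    return sorted(glob.glob(full_pattern))


if __name__ == "__main__":
    # Pattern copied from the error message in the log above.
    pattern = "object_detection/training/train.record"
    matches = check_input_pattern(pattern)
    print("Working directory:", os.getcwd())
    print("Matches for", pattern, "->", matches or "none")
```

An empty result from the same directory the training command is run in would suggest the record file is missing or the path in the pipeline config points somewhere else.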
Expected behavior
Based on the instructions, I should see training begin, but instead I receive a series of messages and errors suggesting the process has halted or failed.
Desktop
Windows 11 Pro
Chrome
102.0.5005.115
Additional context
During one earlier attempt I received a slightly different message. After I tried some workarounds to build the Nvidia Docker, these messages have not reappeared in later attempts:
tensorflow@943f2e0f8488:~/models/research$ python object_detection/model_main_tf2.py --pipeline_config_path=object_detection/training/ssd_efficientdet_d0_512x512_coco17_tpu-8.config --model_dir=object_detection/training/ --alsologtostderr
WARNING:tensorflow:There are non-GPU devices in tf.distribute.Strategy, not using nccl allreduce.
W0622 00:20:06.607538 140269787281216 cross_device_ops.py:1386] There are non-GPU devices in tf.distribute.Strategy, not using nccl allreduce.
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
I0622 00:20:06.611269 140269787281216 mirrored_strategy.py:374] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
INFO:tensorflow:Maybe overwriting train_steps: None
I0622 00:20:06.613795 140269787281216 config_util.py:552] Maybe overwriting train_steps: None
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I0622 00:20:06.613889 140269787281216 config_util.py:552] Maybe overwriting use_bfloat16: False
I0622 00:20:06.618684 140269787281216 ssd_efficientnet_bifpn_feature_extractor.py:145] EfficientDet EfficientNet backbone version: efficientnet-b0
I0622 00:20:06.618793 140269787281216 ssd_efficientnet_bifpn_feature_extractor.py:147] EfficientDet BiFPN num filters: 64
I0622 00:20:06.618869 140269787281216 ssd_efficientnet_bifpn_feature_extractor.py:148] EfficientDet BiFPN num iterations: 3
I0622 00:20:06.621786 140269787281216 efficientnet_model.py:143] round_filter input=32 output=32
I0622 00:20:06.710441 140269787281216 efficientnet_model.py:143] round_filter input=32 output=32
I0622 00:20:06.710567 140269787281216 efficientnet_model.py:143] round_filter input=16 output=16
I0622 00:20:06.767254 140269787281216 efficientnet_model.py:143] round_filter input=16 output=16
I0622 00:20:06.767369 140269787281216 efficientnet_model.py:143] round_filter input=24 output=24
I0622 00:20:06.913850 140269787281216 efficientnet_model.py:143] round_filter input=24 output=24
I0622 00:20:06.913977 140269787281216 efficientnet_model.py:143] round_filter input=40 output=40
I0622 00:20:07.055300 140269787281216 efficientnet_model.py:143] round_filter input=40 output=40
I0622 00:20:07.055412 140269787281216 efficientnet_model.py:143] round_filter input=80 output=80
I0622 00:20:07.269554 140269787281216 efficientnet_model.py:143] round_filter input=80 output=80
I0622 00:20:07.269668 140269787281216 efficientnet_model.py:143] round_filter input=112 output=112
I0622 00:20:07.485285 140269787281216 efficientnet_model.py:143] round_filter input=112 output=112
I0622 00:20:07.485399 140269787281216 efficientnet_model.py:143] round_filter input=192 output=192
I0622 00:20:07.789512 140269787281216 efficientnet_model.py:143] round_filter input=192 output=192
I0622 00:20:07.789628 140269787281216 efficientnet_model.py:143] round_filter input=320 output=320
I0622 00:20:07.861017 140269787281216 efficientnet_model.py:143] round_filter input=1280 output=1280
I0622 00:20:07.895739 140269787281216 efficientnet_model.py:453] Building model efficientnet with params ModelConfig(width_coefficient=1.0, depth_coefficient=1.0, resolution=224, dropout_rate=0.2, blocks=(BlockConfig(input_filters=32, output_filters=16, kernel_size=3, num_repeat=1, expand_ratio=1, strides=(1, 1), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=16, output_filters=24, kernel_size=3, num_repeat=2, expand_ratio=6, strides=(2, 2), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=24, output_filters=40, kernel_size=5, num_repeat=2, expand_ratio=6, strides=(2, 2), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=40, output_filters=80, kernel_size=3, num_repeat=3, expand_ratio=6, strides=(2, 2), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=80, output_filters=112, kernel_size=5, num_repeat=3, expand_ratio=6, strides=(1, 1), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=112, output_filters=192, kernel_size=5, num_repeat=4, expand_ratio=6, strides=(2, 2), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=192, output_filters=320, kernel_size=3, num_repeat=1, expand_ratio=6, strides=(1, 1), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise')), stem_base_filters=32, top_base_filters=1280, activation='simple_swish', batch_norm='default', bn_momentum=0.99, bn_epsilon=0.001, weight_decay=5e-06, drop_connect_rate=0.2, depth_divisor=8, min_depth=None, use_se=True, input_channels=3, num_classes=1000, model_name='efficientnet', rescale_input=False, data_format='channels_last', dtype='float32')
WARNING:tensorflow:From /home/tensorflow/.local/lib/python3.8/site-packages/object_detection/model_lib_v2.py:563: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
W0622 00:20:07.921413 140269787281216 deprecation.py:350] From /home/tensorflow/.local/lib/python3.8/site-packages/object_detection/model_lib_v2.py:563: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
INFO:tensorflow:Reading unweighted datasets: ['object_detection/training/train.record']
I0622 00:20:07.925122 140269787281216 dataset_builder.py:162] Reading unweighted datasets: ['object_detection/training/train.record']
INFO:tensorflow:Reading record datasets for input file: ['object_detection/training/train.record']
I0622 00:20:07.925260 140269787281216 dataset_builder.py:79] Reading record datasets for input file: ['object_detection/training/train.record']
INFO:tensorflow:Number of filenames to read: 1
I0622 00:20:07.925341 140269787281216 dataset_builder.py:80] Number of filenames to read: 1
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
W0622 00:20:07.925419 140269787281216 dataset_builder.py:86] num_readers has been reduced to 1 to match input file shards.
WARNING:tensorflow:From /home/tensorflow/.local/lib/python3.8/site-packages/object_detection/builders/dataset_builder.py:100: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE) instead. If sloppy execution is desired, use tf.data.Options.deterministic.
W0622 00:20:07.926657 140269787281216 deprecation.py:350] From /home/tensorflow/.local/lib/python3.8/site-packages/object_detection/builders/dataset_builder.py:100: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE) instead. If sloppy execution is desired, use tf.data.Options.deterministic.
WARNING:tensorflow:From /home/tensorflow/.local/lib/python3.8/site-packages/object_detection/builders/dataset_builder.py:235: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.data.Dataset.map()
W0622 00:20:07.940346 140269787281216 deprecation.py:350] From /home/tensorflow/.local/lib/python3.8/site-packages/object_detection/builders/dataset_builder.py:235: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.data.Dataset.map()
WARNING:tensorflow:From /home/tensorflow/.local/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:1082: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a tf.sparse.SparseTensor and use tf.sparse.to_dense instead.
W0622 00:20:12.002727 140269787281216 deprecation.py:350] From /home/tensorflow/.local/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:1082: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a tf.sparse.SparseTensor and use tf.sparse.to_dense instead.
WARNING:tensorflow:From /home/tensorflow/.local/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:1082: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
W0622 00:20:14.406469 140269787281216 deprecation.py:350] From /home/tensorflow/.local/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:1082: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
/home/tensorflow/.local/lib/python3.8/site-packages/keras/backend.py:450: UserWarning: tf.keras.backend.set_learning_phase is deprecated and will be removed after 2020-10-11. To update it, simply pass a True/False value to the training argument of the __call__ method of your layer or model.
warnings.warn('tf.keras.backend.set_learning_phase is deprecated and '
WARNING:tensorflow:From /home/tensorflow/.local/lib/python3.8/site-packages/tensorflow/python/util/deprecation.py:629: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
W0622 00:20:38.413714 140262147876608 deprecation.py:554] From /home/tensorflow/.local/lib/python3.8/site-packages/tensorflow/python/util/deprecation.py:629: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
WARNING:tensorflow:Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss. If you're using model.compile(), did you forget to provide a loss argument?
W0622 00:20:45.288741 140262147876608 utils.py:76] Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss. If you're using model.compile(), did you forget to provide a loss argument?
WARNING:tensorflow:Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss. If you're using model.compile(), did you forget to provide a loss argument?
W0622 00:20:53.888063 140262147876608 utils.py:76] Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss. If you're using model.compile(), did you forget to provide a loss argument?
WARNING:tensorflow:Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss. If you're using model.compile(), did you forget to provide a loss argument?
W0622 00:21:02.133720 140262147876608 utils.py:76] Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss. If you're using model.compile(), did you forget to provide a loss argument?
WARNING:tensorflow:Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss. If you're using model.compile(), did you forget to provide a loss argument?
W0622 00:21:12.010699 140262147876608 utils.py:76] Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss. If you're using model.compile(), did you forget to provide a loss argument?
Killed
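A bare "Killed" line with no Python traceback is often what a shell prints when the kernel's out-of-memory killer stops a process. That is an assumption here; the log above does not confirm it. A minimal sketch for checking how much memory the container sees, assuming a Linux container where /proc/meminfo is readable; the helper name `parse_meminfo` is hypothetical:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> reported integer value."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # Skip lines that are not "Name: value" pairs.
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as handle:
            info = parse_meminfo(handle.read())
    except OSError:
        info = {}  # Not a Linux system, or /proc is not mounted.
    # These three fields are reported in kB, so divide twice by 1024 for GiB.
    for key in ("MemTotal", "MemAvailable", "SwapTotal"):
        if key in info:
            print(f"{key}: {info[key] / 1024 / 1024:.1f} GiB")
```

If the available memory is small compared with what the training job needs, that would be consistent with the process being killed shortly after the first training steps.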