TannerGilbert / Tensorflow-Object-Detection-API-Train-Model

Train an object detection model with the TensorFlow Object Detection API and TensorFlow 2.
https://gilberttanner.com/blog/creating-your-own-objectdetector
MIT License
194 stars, 104 forks

There are non-GPU devices - GPU devices not detected. #37

Open StormWeaver opened 2 years ago

StormWeaver commented 2 years ago

Describe the bug
After building and running the Docker image, I receive an error stating that no GPUs were detected, and the model_main_tf2.py script fails to run.

Your repo and the TensorFlow issue tracker suggest several different fixes for similar problems, so I tried a few of them.

I tried to install nvidia-docker directly with sudo; however, a password prompt appeared, and neither setting a password in the docker run step nor looking for a default password got me past it.

I have run my modified Dockerfiles and the originals side by side, and both produce the same result.

To Reproduce
Steps to reproduce the behavior:

  1. Build the Docker image following the instructions in the repository and/or on the blog page (e.g. docker build -f research/object_detection/dockerfiles/tf2/Dockerfile -t od . ), using Docker with Linux containers.
  2. Run the container (e.g. docker run -it od).
  3. Inside the container, after creating train.record and test.record successfully, attempt to run the training script (e.g. python object_detection/model_main_tf2.py --pipeline_config_path=object_detection/training/ssd_efficientdet_d0_512x512_coco17_tpu-8.config --model_dir=object_detection/training/ --alsologtostderr).
  4. See the error listed below.
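Note that the `docker run -it od` command in step 2 does not request any GPUs for the container. The sketch below shows how a GPU-enabled variant of that command could be assembled. It assumes Docker 19.03 or newer and the NVIDIA Container Toolkit installed on the host, neither of which is confirmed in this report; `build_run_command` is a hypothetical helper name, not part of the repository.

```python
import shlex


def build_run_command(image, use_gpu=True):
    """Return the argv list for an interactive `docker run` of `image`."""
    cmd = ["docker", "run", "-it"]
    if use_gpu:
        # --gpus is a Docker CLI flag (19.03+). It only works when the
        # NVIDIA Container Toolkit is installed on the host (assumption).
        cmd += ["--gpus", "all"]
    cmd.append(image)
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_run_command("od")))
```

Because the flag is handled on the host side, this would not require installing nvidia-docker with sudo inside the container.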

WARNING:tensorflow:There are non-GPU devices in tf.distribute.Strategy, not using nccl allreduce. W0622 03:08:07.678852 140015840864064 cross_device_ops.py:1386] There are non-GPU devices in tf.distribute.Strategy, not using nccl allreduce. INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',) I0622 03:08:07.691202 140015840864064 mirrored_strategy.py:374] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',) INFO:tensorflow:Maybe overwriting train_steps: None I0622 03:08:07.694155 140015840864064 config_util.py:552] Maybe overwriting train_steps: None INFO:tensorflow:Maybe overwriting use_bfloat16: False I0622 03:08:07.694274 140015840864064 config_util.py:552] Maybe overwriting use_bfloat16: False I0622 03:08:07.699931 140015840864064 ssd_efficientnet_bifpn_feature_extractor.py:145] EfficientDet EfficientNet backbone version: efficientnet-b0 I0622 03:08:07.700029 140015840864064 ssd_efficientnet_bifpn_feature_extractor.py:147] EfficientDet BiFPN num filters: 64 I0622 03:08:07.700090 140015840864064 ssd_efficientnet_bifpn_feature_extractor.py:148] EfficientDet BiFPN num iterations: 3 I0622 03:08:07.702906 140015840864064 efficientnet_model.py:143] round_filter input=32 output=32 I0622 03:08:07.740919 140015840864064 efficientnet_model.py:143] round_filter input=32 output=32 I0622 03:08:07.741045 140015840864064 efficientnet_model.py:143] round_filter input=16 output=16 I0622 03:08:07.798309 140015840864064 efficientnet_model.py:143] round_filter input=16 output=16 I0622 03:08:07.798432 140015840864064 efficientnet_model.py:143] round_filter input=24 output=24 I0622 03:08:07.944522 140015840864064 efficientnet_model.py:143] round_filter input=24 output=24 I0622 03:08:07.944638 140015840864064 efficientnet_model.py:143] round_filter input=40 output=40 I0622 03:08:08.091527 140015840864064 efficientnet_model.py:143] round_filter input=40 output=40 I0622 03:08:08.091642 140015840864064 
efficientnet_model.py:143] round_filter input=80 output=80 I0622 03:08:08.317637 140015840864064 efficientnet_model.py:143] round_filter input=80 output=80 I0622 03:08:08.317753 140015840864064 efficientnet_model.py:143] round_filter input=112 output=112 I0622 03:08:08.537171 140015840864064 efficientnet_model.py:143] round_filter input=112 output=112 I0622 03:08:08.537288 140015840864064 efficientnet_model.py:143] round_filter input=192 output=192 I0622 03:08:08.839897 140015840864064 efficientnet_model.py:143] round_filter input=192 output=192 I0622 03:08:08.840018 140015840864064 efficientnet_model.py:143] round_filter input=320 output=320 I0622 03:08:08.912957 140015840864064 efficientnet_model.py:143] round_filter input=1280 output=1280 I0622 03:08:08.947752 140015840864064 efficientnet_model.py:453] Building model efficientnet with params ModelConfig(width_coefficient=1.0, depth_coefficient=1.0, resolution=224, dropout_rate=0.2, blocks=(BlockConfig(input_filters=32, output_filters=16, kernel_size=3, num_repeat=1, expand_ratio=1, strides=(1, 1), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=16, output_filters=24, kernel_size=3, num_repeat=2, expand_ratio=6, strides=(2, 2), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=24, output_filters=40, kernel_size=5, num_repeat=2, expand_ratio=6, strides=(2, 2), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=40, output_filters=80, kernel_size=3, num_repeat=3, expand_ratio=6, strides=(2, 2), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=80, output_filters=112, kernel_size=5, num_repeat=3, expand_ratio=6, strides=(1, 1), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=112, output_filters=192, kernel_size=5, num_repeat=4, expand_ratio=6, strides=(2, 2), se_ratio=0.25, 
id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=192, output_filters=320, kernel_size=3, num_repeat=1, expand_ratio=6, strides=(1, 1), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise')), stem_base_filters=32, top_base_filters=1280, activation='simple_swish', batch_norm='default', bn_momentum=0.99, bn_epsilon=0.001, weight_decay=5e-06, drop_connect_rate=0.2, depth_divisor=8, min_depth=None, use_se=True, input_channels=3, num_classes=1000, model_name='efficientnet', rescale_input=False, data_format='channels_last', dtype='float32') WARNING:tensorflow:From /home/tensorflow/.local/lib/python3.8/site-packages/object_detection/model_lib_v2.py:563: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version. Instructions for updating: rename to distribute_datasets_from_function W0622 03:08:08.973673 140015840864064 deprecation.py:350] From /home/tensorflow/.local/lib/python3.8/site-packages/object_detection/model_lib_v2.py:563: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version. 
Instructions for updating: rename to distribute_datasets_from_function INFO:tensorflow:Reading unweighted datasets: ['object_detection/training/train.record'] I0622 03:08:08.980458 140015840864064 dataset_builder.py:162] Reading unweighted datasets: ['object_detection/training/train.record'] INFO:tensorflow:Reading record datasets for input file: ['object_detection/training/train.record'] I0622 03:08:08.980628 140015840864064 dataset_builder.py:79] Reading record datasets for input file: ['object_detection/training/train.record'] INFO:tensorflow:Number of filenames to read: 0 I0622 03:08:08.980718 140015840864064 dataset_builder.py:80] Number of filenames to read: 0 Traceback (most recent call last): File "object_detection/model_main_tf2.py", line 120, in tf.compat.v1.app.run() File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/platform/app.py", line 36, in run _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef) File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 312, in run _run_main(main, args) File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 258, in _run_main sys.exit(main(argv)) File "object_detection/model_main_tf2.py", line 111, in main model_lib_v2.train_loop( File "/home/tensorflow/.local/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 563, in train_loop train_input = strategy.experimental_distribute_datasets_from_function( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/deprecation.py", line 357, in new_func return func(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1195, in experimental_distribute_datasets_from_function return self.distribute_datasets_from_function(dataset_fn, options) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1186, in distribute_datasets_from_function return self._extended._distribute_datasets_from_function( # pylint: 
disable=protected-access File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 593, in _distribute_datasets_from_function return input_util.get_distributed_datasets_from_function( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/input_util.py", line 132, in get_distributed_datasets_from_function return input_lib.DistributedDatasetsFromFunction( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/input_lib.py", line 1372, in init self.build() File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/input_lib.py", line 1393, in build _create_datasets_from_function_with_input_context( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/input_lib.py", line 1875, in _create_datasets_from_function_with_input_context dataset = dataset_fn(ctx) File "/home/tensorflow/.local/lib/python3.8/site-packages/object_detection/model_lib_v2.py", line 554, in train_dataset_fn train_input = inputs.train_input( File "/home/tensorflow/.local/lib/python3.8/site-packages/object_detection/inputs.py", line 908, in train_input dataset = INPUT_BUILDER_UTIL_MAP['dataset_build']( File "/home/tensorflow/.local/lib/python3.8/site-packages/object_detection/builders/dataset_builder.py", line 243, in build dataset = read_dataset( File "/home/tensorflow/.local/lib/python3.8/site-packages/object_detection/builders/dataset_builder.py", line 163, in read_dataset return _read_dataset_internal(file_read_func, input_files, File "/home/tensorflow/.local/lib/python3.8/site-packages/object_detection/builders/dataset_builder.py", line 82, in _read_dataset_internal raise RuntimeError('Did not find any input files matching the glob pattern ' RuntimeError: Did not find any input files matching the glob pattern ['object_detection/training/train.record']
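The RuntimeError that ends this log is raised by the input pipeline rather than by GPU detection: the log reports "Number of filenames to read: 0" for the relative path object_detection/training/train.record. A minimal pre-flight check could look like the sketch below. It uses Python's standard glob module as a stand-in for whatever the library does internally, and it assumes the path is resolved against the directory the training script is started from; `missing_patterns` is a hypothetical helper name.

```python
import glob
import os


def missing_patterns(patterns, base_dir="."):
    """Return the patterns that match no file when resolved against base_dir."""
    missing = []
    for pattern in patterns:
        if not glob.glob(os.path.join(base_dir, pattern)):
            missing.append(pattern)
    return missing


if __name__ == "__main__":
    # Path taken from the log above; run this from the same working
    # directory as the training command.
    wanted = ["object_detection/training/train.record"]
    for pattern in missing_patterns(wanted):
        print(f"no file matches {pattern!r} from {os.getcwd()}")
```

If this prints a line for train.record, the record file is probably not in the freshly started container, or the command is being run from a different directory than the one the path is relative to.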

Expected behavior
Based on the instructions, I expected training to start. Instead, I receive a series of warnings and errors indicating that the process has halted or failed.

Desktop

Additional context
During one attempt I received a slightly different message. After trying some workarounds to build the Nvidia Docker image, this message has not reappeared in later attempts:

tensorflow@943f2e0f8488:~/models/research$ python object_detection/model_main_tf2.py --pipeline_config_path=object_detection/training/ssd_efficientdet_d0_512x512_coco17_tpu-8.config --model_dir=object_detection/training/ --alsologtostderr WARNING:tensorflow:There are non-GPU devices in tf.distribute.Strategy, not using nccl allreduce. W0622 00:20:06.607538 140269787281216 cross_device_ops.py:1386] There are non-GPU devices in tf.distribute.Strategy, not using nccl allreduce. INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',) I0622 00:20:06.611269 140269787281216 mirrored_strategy.py:374] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',) INFO:tensorflow:Maybe overwriting train_steps: None I0622 00:20:06.613795 140269787281216 config_util.py:552] Maybe overwriting train_steps: None INFO:tensorflow:Maybe overwriting use_bfloat16: False I0622 00:20:06.613889 140269787281216 config_util.py:552] Maybe overwriting use_bfloat16: False I0622 00:20:06.618684 140269787281216 ssd_efficientnet_bifpn_feature_extractor.py:145] EfficientDet EfficientNet backbone version: efficientnet-b0 I0622 00:20:06.618793 140269787281216 ssd_efficientnet_bifpn_feature_extractor.py:147] EfficientDet BiFPN num filters: 64 I0622 00:20:06.618869 140269787281216 ssd_efficientnet_bifpn_feature_extractor.py:148] EfficientDet BiFPN num iterations: 3 I0622 00:20:06.621786 140269787281216 efficientnet_model.py:143] round_filter input=32 output=32 I0622 00:20:06.710441 140269787281216 efficientnet_model.py:143] round_filter input=32 output=32 I0622 00:20:06.710567 140269787281216 efficientnet_model.py:143] round_filter input=16 output=16 I0622 00:20:06.767254 140269787281216 efficientnet_model.py:143] round_filter input=16 output=16 I0622 00:20:06.767369 140269787281216 efficientnet_model.py:143] round_filter input=24 output=24 I0622 00:20:06.913850 140269787281216 efficientnet_model.py:143] round_filter input=24 
output=24 I0622 00:20:06.913977 140269787281216 efficientnet_model.py:143] round_filter input=40 output=40 I0622 00:20:07.055300 140269787281216 efficientnet_model.py:143] round_filter input=40 output=40 I0622 00:20:07.055412 140269787281216 efficientnet_model.py:143] round_filter input=80 output=80 I0622 00:20:07.269554 140269787281216 efficientnet_model.py:143] round_filter input=80 output=80 I0622 00:20:07.269668 140269787281216 efficientnet_model.py:143] round_filter input=112 output=112 I0622 00:20:07.485285 140269787281216 efficientnet_model.py:143] round_filter input=112 output=112 I0622 00:20:07.485399 140269787281216 efficientnet_model.py:143] round_filter input=192 output=192 I0622 00:20:07.789512 140269787281216 efficientnet_model.py:143] round_filter input=192 output=192 I0622 00:20:07.789628 140269787281216 efficientnet_model.py:143] round_filter input=320 output=320 I0622 00:20:07.861017 140269787281216 efficientnet_model.py:143] round_filter input=1280 output=1280 I0622 00:20:07.895739 140269787281216 efficientnet_model.py:453] Building model efficientnet with params ModelConfig(width_coefficient=1.0, depth_coefficient=1.0, resolution=224, dropout_rate=0.2, blocks=(BlockConfig(input_filters=32, output_filters=16, kernel_size=3, num_repeat=1, expand_ratio=1, strides=(1, 1), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=16, output_filters=24, kernel_size=3, num_repeat=2, expand_ratio=6, strides=(2, 2), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=24, output_filters=40, kernel_size=5, num_repeat=2, expand_ratio=6, strides=(2, 2), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=40, output_filters=80, kernel_size=3, num_repeat=3, expand_ratio=6, strides=(2, 2), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=80, output_filters=112, kernel_size=5, 
num_repeat=3, expand_ratio=6, strides=(1, 1), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=112, output_filters=192, kernel_size=5, num_repeat=4, expand_ratio=6, strides=(2, 2), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=192, output_filters=320, kernel_size=3, num_repeat=1, expand_ratio=6, strides=(1, 1), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise')), stem_base_filters=32, top_base_filters=1280, activation='simple_swish', batch_norm='default', bn_momentum=0.99, bn_epsilon=0.001, weight_decay=5e-06, drop_connect_rate=0.2, depth_divisor=8, min_depth=None, use_se=True, input_channels=3, num_classes=1000, model_name='efficientnet', rescale_input=False, data_format='channels_last', dtype='float32') WARNING:tensorflow:From /home/tensorflow/.local/lib/python3.8/site-packages/object_detection/model_lib_v2.py:563: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version. Instructions for updating: rename to distribute_datasets_from_function W0622 00:20:07.921413 140269787281216 deprecation.py:350] From /home/tensorflow/.local/lib/python3.8/site-packages/object_detection/model_lib_v2.py:563: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version. 
Instructions for updating: rename to distribute_datasets_from_function INFO:tensorflow:Reading unweighted datasets: ['object_detection/training/train.record'] I0622 00:20:07.925122 140269787281216 dataset_builder.py:162] Reading unweighted datasets: ['object_detection/training/train.record'] INFO:tensorflow:Reading record datasets for input file: ['object_detection/training/train.record'] I0622 00:20:07.925260 140269787281216 dataset_builder.py:79] Reading record datasets for input file: ['object_detection/training/train.record'] INFO:tensorflow:Number of filenames to read: 1 I0622 00:20:07.925341 140269787281216 dataset_builder.py:80] Number of filenames to read: 1 WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards. W0622 00:20:07.925419 140269787281216 dataset_builder.py:86] num_readers has been reduced to 1 to match input file shards. WARNING:tensorflow:From /home/tensorflow/.local/lib/python3.8/site-packages/object_detection/builders/dataset_builder.py:100: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE) instead. If sloppy execution is desired, use tf.data.Options.deterministic. W0622 00:20:07.926657 140269787281216 deprecation.py:350] From /home/tensorflow/.local/lib/python3.8/site-packages/object_detection/builders/dataset_builder.py:100: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE) instead. If sloppy execution is desired, use tf.data.Options.deterministic. 
WARNING:tensorflow:From /home/tensorflow/.local/lib/python3.8/site-packages/object_detection/builders/dataset_builder.py:235: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.data.Dataset.map() W0622 00:20:07.940346 140269787281216 deprecation.py:350] From /home/tensorflow/.local/lib/python3.8/site-packages/object_detection/builders/dataset_builder.py:235: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version. Instructions for updating: Usetf.data.Dataset.map() WARNING:tensorflow:From /home/tensorflow/.local/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:1082: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version. Instructions for updating: Create a tf.sparse.SparseTensor and use tf.sparse.to_dense instead. W0622 00:20:12.002727 140269787281216 deprecation.py:350] From /home/tensorflow/.local/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:1082: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version. Instructions for updating: Create a tf.sparse.SparseTensor and use tf.sparse.to_dense instead. WARNING:tensorflow:From /home/tensorflow/.local/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:1082: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.cast instead. W0622 00:20:14.406469 140269787281216 deprecation.py:350] From /home/tensorflow/.local/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:1082: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.cast instead. 
/home/tensorflow/.local/lib/python3.8/site-packages/keras/backend.py:450: UserWarning: tf.keras.backend.set_learning_phase is deprecated and will be removed after 2020-10-11. To update it, simply pass a True/False value to the training argument of the __call__ method of your layer or model. warnings.warn('tf.keras.backend.set_learning_phase is deprecated and ' WARNING:tensorflow:From /home/tensorflow/.local/lib/python3.8/site-packages/tensorflow/python/util/deprecation.py:629: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version. Instructions for updating: Use fn_output_signature instead W0622 00:20:38.413714 140262147876608 deprecation.py:554] From /home/tensorflow/.local/lib/python3.8/site-packages/tensorflow/python/util/deprecation.py:629: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version. Instructions for updating: Use fn_output_signature instead WARNING:tensorflow:Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss. If you're using model.compile(), did you forget to provide a lossargument? W0622 00:20:45.288741 140262147876608 utils.py:76] Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss. If you're using model.compile(), did you forget to provide a lossargument? WARNING:tensorflow:Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss. If you're using model.compile(), did you forget to provide a lossargument? W0622 00:20:53.888063 140262147876608 utils.py:76] Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss. If you're using model.compile(), did you forget to provide a lossargument? WARNING:tensorflow:Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss. 
If you're using model.compile(), did you forget to provide a lossargument? W0622 00:21:02.133720 140262147876608 utils.py:76] Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss. If you're using model.compile(), did you forget to provide a lossargument? WARNING:tensorflow:Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss. If you're using model.compile(), did you forget to provide a lossargument? W0622 00:21:12.010699 140262147876608 utils.py:76] Gradients do not exist for variables ['top_bn/gamma:0', 'top_bn/beta:0'] when minimizing the loss. If you're using model.compile(), did you forget to provide a lossargument? Killed
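Both logs report "Using MirroredStrategy with devices" followed by a single CPU:0 entry, which is where the missing GPU shows up. As a rough sketch, a saved copy of such a log could be scanned for the devices TensorFlow actually picked up. The regular expression is an assumption based only on the device strings visible in the logs above, and `devices_in_log` and `uses_gpu` are hypothetical helper names.

```python
import re

# Matches device names such as /device:CPU:0 or /device:GPU:1.
DEVICE_RE = re.compile(r"/device:([A-Z_]+):(\d+)")


def devices_in_log(log_text):
    """Return the sorted, de-duplicated (device_type, index) pairs in log_text."""
    return sorted({(kind, int(idx)) for kind, idx in DEVICE_RE.findall(log_text)})


def uses_gpu(log_text):
    """Return True if any GPU device is mentioned in log_text."""
    return any(kind == "GPU" for kind, _ in devices_in_log(log_text))


if __name__ == "__main__":
    sample = (
        "Using MirroredStrategy with devices "
        "('/job:localhost/replica:0/task:0/device:CPU:0',)"
    )
    print(devices_in_log(sample))  # [('CPU', 0)]
    print(uses_gpu(sample))        # False
```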