TannerGilbert / Tensorflow-Object-Detection-API-Train-Model

Train an object detection model with the Tensorflow Object Detection API and Tensorflow 2.
https://gilberttanner.com/blog/creating-your-own-objectdetector
MIT License

Use config: faster_rcnn_resnet50_v1_800x1333_coco17_gpu-8.config Error #14

Closed: Shaiken closed this issue 3 years ago

Shaiken commented 3 years ago

How can I fix this? I have been searching for an answer for a long time. Please help.

My environment: RTX 3070, CUDA 11.1, tf-nightly-gpu==2.5.0.dev20201226, Python 3.8.5

Traceback (most recent call last):
  File "model_main_tf2_1.py", line 114, in <module>
    tf.compat.v1.app.run()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "model_main_tf2_1.py", line 104, in main
    model_lib_v2.train_loop(
  File "/code/model/research/object_detection/model_lib_v2.py", line 522, in train_loop
    train_input = strategy.experimental_distribute_datasets_from_function(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/deprecation.py", line 337, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1147, in experimental_distribute_datasets_from_function
    return self.distribute_datasets_from_function(dataset_fn, options)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1138, in distribute_datasets_from_function
    return self._extended._distribute_datasets_from_function(  # pylint: disable=protected-access
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 545, in _distribute_datasets_from_function
    return input_lib.get_distributed_datasets_from_function(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/input_lib.py", line 161, in get_distributed_datasets_from_function
    return DistributedDatasetsFromFunction(dataset_fn, input_workers,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/input_lib.py", line 1271, in __init__
    _create_datasets_from_function_with_input_context(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/input_lib.py", line 1935, in _create_datasets_from_function_with_input_context
    dataset = dataset_fn(ctx)
  File "/code/model/research/object_detection/model_lib_v2.py", line 513, in train_dataset_fn
    train_input = inputs.train_input(
  File "/code/model/research/object_detection/inputs.py", line 870, in train_input
    dataset = INPUT_BUILDER_UTIL_MAP['dataset_build'](
  File "/code/model/research/object_detection/builders/dataset_builder.py", line 228, in build
    batch_size = input_context.get_per_replica_batch_size(batch_size)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 516, in get_per_replica_batch_size
    raise ValueError("The global_batch_size %r is not divisible by "
ValueError: The global_batch_size 16 is not divisible by num_replicas_in_sync 3

TannerGilbert commented 3 years ago

I never heard of the num_replicas_in_sync parameter, but I quickly searched for it in the object_detection folder, and it seems like it should be set to 1 by default.

For the problem at hand, a simple fix would probably be to set the batch_size inside the config file to a value that is divisible by three (for example 15 or 18), but I'd recommend opening an issue on the models repository instead so that the underlying problem can be found.
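As the traceback shows, the check fails inside get_per_replica_batch_size, which splits the global batch evenly across the replicas used by the distribution strategy (3 here, so 16 does not divide). The following is a minimal sketch of how to pick a valid batch_size for the config file. The helper name nearest_divisible_batch_sizes is hypothetical and is not part of TensorFlow or the Object Detection API; the replica count is assumed to be the value reported in the error message.

```python
def nearest_divisible_batch_sizes(global_batch_size, num_replicas):
    """Return the closest batch sizes (lower, upper) divisible by num_replicas.

    Hypothetical helper for illustration only; not a TensorFlow function.
    """
    if global_batch_size <= 0 or num_replicas <= 0:
        raise ValueError("both arguments must be positive integers")

    # Largest multiple of num_replicas that does not exceed the requested size.
    lower = (global_batch_size // num_replicas) * num_replicas
    if lower == global_batch_size:
        # Already divisible: nothing to change.
        return lower, lower

    upper = lower + num_replicas
    # A batch size of 0 is not usable, so fall back to one sample per replica.
    lower = max(lower, num_replicas)
    return lower, upper


if __name__ == "__main__":
    # Values taken from the error message: global_batch_size 16, 3 replicas.
    lower, upper = nearest_divisible_batch_sizes(16, 3)
    print("candidate batch_size values:", lower, upper)
    print("per-replica batch sizes:", lower // 3, upper // 3)
```

Either candidate can then be written into the batch_size field of the train_config section of the pipeline config. Whether the smaller or the larger value is preferable depends on available GPU memory.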

Shaiken commented 3 years ago

Thank you very much, that works! I have another question. Because I want to build a Docker image, I reinstalled everything, and now I get a strange error:

Traceback (most recent call last):
  File "model_main_tf2_1.py", line 114, in <module>
    tf.compat.v1.app.run()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "model_main_tf2_1.py", line 104, in main
    model_lib_v2.train_loop(
  File "/code/model/research/object_detection/model_lib_v2.py", line 561, in train_loop
    load_fine_tune_checkpoint(detection_model,
  File "/code/model/research/object_detection/model_lib_v2.py", line 339, in load_fine_tune_checkpoint
    if not is_object_based_checkpoint(checkpoint_path):
  File "/code/model/research/object_detection/model_lib_v2.py", line 302, in is_object_based_checkpoint
    var_names = [var[0] for var in tf.train.list_variables(checkpoint_path)]
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/checkpoint_utils.py", line 112, in list_variables
    reader = load_checkpoint(ckpt_dir_or_file)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/checkpoint_utils.py", line 67, in load_checkpoint
    return py_checkpoint_reader.NewCheckpointReader(filename)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/py_checkpoint_reader.py", line 99, in NewCheckpointReader
    error_translator(e)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/py_checkpoint_reader.py", line 48, in error_translator
    raise errors_impl.OpError(None, None, error_message, errors_impl.UNKNOWN)
tensorflow.python.framework.errors_impl.OpError: not an sstable (bad magic number)


In my training config, fine_tune_checkpoint is set to: "/code/model/research/object_detection/faster_rcnn_resnet50_v1_800x1333_coco17_gpu-8/checkpoint/ckpt-0"

That looks good to me.
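One thing that can be checked before digging deeper, offered only as a sketch and not as a confirmed fix for this error: a TF2 checkpoint prefix such as ckpt-0 is not a file itself but refers to a ckpt-0.index file plus one or more ckpt-0.data-* shards. If those files are missing, empty, or were copied incompletely into the Docker image, the checkpoint reader cannot parse them. The helper check_checkpoint_prefix below is hypothetical and uses only the Python standard library.

```python
import glob
import os


def check_checkpoint_prefix(prefix):
    """Return a list of problems found for a checkpoint prefix like '.../ckpt-0'.

    Hypothetical helper for illustration only; it checks file presence and
    size, not whether the contents are a valid checkpoint.
    """
    problems = []

    # The index file holds the table that the checkpoint reader opens first.
    index_file = prefix + ".index"
    if not os.path.isfile(index_file):
        problems.append("missing index file: " + index_file)
    elif os.path.getsize(index_file) == 0:
        problems.append("empty index file: " + index_file)

    # The tensor values live in one or more data shards.
    data_files = sorted(glob.glob(glob.escape(prefix) + ".data-*"))
    if not data_files:
        problems.append("no data shards found for prefix: " + prefix)
    for path in data_files:
        if os.path.getsize(path) == 0:
            problems.append("empty data shard: " + path)

    return problems


if __name__ == "__main__":
    # Path copied from the fine_tune_checkpoint setting quoted in the issue.
    prefix = ("/code/model/research/object_detection/"
              "faster_rcnn_resnet50_v1_800x1333_coco17_gpu-8/checkpoint/ckpt-0")
    for problem in check_checkpoint_prefix(prefix):
        print(problem)
```

An empty result only means the expected files exist and are non-empty. If the error persists, re-downloading and re-extracting the pretrained model inside the image would rule out a truncated copy.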

TannerGilbert commented 3 years ago

Sorry for the late reply. I just took a look and didn't find a solution. Maybe open an issue on the models repository so more people can help you.

github-actions[bot] commented 3 years ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days