PracticalDL / Practical-Deep-Learning-Book

Official code repo for the O'Reilly Book - Practical Deep Learning for Cloud, Mobile & Edge
http://practicaldeeplearning.ai
MIT License

Bad inference after 70k step using faster_rcnn_resnet152_pets (model zoo) #135

Closed bm777 closed 4 years ago

bm777 commented 4 years ago

Hello everyone, I am training a bird species object detector. I have 7 classes (525 images, 75 pictures per class) and am using:

Software and libraries:

model:

Laptop:

Current state: loss: 1.4247278, step: 75185. The loss is still decreasing, but slowly.

I started the training yesterday, stopped it, and resumed this morning from the saved checkpoint. At this stage, I saved and exported the model.*-7108 checkpoint to test prediction, but I got no predictions and no boxes were drawn. Only when I lower the threshold from 0.6 to 0.1 do I get some predictions, both false and true.

My question is: does that mean my model has not learned well? Or should I continue training until I reach 200k steps?
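
The effect of the score threshold described above can be illustrated with a minimal sketch. The detections below are made-up values, not output from the model in this thread, and `filter_by_score` and `count_per_threshold` are hypothetical helper names. The sketch only shows why boxes that all score below 0.6 disappear at that threshold and reappear at 0.1.

```python
# Hypothetical detections as (class_name, score) pairs.
# The class names and scores are invented for illustration only.
detections = [
    ("bird_a", 0.58),
    ("bird_b", 0.31),
    ("bird_a", 0.12),
    ("bird_c", 0.05),
]


def filter_by_score(dets, min_score):
    """Keep only detections whose confidence is at least min_score."""
    return [d for d in dets if d[1] >= min_score]


def count_per_threshold(dets, thresholds):
    """Return {threshold: number of detections kept} for each threshold."""
    return {t: len(filter_by_score(dets, t)) for t in thresholds}


if __name__ == "__main__":
    for t, n in count_per_threshold(detections, [0.6, 0.3, 0.1]).items():
        print(f"threshold={t}: {n} boxes drawn")
```

If every score sits well below the threshold, that usually says the model is unsure rather than that the drawing code is broken, so looking at the raw scores is a cheap first check.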

still

bm777 commented 4 years ago

And now it is still learning: still2

sidgan commented 4 years ago

Hi @bm777, can you please clarify what code you are using, e.g. which notebook in the chapters you are referring to, and what dataset you are using?

For the loss plots - are they validation or training?

bm777 commented 4 years ago

Hi @sidgan, thanks in advance.

After labeling the dataset (with LabelImg), I converted the XML files to train.csv and test.csv (plus the label map) and then generated the TFRecords. The dataset I am using was created by me; it is available in my google_drive. I am using the code recommended in the last line of README.md for Chapter 14 (building a perfect cat locator), i.e. the code from the TensorFlow repo. The loss plot in blue is for validation; the one for training is here in orange:

loss

# The execution:
python model_main.py --logtostderr \
            --model_dir=training \
            --pipeline_config_path=training/faster_rcnn_resnet152_pets.config
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Binary to run train and evaluation on object detection model."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from absl import flags

import tensorflow.compat.v1 as tf

from object_detection import model_lib
tf.logging.set_verbosity(tf.logging.INFO)

flags.DEFINE_string(
    'model_dir', None, 'Path to output model directory '
    'where event and checkpoint files will be written.')
flags.DEFINE_string('pipeline_config_path', None, 'Path to pipeline config '
                    'file.')
flags.DEFINE_integer('num_train_steps', None, 'Number of train steps.')
flags.DEFINE_boolean('eval_training_data', False,
                     'If training data should be evaluated for this job. Note '
                     'that one call only use this in eval-only mode, and '
                     '`checkpoint_dir` must be supplied.')
flags.DEFINE_integer('sample_1_of_n_eval_examples', 1, 'Will sample one of '
                     'every n eval input examples, where n is provided.')
flags.DEFINE_integer('sample_1_of_n_eval_on_train_examples', 5, 'Will sample '
                     'one of every n train input examples for evaluation, '
                     'where n is provided. This is only used if '
                     '`eval_training_data` is True.')
flags.DEFINE_string(
    'checkpoint_dir', None, 'Path to directory holding a checkpoint.  If '
    '`checkpoint_dir` is provided, this binary operates in eval-only mode, '
    'writing resulting metrics to `model_dir`.')
flags.DEFINE_boolean(
    'run_once', False, 'If running in eval-only mode, whether to run just '
    'one round of eval vs running continuously (default).'
)
flags.DEFINE_integer(
    'max_eval_retries', 0, 'If running continuous eval, the maximum number of '
    'retries upon encountering tf.errors.InvalidArgumentError. If negative, '
    'will always retry the evaluation.'
)
FLAGS = flags.FLAGS

def main(unused_argv):
  flags.mark_flag_as_required('model_dir')
  flags.mark_flag_as_required('pipeline_config_path')
  config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir)

  train_and_eval_dict = model_lib.create_estimator_and_inputs(
      run_config=config,
      pipeline_config_path=FLAGS.pipeline_config_path,
      train_steps=FLAGS.num_train_steps,
      sample_1_of_n_eval_examples=FLAGS.sample_1_of_n_eval_examples,
      sample_1_of_n_eval_on_train_examples=(
          FLAGS.sample_1_of_n_eval_on_train_examples))
  estimator = train_and_eval_dict['estimator']
  train_input_fn = train_and_eval_dict['train_input_fn']
  eval_input_fns = train_and_eval_dict['eval_input_fns']
  eval_on_train_input_fn = train_and_eval_dict['eval_on_train_input_fn']
  predict_input_fn = train_and_eval_dict['predict_input_fn']
  train_steps = train_and_eval_dict['train_steps']

  if FLAGS.checkpoint_dir:
    if FLAGS.eval_training_data:
      name = 'training_data'
      input_fn = eval_on_train_input_fn
    else:
      name = 'validation_data'
      # The first eval input will be evaluated.
      input_fn = eval_input_fns[0]
    if FLAGS.run_once:
      estimator.evaluate(input_fn,
                         steps=None,
                         checkpoint_path=tf.train.latest_checkpoint(
                             FLAGS.checkpoint_dir))
    else:
      model_lib.continuous_eval(estimator, FLAGS.checkpoint_dir, input_fn,
                                train_steps, name, FLAGS.max_eval_retries)
  else:
    train_spec, eval_specs = model_lib.create_train_and_eval_specs(
        train_input_fn,
        eval_input_fns,
        eval_on_train_input_fn,
        predict_input_fn,
        train_steps,
        eval_on_train_data=False)

    # Currently only a single Eval Spec is allowed.
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])

# gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.65)
# sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
# with sess.as_default():
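if __name__ == '__main__':
  # Enable memory growth on every visible GPU so TensorFlow does not reserve
  # all GPU memory up front. With no GPU present the loop is simply skipped,
  # so training still starts on CPU.
  for gpu in tf.config.experimental.list_physical_devices('GPU'):
    try:
      tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
      # Memory growth must be set before the GPUs have been initialized.
      print(e)
  tf.app.run()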

sidgan commented 4 years ago

It looks like the ResNet152 family of models is not available in the TensorFlow 1.x versions, which you seem to be using. In other words, you are using a ResNet152 TensorFlow 2.0 model in TensorFlow 1.x, and the two are not interoperable. For reference, TensorFlow updated its Object Detection API to support 2.0 about four weeks ago; you can read more here. Look at the Model Zoo for TensorFlow 1.x and 2.x; each lists the models available for that version.

I'd suggest starting with a fast, reliable model such as ssdlite_mobilenet_v2_coco, which is available in the TensorFlow 1.x Object Detection API.

In case you haven't done so already, I'd also recommend first trying the task and code from Chapter 14, as that has been tried and tested to work as of Oct 2019. Then start adding your dataset and other options as necessary.

Thinking out loud about how a neural network learns: it needs to see enough repeated patterns per class. The collected dataset is a great start. However, some images show birds sitting, others flying, others with open wings, so with just 75 images per class there may be too few repeating patterns for it to learn well. Eventually the detector may learn to pick out the location of a beak or head, etc., so ideally it needs more patterns to return a high-confidence prediction. For classification the dataset would probably suffice, and the model would learn the color of the beak, feathers, etc.
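
One way to check how much each class is really represented is to count boxes and distinct images per class in the annotation CSV. The sketch below is only an illustration. It assumes the CSV has `filename` and `class` columns, as many LabelImg XML-to-CSV scripts produce, but the actual layout of train.csv in this thread is not shown. The sample rows and class names are invented.

```python
import csv
import io
from collections import Counter

# Made-up stand-in for train.csv. The column names are an assumption and the
# rows are invented for illustration.
SAMPLE_CSV = """filename,width,height,class,xmin,ymin,xmax,ymax
img1.jpg,640,480,sparrow,10,20,200,220
img1.jpg,640,480,sparrow,300,40,420,200
img2.jpg,640,480,robin,50,60,300,400
img3.jpg,640,480,sparrow,5,5,100,100
"""


def boxes_per_class(csv_file):
    """Count annotated boxes per class (one CSV row is one box)."""
    return Counter(row["class"] for row in csv.DictReader(csv_file))


def images_per_class(csv_file):
    """Count distinct images that contain at least one box of each class."""
    seen = {(row["class"], row["filename"]) for row in csv.DictReader(csv_file)}
    return Counter(cls for cls, _ in seen)


if __name__ == "__main__":
    print(boxes_per_class(io.StringIO(SAMPLE_CSV)))
    print(images_per_class(io.StringIO(SAMPLE_CSV)))
```

For a real file, `open("train.csv", newline="")` would replace the `io.StringIO` wrapper. A class with far fewer boxes or images than the others is a natural candidate for more data collection.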

Exciting work! Running a detector on Nat Geo's video of birds would be rad. I look forward to some cool results :)

bm777 commented 4 years ago

Thank you very much for the kind words about the dataset. I saw in the link you gave me that the problem was the non-interoperability of TensorFlow 2.x Model Zoo models with TensorFlow 1.x. Thanks :)

I tested the code from Chapter 14; it worked very well and the cat detection score was about 94% :) I was happy...

Based on your recommendation:

Thanks for the appreciation; I will let you know if it works.

bm777 commented 4 years ago

Hi @sidgan

I read some object detection papers on one-stage and two-stage detectors, and I saw that they used more than 900k steps to obtain the best results. In my case, I don't know whether I am getting some bad detections because I am still at 500k steps. Or should I continue to augment the dataset with more images (sitting, flying, open wings, and open beak), something like more than 200 images per class?

I tested ssdlite_mobilenet_v2_coco and got some results after 400k steps.

I tried faster_rcnn_inception_v2_coco with lr=0.002 for steps 0-300k and lr=0.0001 for steps 300k-500k. The plots below relate to the faster_rcnn_inception_v2_coco model. Here is the plot of the training loss:

loss

Here is the plot of the validation loss: loss_val
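
The two-phase learning rate mentioned above (0.002 up to 300k steps, then 0.0001) is a piecewise-constant schedule. The following is a minimal sketch of that idea in plain Python, using the values from this comment. The function name and argument layout are assumptions for illustration, not the Object Detection API's own code.

```python
def manual_step_lr(step, initial_lr=0.002, schedule=((300_000, 0.0001),)):
    """Piecewise-constant learning rate.

    schedule is a sequence of (boundary_step, learning_rate) pairs sorted by
    boundary_step. The rate of the last boundary that `step` has reached is
    used; before the first boundary, initial_lr applies.
    """
    lr = initial_lr
    for boundary, new_lr in schedule:
        if step >= boundary:
            lr = new_lr
    return lr


if __name__ == "__main__":
    for s in (0, 299_999, 300_000, 500_000):
        print(s, manual_step_lr(s))
```

In the Object Detection API pipeline config, a schedule of this shape is typically written as a `manual_step_learning_rate` block inside the optimizer section, so the change does not require stopping and restarting training by hand.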

After 400k steps, however, I got some good results (though with two detections for a single bird), with the best accuracy like here:

  1. For an individual bird in the image: 2detection

best1

  2. For 2+ birds in the image: bad1
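
Two boxes on a single bird are the kind of duplicate that non-maximum suppression (NMS) is meant to remove. Below is a minimal, class-agnostic sketch of greedy NMS in plain Python. Boxes are assumed to be `(ymin, xmin, ymax, xmax)` tuples, and the example boxes and scores are made up. It illustrates the technique only and is not the post-processing code used by the model in this thread.

```python
def iou(a, b):
    """Intersection over union of two (ymin, xmin, ymax, xmax) boxes."""
    inter_h = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    # Invented example: two near-identical boxes plus one separate box.
    boxes = [(0.1, 0.1, 0.5, 0.5), (0.12, 0.1, 0.52, 0.5), (0.6, 0.6, 0.9, 0.9)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))
```

If the two boxes on one bird carry different class labels, per-class suppression will likely keep both, so a remaining duplicate can also point to confusion between two similar species rather than to a threshold problem.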

sidgan commented 4 years ago

This looks promising. Great work!!

I've often found myself in the situation you are in: trying to figure out how to increase the accuracy of a model and which method to employ, whether to add data or to train longer. The method I use, which I've learned from years in the industry, is:

  1. Create a pipeline that can perform all the steps you aim for. This makes it easier to add improvements later and see their results. Ideally, the pipeline should be simple, efficient, and just work. As mentioned in Chapter 7, this habit includes starting from a good existing model, such as one from a Model zoo or TensorFlow depot or something similar.
  2. Based on the results you get from the first pipeline, iterate and improve on the dataset, as this usually gives the most improvement. Most research papers train on datasets like COCO, with hundreds of thousands of images across 80+ classes, and try to reach the state of the art on the metrics. For them, training for a long time makes sense, as even minor improvements help a lot in achieving state-of-the-art results when comparing against other papers. In contrast, the bird dataset is much smaller, so I would suggest spending some time on the dataset and increasing the number of images and augmentations; that will provide more improvement than training for longer. Augmentations will help, but having more diversity in the original images will be even better. One trick is to notice which birds give the worst results and start adding images of those birds. To do this, you can use the Fatkun browser extension (covered in Chapters 8 & 12) to download more labeled images.
  3. Perform multiple experiments in parallel, modifying the model architecture and various hyperparameters.

Loop on 2 and 3 until you've attained the desired accuracy.
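
The trick in step 2 of noticing which birds give the worst results can be made concrete with a per-class tally. This is a minimal sketch under stated assumptions: each test object has already been marked as correctly detected or not (for example by matching predictions to ground truth at some IoU threshold), and the class names and records are invented. `per_class_recall` and `worst_classes` are hypothetical helpers.

```python
from collections import defaultdict


def per_class_recall(records):
    """records: iterable of (true_class, detected_correctly) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for cls, ok in records:
        totals[cls] += 1
        if ok:
            hits[cls] += 1
    return {cls: hits[cls] / totals[cls] for cls in totals}


def worst_classes(records, n=2):
    """Return the n classes with the lowest recall, worst first."""
    recall = per_class_recall(records)
    # Sort by recall, then by name so ties are ordered deterministically.
    return sorted(recall, key=lambda c: (recall[c], c))[:n]


if __name__ == "__main__":
    # Invented evaluation results for three hypothetical classes.
    records = [
        ("a", True), ("a", True),
        ("b", False), ("b", True),
        ("c", False), ("c", False),
    ]
    print(per_class_recall(records))
    print(worst_classes(records))
```

The classes returned by `worst_classes` are the ones where extra, more diverse images are most likely to pay off in the next iteration of the loop.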

The website https://paperswithcode.com/sota/object-detection-on-coco is also good for looking at various publications with available code and how they compare against others in benchmarks.

bm777 commented 4 years ago

Hi @sidgan, thank you for your kind words.

I have noted your recommendation. I will loop on 2 and 3 until I get the desired accuracy. You are right: according to you, it is not the training time that gives the best result, but a combination of things like:

The remaining steps for me after your last response:

  1. First, I will spend more time on the dataset, trying to collect as many images as possible with more diversity, 300+ per class. It is a lot, but I should do it :(
  2. Second, I will create a simple data augmentation pipeline using tf.image (resize, rotate, blur, saturation, grayscale, flip, and brightness). The new DS will be DS7.
  3. Finally, the training. Following the link you shared with me (I will select EfficientDet, which is ranked #1 on Object Detection on COCO minival), I will train with lr=0.0001 using architecture_model=faster_rcnn_inception_v2_coco first, then EfficientDet, and compare the results.
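
One detail worth keeping in mind for the augmentation step: for object detection, geometric augmentations such as flips and rotations have to be applied to the bounding boxes as well as to the pixels, while color changes (brightness, saturation, grayscale) leave the boxes untouched. The sketch below shows the box arithmetic for a horizontal flip in plain Python. The image is a nested list standing in for a tensor, and boxes are assumed to be normalized `[ymin, xmin, ymax, xmax]`. It illustrates the idea only, with `tf.image.flip_left_right` being the pixel-side equivalent in TensorFlow.

```python
def flip_left_right(image, boxes):
    """Horizontally flip an image and its normalized bounding boxes.

    image: list of rows, each row a list of pixel values.
    boxes: list of [ymin, xmin, ymax, xmax] with coordinates in [0, 1].
    """
    flipped_image = [list(reversed(row)) for row in image]
    # x coordinates are mirrored around the vertical center line, so the old
    # right edge becomes the new left edge. y coordinates do not change.
    flipped_boxes = [
        [ymin, 1.0 - xmax, ymax, 1.0 - xmin]
        for ymin, xmin, ymax, xmax in boxes
    ]
    return flipped_image, flipped_boxes


if __name__ == "__main__":
    # Tiny invented 2x3 "image" with one box hugging the left edge.
    image = [[1, 2, 3], [4, 5, 6]]
    boxes = [[0.0, 0.0, 1.0, 0.25]]
    print(flip_left_right(image, boxes))
```

After the flip, the box that hugged the left edge hugs the right edge. Skipping this box update is a common way for an augmented dataset to end up with labels that no longer match the birds.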

Thank you again for your help.

EfficientDet, code

sidgan commented 4 years ago

I'm closing this issue now since there has been no activity in the past 2 weeks.

bm777 commented 4 years ago

I'm closing this issue now since there has been no activity in the past 2 weeks.

Okay. But I will reopen it once I have finished collecting my bird dataset if I run into any issues.

Many thanks again for your help.