Object detection - best weights never saved

cypamigon commented 1 week ago

Hello,

I'm trying to train an object detection model based on a custom dataset. I'm following the instructions provided in the README of the object_detection/src folder.

I've modified the user_config.yaml file according to my needs and I'm running the training script with python stm32ai_main.py.

According to the instructions, the best model weights obtained since the beginning of training should be saved automatically in the /experiments_outputs/"%Y_%m_%d_%H_%M_%S"/saved_models/ folder. However, the weights are never saved during training (there is no best_weights.h5 in the folder).

At the end of the training process, when the script tries to load the weights, an error is raised because the path doesn't exist!

I've tried modifying the keras.callbacks.ModelCheckpoint parameters to save the weights at the end of each epoch (even if they are not the best) and it works (best_weights.h5 is saved in the saved_models folder).

I've replaced:

    # Add the Keras callback that saves the best model obtained so far
    callback = tf.keras.callbacks.ModelCheckpoint(
                        filepath= os.path.join(output_dir, saved_models_dir, model_file_name),
                        save_best_only=True,
                        save_weights_only=save_only_weights, #save_only_weights = True
                        monitor="val_loss",
                        mode="min")
    callback_list.append(callback)

with:

    # Add the Keras callback that saves the best model obtained so far
    callback = tf.keras.callbacks.ModelCheckpoint(
                        filepath= os.path.join(output_dir, saved_models_dir, model_file_name),
                        save_best_only=False,
                        save_weights_only=save_only_weights, #save_only_weights = True
                        monitor="val_loss",
                        mode="min")
    callback_list.append(callback)

However, I would like to save only the best weights obtained since the beginning of the training, in order to get the best-performing model. Do you have any idea what could prevent the script from saving the best_weights.h5 file when the save_best_only parameter is set to True?

I'm running the script on Windows 10 in a st_zoo virtual env, as detailed in the repository README.

Here is my user_config.yaml file:

general:
  project_name: Cup_Detection
  model_type: ssd_mobilenet_v2_fpnlite
  model_path: ../pretrained_models/ssd_mobilenet_v2_fpnlite/ST_pretrainedmodel_public_dataset/coco_2017_person/ssd_mobilenet_v2_fpnlite_035_416/ssd_mobilenet_v2_fpnlite_035_416.h5 #../pretrained_models/ssd_mobilenet_v2_fpnlite/ST_pretrainedmodel_public_dataset/coco_2017_person/ssd_mobilenet_v2_fpnlite_035_416/ssd_mobilenet_v2_fpnlite_035_416_int8.tflite
  logs_dir: logs
  saved_models_dir: saved_models
  gpu_memory_limit: 16
  global_seed: 127

operation_mode: chain_tqe
#choices=['training' , 'evaluation', 'deployment', 'quantization', 'benchmarking',
#        'chain_tqeb','chain_tqe','chain_eqe','chain_qb','chain_eqeb','chain_qd ']

dataset:
  name: custom_cup_dataset
  class_names: [ cup ]
  training_path: ../datasets/cup_images_dataset/train
  validation_path: ../datasets/cup_images_dataset/val
  test_path: ../datasets/cup_images_dataset/test
  quantization_path:
  quantization_split: 0.3

preprocessing:
  rescaling: { scale: 1/127.5, offset: -1 }
  resizing:
    aspect_ratio: fit
    interpolation: nearest
  color_mode: rgb

data_augmentation:
  rotation: 30
  shearing: 15
  translation: 0.1
  vertical_flip: 0.5
  horizontal_flip: 0.2
  gaussian_blur: 3.0
  linear_contrast: [ 0.75, 1.5 ]

training:
  model:
    alpha: 0.35
    input_shape: (416, 416, 3)
    pretrained_weights: imagenet
  dropout:
  batch_size: 64
  epochs: 5000
  optimizer:
    Adam:
      learning_rate: 0.001
  callbacks:
    ReduceLROnPlateau:
      monitor: val_loss
      patience: 20
    EarlyStopping:
      monitor: val_loss
      patience: 40

postprocessing:
  confidence_thresh: 0.6
  NMS_thresh: 0.5
  IoU_eval_thresh: 0.3
  plot_metrics: True   # Plot precision versus recall curves. Default is False.
  max_detection_boxes: 10

quantization:
  quantizer: TFlite_converter
  quantization_type: PTQ
  quantization_input_type: float
  quantization_output_type: uint8
  export_dir: quantized_models

benchmarking:
  board: STM32H747I-DISCO

tools:
  stm32ai:
    version: 8.1.0
    optimization: balanced
    on_cloud: True
    path_to_stm32ai: C:/Users/<XXXXX>/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/<*.*.*>/Utilities/windows/stm32ai.exe
  path_to_cubeIDE: C:/ST/STM32CubeIDE_1.10.1/STM32CubeIDE/stm32cubeide.exe

deployment:
  c_project_path: ../../stm32ai_application_code/object_detection/
  IDE: GCC
  verbosity: 1
  hardware_setup:
    serie: STM32H7
    board: STM32H747I-DISCO

mlflow:
  uri: ./experiments_outputs/mlruns

hydra:
  run:
    dir: ./experiments_outputs/${now:%Y_%m_%d_%H_%M_%S}

RSERSTM commented 1 week ago

Hello Cypamigon,

After some investigation with the provided YAML file, we couldn't reproduce the issue of best_weights.h5 being absent from /experiments_outputs/"%Y_%m_%d_%H_%M_%S"/saved_models/. Because you are on Windows, maybe you forgot to raise the maximum path length, which is limited to 260 characters by default. To change this, you can follow the instructions in the TIP section at the end of the main README.
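
One quick way to rule the path-length explanation in or out is to measure the absolute path the checkpoint would be written to. The following is a minimal sketch, not code from the model zoo: the helper name `exceeds_windows_max_path` and the example run folder are hypothetical, and 260 is the classic Windows `MAX_PATH` limit that applies when long-path support is not enabled.

```python
import os

# Classic Windows MAX_PATH limit (it includes the terminating null character).
WINDOWS_MAX_PATH = 260


def exceeds_windows_max_path(path: str, limit: int = WINDOWS_MAX_PATH) -> bool:
    """Return True if the absolute form of `path` does not fit within `limit`."""
    absolute = os.path.abspath(path)
    # The limit counts the trailing null, so the usable length is limit - 1.
    return len(absolute) >= limit


if __name__ == "__main__":
    # Hypothetical run folder, following the hydra.run.dir pattern from the config.
    checkpoint = os.path.join(
        "experiments_outputs", "2024_01_01_12_00_00", "saved_models", "best_weights.h5"
    )
    print(len(os.path.abspath(checkpoint)), exceeds_windows_max_path(checkpoint))
```

If this reports a length well under the limit when run from the object_detection/src folder, the path length is probably not the cause.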

Thanks,

cypamigon commented 1 week ago

Thanks for your quick feedback.

Unfortunately, I've already enabled Windows long path support. I've also tried changing the output path, but it behaves the same.

RSERSTM commented 1 week ago

OK, another explanation could be that the ssd_mobilenet_v2_fpnlite_035_416.h5 model we provide, which was trained on person detection, kept information from its previous training, in particular its best val_loss. If so, best_weights.h5 is never saved because the val_loss of your new training stays higher than that stored best val_loss. If this is the case, a workaround could be to train for just 1 epoch with save_best_only=False, stop the training, then launch another training with save_best_only=True, this time using the best_weights.h5 from the first run as the starting model (set its path in general.model_path).
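
To make this hypothesis concrete, here is a minimal pure-Python sketch of the comparison that a best-only checkpoint in "min" mode performs. It is an illustration only, not the Keras implementation: `epochs_that_save` and the loss values are made up, and whether a restored .h5 model really carries over a previous best value is exactly the assumption being tested.

```python
import math
from typing import List


def epochs_that_save(val_losses: List[float], initial_best: float = math.inf) -> List[int]:
    """Return the 1-based epochs at which a 'min'-mode best-only checkpoint would save."""
    best = initial_best
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        # A save happens only on a strict improvement over the best value so far.
        if loss < best:
            best = loss
            saved.append(epoch)
    return saved


if __name__ == "__main__":
    losses = [2.4, 1.9, 2.1, 1.7]  # made-up validation losses
    print(epochs_that_save(losses))                    # fresh start
    print(epochs_that_save(losses, initial_best=0.8))  # lower best carried over
```

With a fresh start, epochs 1, 2 and 4 trigger a save. With a carried-over best of 0.8, no epoch does, so no file is ever written, which would match the missing best_weights.h5.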

Thanks,

cypamigon commented 1 week ago

Hmm, okay, looks promising. I'm currently running a training session with save_best_only=False. I'll try your solution once it finishes.

Thanks!

STMicroelectronics / stm32ai-modelzoo

Object detection - best weights never saved #39