self.checkpoint_path confusion

twangnh commented 6 years ago

Hi Levin, thanks for your work. I'm confused by the two lines that set self.checkpoint_path:

https://github.com/LevinJ/SSD_tensorflow_VOC/blob/6ff0eec7b3428bc6d36afadf3d7acf2fd426aec9/train_model.py#L398

then https://github.com/LevinJ/SSD_tensorflow_VOC/blob/6ff0eec7b3428bc6d36afadf3d7acf2fd426aec9/train_model.py#L418

The second assignment overwrites the first one, so how does it find vgg16.ckpt?

twangnh commented 6 years ago

I mean, at first ./logs is empty, so there is no checkpoint there.

LevinJ commented 6 years ago

Hi @MrWanter, the training is done in four stages. In the first stage it's true that ./logs has no checkpoint, but in that stage we set self.fine_tune_vgg16 = False, so the second assignment (inside if self.fine_tune_vgg16:) is not executed and self.checkpoint_path still points to the VGG16 checkpoint.

Please check the "Instructions for running the scripts" section in the repository README for more details.

twangnh commented 6 years ago

Thank you, I missed the first part.

Janezzliu commented 6 years ago

@MrWanter Have you understood the training order? Based on your dialogue, I ran it in this order:

First: set self.fine_tune_vgg16 to False, with self.train_dir = './logs/' and self.checkpoint_path = './trained_models/vgg16/vgg_16.ckpt'. It ran successfully, and I evaluated it with python3 evaluate_model.py.

Second: set self.fine_tune_vgg16 to True, and inside if self.fine_tune_vgg16: use self.train_dir = './logs/finetune' and self.checkpoint_path = './logs'. The train_dir and checkpoint_path before if self.fine_tune_vgg16: stay as they were: self.train_dir = './logs/' and self.checkpoint_path = './trained_models/vgg16/vgg_16.ckpt'. I didn't delete the results saved in ./logs from the first stage, because I think the second stage needs the checkpoint in it. Then I ran train_model.py, which wrote a new checkpoint to './logs/finetune/', and ran python3 evaluate_model.py -f to evaluate it.
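For readers following this order, here is a minimal sketch of how the two assignments interact. It is not the code from train_model.py: the TrainConfig class is a hypothetical stand-in, the model name "ssd_300_vgg" is assumed from the variable scope that appears in the error messages in this thread, and the paths are the ones quoted in the comment above. Only the branch structure is the point: the second assignment runs only when fine_tune_vgg16 is True.

```python
class TrainConfig:
    """Hypothetical stand-in for the settings block discussed in this thread."""

    def __init__(self, fine_tune_vgg16, model_name="ssd_300_vgg"):
        # model_name is an assumption; the thread uses g_ssd_model.model_name.
        # Defaults used when fine_tune_vgg16 is False: initialise from the
        # VGG16 checkpoint and train only the SSD-specific scope.
        self.train_dir = "./logs/"
        self.checkpoint_path = "./trained_models/vgg16/vgg_16.ckpt"
        self.checkpoint_exclude_scopes = model_name
        self.trainable_scopes = model_name

        self.fine_tune_vgg16 = fine_tune_vgg16
        if self.fine_tune_vgg16:
            # Fine-tuning: start from the checkpoint written to ./logs by the
            # earlier run and write new checkpoints to a separate directory.
            self.train_dir = "./logs/finetune"
            self.checkpoint_path = "./logs"
            self.checkpoint_exclude_scopes = None
            self.trainable_scopes = "{},vgg_16".format(model_name)


first = TrainConfig(fine_tune_vgg16=False)
second = TrainConfig(fine_tune_vgg16=True)

print(first.checkpoint_path)   # the VGG16 checkpoint path is kept
print(second.checkpoint_path)  # overwritten only in the fine-tuning case
```

With the flag set to False the override never runs, so an empty ./logs directory is not a problem for the first run.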

bridgeZhang commented 6 years ago

Hi @LevinJ, I downloaded the vgg16.ckpt model as you suggested and ran train_model.py with the settings below:

    self.train_dir = './logs'           
    self.checkpoint_path =  '../data/trained_models/vgg16/vgg_16.ckpt' 
    self.checkpoint_exclude_scopes = g_ssd_model.model_name
    self.trainable_scopes = g_ssd_model.model_name        

    self.learning_rate = 0.001
    self.learning_rate_decay_type = 'fixed'              

    self.fine_tune_vgg16 = False

    if self.fine_tune_vgg16:
      .........

but I get this error message:

[[Node: train_op/CheckNumerics = CheckNumericsT=DT_FLOAT, message="LossTensor is inf or nan", _device="/job:localhost/replica:0/task:0/device:GPU:0"]] [[Node: Adam/update_ssd_300_vgg/block9_box/conv_loc/biases/ApplyAdam/_1948 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2719_Adam/update_ssd_300_vgg/block9_box/conv_loc/biases/ApplyAdam", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0

bridgeZhang commented 6 years ago

When I set self.fine_tune_vgg16 to True as below:

    self.fine_tune_vgg16 = True

    if self.fine_tune_vgg16:
        #fine tune all parameters
        self.train_dir = './logs/finetune'
        self.checkpoint_path =  './logs'
        self.checkpoint_exclude_scopes = None
        self.trainable_scopes = "{},vgg_16".format(g_ssd_model.model_name)
        self.max_number_of_steps = 130000

I get the following error message:

NotFoundError (see above for traceback): Tensor name "ssd_300_vgg/BatchNorm/moving_mean" not found in checkpoint files ../data/trained_models/vgg16/vgg_16.ckpt
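One possible reading of this error, not confirmed in this thread: the message names vgg_16.ckpt, so the restore in that run appears to have read the VGG16 checkpoint rather than a checkpoint from ./logs. With checkpoint_exclude_scopes = None, nothing is filtered out, so variables under the SSD scope are also looked up in a file that only holds vgg_16 variables. The sketch below illustrates that idea in plain Python. The helper names and the short variable lists are hypothetical, and the prefix matching is an assumption about how exclude scopes are applied.

```python
def variables_to_restore(model_vars, exclude_scopes):
    """Return the model variables that are not under an excluded scope.

    exclude_scopes is a comma-separated string of scope prefixes, or None.
    """
    if not exclude_scopes:
        return list(model_vars)
    scopes = [s.strip() for s in exclude_scopes.split(",")]
    return [v for v in model_vars
            if not any(v.startswith(s) for s in scopes)]


def missing_from_checkpoint(model_vars, ckpt_vars, exclude_scopes):
    """List variables the restore would request but the checkpoint lacks."""
    available = set(ckpt_vars)
    return [v for v in variables_to_restore(model_vars, exclude_scopes)
            if v not in available]


# Illustrative names only; a real model has many more variables.
model_vars = [
    "vgg_16/conv1/conv1_1/weights",
    "ssd_300_vgg/BatchNorm/moving_mean",
]
vgg_ckpt_vars = ["vgg_16/conv1/conv1_1/weights"]

# Nothing excluded: the SSD variable is requested and cannot be found.
missing_without_exclusion = missing_from_checkpoint(model_vars, vgg_ckpt_vars, None)
# SSD scope excluded: only vgg_16 variables are requested.
missing_with_exclusion = missing_from_checkpoint(model_vars, vgg_ckpt_vars, "ssd_300_vgg")

print(missing_without_exclusion)
print(missing_with_exclusion)
```

If this reading applies, it is worth checking which checkpoint path the run actually used and whether ./logs contains a checkpoint from the earlier stage.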

Janezzliu commented 6 years ago

@bridgeZhang Make sure that the path to vgg_16.ckpt is right.

bridgeZhang commented 6 years ago

@Janezzliu Thank you for your answer. Yes, the path to vgg_16.ckpt is right. If I delete vgg_16.ckpt, the error message changes to: Failed to find any matching files for ../data/trained_models/vgg16/vgg_16.ckpt

self.checkpoint_path confusion #22