Open twangnh opened 6 years ago
I mean, at first ./logs is empty; there is no checkpoint there.
Hi @MrWanter, the training is done in four stages. In the first stage, it's true that ./logs has no checkpoint. On the other hand, in that stage we set self.fine_tune_vgg16 = False.
Please check the "Instructions for running the scripts" section in repository readme for more details.
thank you, I missed the first part.
@MrWanter Have you understood the training order? Based on your dialogue, I ran it in this order:
First: set self.fine_tune_vgg16 to False, with
self.train_dir = './logs/'
self.checkpoint_path = './trained_models/vgg16/vgg_16.ckpt'
This ran successfully, and I evaluated it with python3 evaluate_model.py.
Second: set self.fine_tune_vgg16 to True, and inside if self.fine_tune_vgg16: set
self.train_dir = './logs/finetune'
self.checkpoint_path = './logs'
The train_dir and checkpoint_path before if self.fine_tune_vgg16: stay as they were:
self.train_dir = './logs/'
self.checkpoint_path = './trained_models/vgg16/vgg_16.ckpt'
I didn't delete the results saved in logs from the first stage, because I think the second stage needs the checkpoint in it. Then I ran train_model.py, which produced a new checkpoint in './logs/finetune/', and ran python3 evaluate_model.py -f to evaluate it.
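A minimal sketch of the two-stage order described above, assuming the settings live on a params object as in the snippets quoted in this thread. The class name TrainParams is hypothetical and not taken from the repository; only the attribute names and paths come from the comment.

```python
class TrainParams:
    """Hypothetical stand-in for the settings object in train_model.py."""

    def __init__(self, fine_tune_vgg16):
        self.fine_tune_vgg16 = fine_tune_vgg16

        # Stage 1 defaults: start from the downloaded VGG16 weights.
        self.train_dir = './logs/'
        self.checkpoint_path = './trained_models/vgg16/vgg_16.ckpt'

        if self.fine_tune_vgg16:
            # Stage 2: these assignments replace the defaults above, so
            # training starts from the stage 1 checkpoint saved in ./logs.
            self.train_dir = './logs/finetune'
            self.checkpoint_path = './logs'


stage1 = TrainParams(fine_tune_vgg16=False)
stage2 = TrainParams(fine_tune_vgg16=True)
print(stage1.train_dir, stage1.checkpoint_path)
print(stage2.train_dir, stage2.checkpoint_path)
```

Under this reading, the overwrite only takes effect when fine_tune_vgg16 is True, which is why the stage 1 results in ./logs have to be kept before starting stage 2.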
Hi @LevinJ, I downloaded the vgg16.ckpt model as you suggested and ran train_model.py with the settings below:
self.train_dir = './logs'
self.checkpoint_path = '../data/trained_models/vgg16/vgg_16.ckpt'
self.checkpoint_exclude_scopes = g_ssd_model.model_name
self.trainable_scopes = g_ssd_model.model_name
self.learning_rate = 0.001
self.learning_rate_decay_type = 'fixed'
self.fine_tune_vgg16 = False
if self.fine_tune_vgg16:
.........
But I get this error message:
[[Node: train_op/CheckNumerics = CheckNumericsT=DT_FLOAT, message="LossTensor is inf or nan", _device="/job:localhost/replica:0/task:0/device:GPU:0"]] [[Node: Adam/update_ssd_300_vgg/block9_box/conv_loc/biases/ApplyAdam/_1948 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2719_Adam/update_ssd_300_vgg/block9_box/conv_loc/biases/ApplyAdam", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0
When I set self.fine_tune_vgg16 to True as below:
self.fine_tune_vgg16 = True
if self.fine_tune_vgg16:
#fine tune all parameters
self.train_dir = './logs/finetune'
self.checkpoint_path = './logs'
self.checkpoint_exclude_scopes = None
self.trainable_scopes = "{},vgg_16".format(g_ssd_model.model_name)
self.max_number_of_steps = 130000
I get the error message below:
NotFoundError (see above for traceback): Tensor name "ssd_300_vgg/BatchNorm/moving_mean" not found in checkpoint files ../data/trained_models/vgg16/vgg_16.ckpt
@bridgeZhang Make sure that the path to vgg_16.ckpt is right.
@Janezzliu Thank you for your answer. Yes, the path to vgg_16.ckpt is right. If I delete vgg_16.ckpt, the error message changes to: Failed to find any matching files for ../data/trained_models/vgg16/vgg_16.ckpt
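One possible reading of the NotFoundError reported above, sketched under the assumption that the training script follows the TF-slim convention of restoring every model variable whose name does not start with an excluded scope. The function variables_to_restore is illustrative and not copied from the repository, and 'ssd_300_vgg' is assumed to be the value of g_ssd_model.model_name.

```python
def variables_to_restore(model_variables, checkpoint_exclude_scopes):
    """Return the variable names that would be looked up in the checkpoint."""
    if not checkpoint_exclude_scopes:
        exclusions = []
    else:
        exclusions = [s.strip() for s in checkpoint_exclude_scopes.split(',')]
    # Keep a variable only if it falls under none of the excluded scopes.
    return [name for name in model_variables
            if not any(name.startswith(ex) for ex in exclusions)]


# Example names: one VGG16 variable and two SSD variables from the errors above.
model_variables = [
    'vgg_16/conv1/conv1_1/weights',
    'ssd_300_vgg/BatchNorm/moving_mean',
    'ssd_300_vgg/block9_box/conv_loc/biases',
]

# First stage: the SSD scope is excluded, so only VGG16 names are requested.
print(variables_to_restore(model_variables, 'ssd_300_vgg'))

# Fine-tune stage: checkpoint_exclude_scopes = None, so every name is requested.
print(variables_to_restore(model_variables, None))
```

With no exclusions, the SSD variables are requested from the checkpoint as well, and a plain vgg_16.ckpt would not contain them. If that is what happens here, the fine-tune stage needs to restore from a checkpoint produced by the first stage rather than from vgg_16.ckpt.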
Hi Levin, thanks for your work. I am confused about the two lines that set self.checkpoint_path:
https://github.com/LevinJ/SSD_tensorflow_VOC/blob/6ff0eec7b3428bc6d36afadf3d7acf2fd426aec9/train_model.py#L398
and then
https://github.com/LevinJ/SSD_tensorflow_VOC/blob/6ff0eec7b3428bc6d36afadf3d7acf2fd426aec9/train_model.py#L418
The second will overwrite the first one, so how does it find vgg16.ckpt?