allanzelener / YAD2K

YAD2K: Yet Another Darknet 2 Keras

Loss computed by model.predict() does not match Keras training output reporting #146

Closed kechan closed 6 years ago

kechan commented 6 years ago

This concerns running train_overfit.py. I tried running it with 1 epoch and added the following lines:

```python
loss = model.predict([image_data, boxes, detectors_mask, matching_true_boxes])
print('loss = {}'.format(loss))
```

immediately after model.save_weights('overfit_weights.h5')
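For reference, the surrounding code then looks roughly like this. This is my paraphrase of train_overfit.py, not a verbatim copy; the fit arguments may not match the script exactly. The key point is that the model's single output is the YOLO loss itself, which (as far as I can tell) is trained against dummy zero targets:

```python
# Sketch of the relevant section of train_overfit.py (paraphrased, not verbatim).
# The model's output is the YOLO loss, so fit() trains it against dummy
# zero targets with a loss function that just passes y_pred through.
model.fit([image_data, boxes, detectors_mask, matching_true_boxes],
          np.zeros(len(image_data)),  # dummy targets; the real loss is the model output
          batch_size=1,
          epochs=1)
model.save_weights('overfit_weights.h5')

# The lines I added:
loss = model.predict([image_data, boxes, detectors_mask, matching_true_boxes])
print('loss = {}'.format(loss))
```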

And here's the output:

```
Epoch 1/1
1/1 [==============================] - 13s 13s/step - loss: 145.3360
loss = [51752.61]
```

I thought the two losses should be roughly the same. Why is the model output (which is the loss) so much larger than the loss reported by the Keras training run? I must be missing something and need to correct my understanding of the code and framework.

(Note: I confirmed that for a simple loss such as categorical cross entropy, with 1 sample and 1 epoch, Keras seems to report the loss computed before the backprop+optimize step, so a comparison with a directly computed loss is off by one training step.)
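To illustrate that off-by-one point in isolation, here is a minimal sketch with a toy model of my own (not from this repo, and with no batch norm, so fit() and evaluate() share the same forward-pass behavior). The loss fit() prints for a 1-sample, 1-epoch run should match an evaluate() done before the fit, i.e. before the weight update:

```python
import numpy as np
from tensorflow import keras

# Hypothetical toy model: one dense softmax layer, no batch norm.
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(3, activation='softmax'),
])
model.compile(optimizer='sgd', loss='categorical_crossentropy')

x = np.random.rand(1, 4).astype('float32')
y = keras.utils.to_categorical([1], num_classes=3)

loss_before = model.evaluate(x, y, verbose=0)  # loss with the current weights
history = model.fit(x, y, epochs=1, verbose=0)  # one gradient step
loss_after = model.evaluate(x, y, verbose=0)   # loss after the update

# fit() reports the loss computed on the forward pass, i.e. before the
# weight update, so it matches loss_before rather than loss_after.
print(history.history['loss'][0], loss_before, loss_after)
```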

kechan commented 6 years ago

Update:

I also tried running 100 epochs, so that the loss no longer changes dramatically from step to step. I got this:

```
Epoch 100/100
1/1 [==============================] - 0s 277ms/step - loss: 5.5116
loss = [10.545552]
```

Both losses (5.5 vs. 10.5) are larger than what I expect.

kechan commented 6 years ago

Update: I did more investigation and found that this may be "expected". Due to the presence of batch norm layers in the YOLO model, .predict(...) and .fit(...) most likely take different execution paths in the forward pass through the batch norm layers (i.e., in how the stored "mean" and "variance" are used or updated). Another consequence: even if you set the learning rate to zero, the model's output for the same image_data will change after you run model.fit(...) a few times. Batch norm's moving mean and variance probably still get updated.
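Here is a minimal sketch of the zero-learning-rate observation, again on a toy model of my own rather than the YOLO model. With SGD at learning rate 0 the trainable weights never move, yet predict() output still drifts after fit(), because batch norm's moving mean/variance are updated during training-mode forward passes:

```python
import numpy as np
from tensorflow import keras

# Hypothetical toy model containing a batch norm layer.
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(8),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(1),
])
# Zero learning rate: gradient updates cannot change any trainable weight.
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.0), loss='mse')

x = np.random.rand(16, 4).astype('float32')
y = np.random.rand(16, 1).astype('float32')

p0 = model.predict(x, verbose=0)
model.fit(x, y, epochs=5, verbose=0)  # BN moving statistics still update here
p1 = model.predict(x, verbose=0)

# Nonzero difference: predict() uses the BN moving mean/variance,
# which fit() changed even though the learning rate was zero.
print(np.abs(p0 - p1).max())
```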

This has been a good learning lesson for me personally. I created a very simple feed-forward net and confirmed the behavior is there as well once you add a batch norm layer. If you take away the batch norm, the forward-pass values (including the loss) are exactly the same for .fit(...) and .predict(...).
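A sketch of that simple feed-forward check (assumptions: my own toy architecture, MSE loss, learning rate 0 so the weights stay fixed during the comparison). Without batch norm the loss fit() reports and the loss from evaluate() agree exactly; with batch norm they diverge, since fit() normalizes with batch statistics while evaluate()/predict() use the moving statistics:

```python
import numpy as np
from tensorflow import keras

def fit_vs_eval_loss(use_batchnorm):
    """Compare the loss fit() reports with evaluate() on the same weights."""
    layers = [keras.Input(shape=(4,)), keras.layers.Dense(8, activation='relu')]
    if use_batchnorm:
        layers.append(keras.layers.BatchNormalization())
    layers.append(keras.layers.Dense(1))
    model = keras.Sequential(layers)
    # Learning rate 0 keeps the weights identical for both measurements.
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.0), loss='mse')

    x = np.random.rand(32, 4).astype('float32')
    y = np.random.rand(32, 1).astype('float32')

    eval_loss = model.evaluate(x, y, verbose=0)  # inference-mode forward pass
    fit_loss = model.fit(x, y, epochs=1, batch_size=32,
                         verbose=0).history['loss'][0]  # training-mode forward pass
    return fit_loss, eval_loss

print(fit_vs_eval_loss(use_batchnorm=False))  # the two numbers match
print(fit_vs_eval_loss(use_batchnorm=True))   # they differ: batch vs. moving stats
```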

The overfit script is a very contrived setting, and I just got sidetracked confusing myself over all of this.

Closing this.