Playing around with this more, it seems like this is a side-effect of using the `SoftMax` activation function, as we only see the cost gradient mismatch when using `SoftMax` as the activation function on the output layer :thinking:

Is the `SoftMax` implementation correct? It passes our slope check tests, so the function and the derivative are at least correlated.
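For reference, a slope check is the same idea applied to a single activation function in isolation: nudge the input and confirm the implemented derivative matches the numerical slope of the function. Here's a minimal sketch of that kind of test, using a scalar `Sigmoid` as the example since its derivative is easy to verify by hand (the function names here are illustrative, not the repo's actual API):

```zig
const std = @import("std");

fn sigmoid(x: f64) f64 {
    return 1.0 / (1.0 + @exp(-x));
}

fn sigmoidDerivative(x: f64) f64 {
    const s = sigmoid(x);
    return s * (1.0 - s);
}

test "sigmoid derivative matches numerical slope" {
    const h = 0.0001;
    var x: f64 = -3.0;
    while (x <= 3.0) : (x += 0.25) {
        // Numerical slope via central difference.
        const slope = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h);
        try std.testing.expectApproxEqAbs(slope, sigmoidDerivative(x), 1e-6);
    }
}
```

A passing slope check like this only shows the function and its derivative are consistent at the sampled points, which matches the "at least correlated" caveat above.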
Fixed by https://github.com/MadLittleMods/zig-ocr-neural-network/pull/20 (see the PR description for more information on what was wrong)
(first explored in https://github.com/MadLittleMods/zig-ocr-neural-network/pull/1)
We can use `estimateCostGradientsForLayer(...)`, which closely estimates the cost gradient (the numerical gradient), and compare it against what we actually calculate to be the cost gradient (the analytical gradient). They should match! This is called "gradient checking."

To turn on gradient checking, set `should_gradient_check = true;` and adjust the `test_layer` in `sanityCheckGradients(...)` to whichever layer you want to compare.

https://github.com/MadLittleMods/zig-ocr-neural-network/blob/7d865616b12a76ebfb8385815312915e80b82c0a/src/neural_networks/neural_networks.zig#L218-L222
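For illustration, the comparison step itself can be as simple as a per-weight relative error between the two gradients. This `checkGradients` helper is a hypothetical sketch of the concept, not the repo's actual implementation:

```zig
const std = @import("std");

fn absF(x: f64) f64 {
    return if (x < 0) -x else x;
}

/// Compare the analytical gradient (from our derivative calculations)
/// against the numerical gradient (from finite differences). A relative
/// error near zero means they agree; a constant ratio like 2.0 across
/// every element is exactly the kind of smell described below.
fn checkGradients(analytical: []const f64, numerical: []const f64) bool {
    std.debug.assert(analytical.len == numerical.len);
    var all_match = true;
    for (analytical, numerical) |a, n| {
        // Relative error: |a - n| / max(|a|, |n|), guarding against 0/0.
        const denominator = @max(absF(a), absF(n));
        const relative_error = if (denominator == 0.0) 0.0 else absF(a - n) / denominator;
        if (relative_error > 1e-4) {
            std.debug.print("mismatch: analytical={d} numerical={d} (ratio {d})\n", .{ a, n, n / a });
            all_match = false;
        }
    }
    return all_match;
}
```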
As a note, the networks still seem to make training progress on both the simple animal example and the main MNIST OCR problem regardless of the suspicions below. It would be nice to figure out the root cause though.
### Fishy gradients
Looking at the gradient checks, I see problems ...
When there are 2 labels, the values in the estimated cost gradients for the output layer are all 2x bigger than the cost gradients that we calculate with the actual derivatives (regardless of the number of hidden layers). This happens with both the `SquaredError` and `CrossEntropy` cost functions. It doesn't happen when we use the `Sigmoid` activation function on all of the layers though.

Having all of the values in the cost gradient scaled by some constant is fishy, but the gradient would still point in the same direction, so it seems like it wouldn't hurt training. I'm not sure why I'm seeing the scaled difference in the first place though.
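To make the "same direction" point concrete: a gradient descent step taken with a uniformly 2x-scaled gradient is identical to a step taken with the true gradient and a doubled learning rate, so only the effective step size changes:

```math
w \leftarrow w - \eta \, (2 \nabla C) = w - (2\eta) \, \nabla C
```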
As soon as we add a 3rd label (we don't even add any data points with that 3rd label), which means 3 nodes in the output layer, the ratios between the estimated and actual cost gradients are no longer equal across the gradient (they're uneven), which seems like it would cause problems.
**How do we know that `estimateCostGradientsForLayer(...)` is giving correct values?**

The implementation of `estimateCostGradientsForLayer(...)` could be wrong. That should be the first place to look, but it uses a pretty simple concept, the central difference `(f(x + h) - f(x - h)) / 2h`, and it only relies on the cost function, which is a lot fewer moving pieces than our cost gradient calculation that uses all of the derivatives.
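Since the estimator is only a few lines, it's easy to reason about in isolation. Here's a self-contained sketch of the central difference concept (the `cost` function below is an arbitrary stand-in, not the network's actual cost function):

```zig
const std = @import("std");

// Arbitrary smooth stand-in for the real cost function; any scalar
// function of the parameters works to demonstrate the estimator.
fn cost(params: []const f64) f64 {
    return params[0] * params[0] + 3.0 * params[1];
}

/// Estimate dCost/dParam for every parameter with the central
/// difference (f(x + h) - f(x - h)) / 2h. Each parameter is nudged
/// and then restored before moving on to the next one.
fn estimateGradient(params: []f64, out_gradient: []f64) void {
    const h = 0.0001;
    for (params, 0..) |original_value, i| {
        params[i] = original_value + h;
        const cost_plus = cost(params);
        params[i] = original_value - h;
        const cost_minus = cost(params);
        params[i] = original_value;
        out_gradient[i] = (cost_plus - cost_minus) / (2.0 * h);
    }
}

pub fn main() void {
    var params = [_]f64{ 1.5, -2.0 };
    var gradient = [_]f64{ 0.0, 0.0 };
    estimateGradient(&params, &gradient);
    // Analytically: dCost/dp0 = 2 * 1.5 = 3.0 and dCost/dp1 = 3.0,
    // so both estimates should come out very close to 3.0.
    std.debug.print("estimated gradient: {d} {d}\n", .{ gradient[0], gradient[1] });
}
```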