Playing around with this more, it seems like this is a side-effect of using the `SoftMax` activation function, as we only see the cost gradient mismatch when using `SoftMax` as the activation function on the output layer :thinking:

Is the `SoftMax` implementation correct? It passes our slope check tests, so the function and the derivative are at least correlated.
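For reference, a slope check is the same idea applied to a single activation function in isolation: nudge the input and confirm the implemented derivative matches the numerical slope of the function. Here's a minimal sketch of that kind of test, using a scalar `Sigmoid` as the example since its derivative is easy to verify by hand (the function names here are illustrative, not the repo's actual API):

```zig
const std = @import("std");

fn sigmoid(x: f64) f64 {
    return 1.0 / (1.0 + @exp(-x));
}

fn sigmoidDerivative(x: f64) f64 {
    const s = sigmoid(x);
    return s * (1.0 - s);
}

test "sigmoid derivative matches numerical slope" {
    const h = 0.0001;
    var x: f64 = -3.0;
    while (x <= 3.0) : (x += 0.25) {
        // Numerical slope via central difference.
        const slope = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h);
        try std.testing.expectApproxEqAbs(slope, sigmoidDerivative(x), 1e-6);
    }
}
```

A passing slope check like this only shows the function and its derivative are consistent at the sampled points, which matches the "at least correlated" caveat above.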
Fixed by https://github.com/MadLittleMods/zig-ocr-neural-network/pull/20 (see the PR description for more information on what was wrong)
(first explored in https://github.com/MadLittleMods/zig-ocr-neural-network/pull/1)
We can use `estimateCostGradientsForLayer(...)`, which closely estimates the cost gradient (the numerical gradient), and compare it against what we actually calculate to be the cost gradient (the analytical gradient). They should match! This is called "gradient checking."

To turn on gradient checking, set `should_gradient_check = true;` and adjust the `test_layer` in `sanityCheckGradients(...)` to whichever layer you want to compare.

https://github.com/MadLittleMods/zig-ocr-neural-network/blob/7d865616b12a76ebfb8385815312915e80b82c0a/src/neural_networks/neural_networks.zig#L218-L222
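For illustration, the comparison step itself can be as simple as a per-weight relative error between the two gradients. This `checkGradients` helper is a hypothetical sketch of the concept, not the repo's actual implementation:

```zig
const std = @import("std");

fn absF(x: f64) f64 {
    return if (x < 0) -x else x;
}

/// Compare the analytical gradient (from our derivative calculations)
/// against the numerical gradient (from finite differences). A relative
/// error near zero means they agree; a constant ratio like 2.0 across
/// every element is exactly the kind of smell described below.
fn checkGradients(analytical: []const f64, numerical: []const f64) bool {
    std.debug.assert(analytical.len == numerical.len);
    var all_match = true;
    for (analytical, numerical) |a, n| {
        // Relative error: |a - n| / max(|a|, |n|), guarding against 0/0.
        const denominator = @max(absF(a), absF(n));
        const relative_error = if (denominator == 0.0) 0.0 else absF(a - n) / denominator;
        if (relative_error > 1e-4) {
            std.debug.print("mismatch: analytical={d} numerical={d} (ratio {d})\n", .{ a, n, n / a });
            all_match = false;
        }
    }
    return all_match;
}
```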
As a note, the networks still seem to make training progress on both the simple animal example and the main MNIST OCR problem regardless of the suspicions below. It would be nice to figure out the root cause though.
### Fishy gradients
Looking at the gradient checks, I see problems ...
When there are 2 labels, the values in the estimated cost gradients for the output layer are all 2x bigger than the cost gradients that we calculate with the actual derivatives (regardless of the number of hidden layers). This happens with both the `SquaredError` and `CrossEntropy` cost functions. It doesn't happen when we use the `Sigmoid` activation function on all of the layers though.

Having all of the values in the cost gradient scaled by some constant is fishy, but the gradient would still point in the same direction, so it seems like it wouldn't hurt training. I'm not sure why I'm seeing the scaled difference in the first place though.
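To make the "same direction" point concrete: a gradient descent step taken with a uniformly 2x-scaled gradient is identical to a step taken with the true gradient and a doubled learning rate, so only the effective step size changes:

```math
w \leftarrow w - \eta \, (2 \nabla C) = w - (2\eta) \, \nabla C
```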
As soon as we add a 3rd label (we don't even add any data points with that 3rd label), which means 3 nodes in the output layer, the ratios between the estimated and actual cost gradients are no longer equal across the gradient (they're uneven), which seems like it would cause problems.
**How do we know that `estimateCostGradientsForLayer(...)` is giving correct values?**

The implementation of `estimateCostGradientsForLayer(...)` could be wrong. That should be the first place to look, but it uses a pretty simple concept, the central difference `(f(x + h) - f(x - h)) / 2h`, and it only relies on the cost function, which is a lot fewer moving pieces than our cost gradient calculation that uses all of the derivatives.
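Since the estimator is only a few lines, it's easy to reason about in isolation. Here's a self-contained sketch of the central difference concept (the `cost` function below is an arbitrary stand-in, not the network's actual cost function):

```zig
const std = @import("std");

// Arbitrary smooth stand-in for the real cost function; any scalar
// function of the parameters works to demonstrate the estimator.
fn cost(params: []const f64) f64 {
    return params[0] * params[0] + 3.0 * params[1];
}

/// Estimate dCost/dParam for every parameter with the central
/// difference (f(x + h) - f(x - h)) / 2h. Each parameter is nudged
/// and then restored before moving on to the next one.
fn estimateGradient(params: []f64, out_gradient: []f64) void {
    const h = 0.0001;
    for (params, 0..) |original_value, i| {
        params[i] = original_value + h;
        const cost_plus = cost(params);
        params[i] = original_value - h;
        const cost_minus = cost(params);
        params[i] = original_value;
        out_gradient[i] = (cost_plus - cost_minus) / (2.0 * h);
    }
}

pub fn main() void {
    var params = [_]f64{ 1.5, -2.0 };
    var gradient = [_]f64{ 0.0, 0.0 };
    estimateGradient(&params, &gradient);
    // Analytically: dCost/dp0 = 2 * 1.5 = 3.0 and dCost/dp1 = 3.0,
    // so both estimates should come out very close to 3.0.
    std.debug.print("estimated gradient: {d} {d}\n", .{ gradient[0], gradient[1] });
}
```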