Open · StefanCardnell opened this issue 5 years ago
Hi @StefanCardnell, thanks for your very kind words, I'm really glad you enjoy my book! :) Your code looks good, but there might be a couple of issues:

First, you may be using the `softmax()` function from `scipy.special`. I had to replace `softmax(logits)` with `softmax(logits, axis=1)`, or else it would perform the softmax across the whole array instead of applying it independently to each instance. Perhaps that's the issue you're running into? It would be great if you could copy/paste the full code, including how you load the data and how you define the `softmax()` function (ideally, make a runnable example in a Colab and share the link to it).
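To illustrate the axis issue, here is a minimal sketch (the `logits` values are made up for the example):

```python
import numpy as np
from scipy.special import softmax

# Two instances, three classes each.
logits = np.array([[1.0, 2.0, 3.0],
                   [1.0, 2.0, 3.0]])

# Without axis, SciPy normalizes over the *whole* array:
print(softmax(logits).sum())                # 1.0 -- all six entries sum to 1

# With axis=1, each row (instance) is normalized independently:
print(softmax(logits, axis=1).sum(axis=1))  # [1. 1.]
```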
Second, your `eta` may be too high; try turning it down to 0.01.

Third, you can save the best `Theta` seen so far and roll back to it after the validation loss has not improved for a long time (see the sketch at the end of this comment).

Regarding l2 regularization, it can make the model generalize better (depending on the regularization hyperparameter). When that is the case, training may take a bit longer (for a fixed learning rate) and reach a lower validation loss, so it should take a bit longer before early stopping.
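Here is a minimal sketch of the roll-back pattern, on a toy linear model with made-up data (the names `valid_loss`, `patience`, etc. are just for illustration, not the book's code):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy setup: synthetic data and a linear model with squared loss,
# just to demonstrate the "save the best Theta and roll back" pattern.
X_train, y_train = rng.normal(size=(100, 3)), rng.normal(size=100)
X_valid, y_valid = rng.normal(size=(30, 3)), rng.normal(size=30)

def valid_loss(theta):
    return np.mean((X_valid @ theta - y_valid) ** 2)

eta = 0.01
patience = 50                      # epochs to wait without improvement
theta = np.zeros(3)
best_theta, best_loss = theta.copy(), np.inf
epochs_without_improvement = 0

for epoch in range(10_000):
    gradients = 2 / len(X_train) * X_train.T @ (X_train @ theta - y_train)
    theta -= eta * gradients
    loss = valid_loss(theta)
    if loss < best_loss:
        best_loss, best_theta = loss, theta.copy()
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            theta = best_theta     # roll back to the best model seen
            break
```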
Hope this helps!
Hi. Handson-ml is a fantastic book and I'm enjoying it so far.
I was having a crack at Exercise 12 of Chapter 4 and implemented my own BGD for Softmax Regression with early stopping. I took it quite literally and applied only early stopping, with no l_2 regularisation, in contrast to the online answer, which does use it. I was going a bit crazy, though, because it never seemed to perform an early stop, even after 50,000 iterations with an eta of 0.1, which suggests the validation error is always getting better.
I checked a hundred times and couldn't see an issue with my own implementation. As a final check, I took the answer from this repo but removed the l_2 regularization.
I was thankful to find that this also never performed an early stop! I'm certainly more confused now, though. Typically, l_2 regularization is used to help prevent overfitting to the training data, and I would expect it to help reduce the validation error further.
Without l_2, I was expecting that the gradient vector calculation at each iteration would "overfit" the training data, at which point the validation error would rise and early stopping would occur... but that never happened! Is there any insight into why? Or am I slightly confused about something here?
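For concreteness, my training loop looks roughly like this simplified sketch (toy random data standing in for the real dataset; the variable names are for illustration, not my exact code):

```python
import numpy as np

def softmax(logits):
    # Row-wise softmax; subtract the max for numerical stability.
    exps = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exps / exps.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_classes = 3
X_train = rng.normal(size=(90, 5)); y_train = rng.integers(n_classes, size=90)
X_valid = rng.normal(size=(30, 5)); y_valid = rng.integers(n_classes, size=30)

Y_train = np.eye(n_classes)[y_train]        # one-hot targets
Theta = np.zeros((X_train.shape[1], n_classes))
eta, best_loss = 0.1, np.inf

for iteration in range(50_000):
    # Batch gradient of the cross-entropy loss (no l_2 term).
    P = softmax(X_train @ Theta)
    gradients = X_train.T @ (P - Y_train) / len(X_train)
    Theta -= eta * gradients

    # Early stopping: stop as soon as the validation loss rises.
    P_valid = softmax(X_valid @ Theta)
    loss = -np.mean(np.log(P_valid[np.arange(len(y_valid)), y_valid] + 1e-12))
    if loss < best_loss:
        best_loss = loss
    else:
        print("early stopping at iteration", iteration)
        break
```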