ageron / handson-ml

⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.

Chapter 4 Exercise 12 - Early stopping without l2 regularisation never stops. #518

Open StefanCardnell opened 5 years ago

StefanCardnell commented 5 years ago

Hi. Handson-ml is a fantastic book and I'm enjoying it so far.

I was having a crack at Exercise 12 of Chapter 4 and implemented my own Batch Gradient Descent for Softmax Regression with early stopping. I took the exercise quite literally and applied only early stopping, with no l2 regularisation, in contrast to the online answer, which does use it. I was going a bit crazy, though, because it never seemed to perform an early stop, even after 50000 iterations with an eta of 0.1, which suggests the validation error just keeps getting better.

I checked a hundred times and couldn't see an issue with my own implementation. As a final check, I took the answer from this repo but removed the l2 regularization:

# assumes the notebook's setup: numpy imported as np, the softmax() helper,
# and the Iris train/validation splits with their one-hot targets
eta = 0.1
n_iterations = 50000
m = len(X_train)
epsilon = 1e-7
alpha = 0.1  # regularization hyperparameter (unused here, since the l2 term was removed)
best_loss = np.inf

Theta = np.random.randn(n_inputs, n_outputs)

for iteration in range(n_iterations):
    # one batch gradient descent step on the full training set
    logits = X_train.dot(Theta)
    Y_proba = softmax(logits)
    error = Y_proba - Y_train_one_hot
    gradients = 1/m * X_train.T.dot(error)
    Theta = Theta - eta * gradients

    # measure the cross-entropy loss on the validation set
    logits = X_valid.dot(Theta)
    Y_proba = softmax(logits)
    loss = -np.mean(np.sum(Y_valid_one_hot * np.log(Y_proba + epsilon), axis=1))
    if iteration % 500 == 0:
        print(iteration, loss)
    if loss <= best_loss:
        best_loss = loss
    else:
        # the validation loss just went up: report the previous (best) iteration and stop
        print(iteration - 1, best_loss)
        print(iteration, loss, "early stopping!")
        break

I was thankful to find that this also never performed an early stop! I'm certainly more confused, though. Typically l2 regularization is used to help prevent overfitting to the training data, and I would expect it to help reduce the validation error further.

Without l2, I was expecting that the gradient updates at each iteration would eventually "overfit" the training data, at which point the validation error would rise and early stopping would occur... but that never happened! Is there any insight into why? Or am I slightly confused about something here?
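
For anyone trying to reproduce this, the same loop can be instrumented to print the training loss next to the validation loss, which makes it easy to check whether the training data is ever actually being overfit. This is just a sketch that assumes the same variables and softmax() helper as the snippet above:

for iteration in range(n_iterations):
    logits = X_train.dot(Theta)
    Y_proba = softmax(logits)
    error = Y_proba - Y_train_one_hot
    gradients = 1/m * X_train.T.dot(error)
    Theta = Theta - eta * gradients

    # compare the training loss and the validation loss side by side
    train_loss = -np.mean(np.sum(Y_train_one_hot * np.log(softmax(X_train.dot(Theta)) + epsilon), axis=1))
    valid_loss = -np.mean(np.sum(Y_valid_one_hot * np.log(softmax(X_valid.dot(Theta)) + epsilon), axis=1))
    if iteration % 500 == 0:
        print(iteration, "train:", train_loss, "valid:", valid_loss)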

ageron commented 5 years ago

Hi @StefanCardnell, thanks for your very kind words, I'm really glad you enjoy my book! :) Your code looks good, but there might be a couple of issues:

Regarding l2 regularization, it can make the model generalize better (depending on the regularization hyperparameter). When that is the case, training may take a bit longer (for a fixed learning rate) and reach a lower validation loss, so it should take a bit longer before early stopping kicks in.
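
For reference, here is a minimal, self-contained sketch of what the loop looks like with the l2 term put back in. The toy data, variable names and exact form of the penalty are illustrative assumptions rather than the notebook's exact code; the key differences from the snippet above are the extra term in the gradient and in the validation loss:

# Minimal sketch: softmax BGD with an l2 penalty and the same early-stopping
# test as above. The toy data below is random and purely illustrative; in the
# exercise X_train/X_valid would be the Iris splits with a bias column.
import numpy as np

rng = np.random.default_rng(42)
n_inputs, n_outputs = 3, 3              # bias + 2 features, 3 classes
m, m_valid = 100, 30

X_train = np.c_[np.ones(m), rng.normal(size=(m, n_inputs - 1))]
X_valid = np.c_[np.ones(m_valid), rng.normal(size=(m_valid, n_inputs - 1))]
Y_train_one_hot = np.eye(n_outputs)[rng.integers(0, n_outputs, m)]
Y_valid_one_hot = np.eye(n_outputs)[rng.integers(0, n_outputs, m_valid)]

def softmax(logits):
    exps = np.exp(logits - logits.max(axis=1, keepdims=True))  # numerically stable
    return exps / exps.sum(axis=1, keepdims=True)

eta = 0.1
n_iterations = 50000
epsilon = 1e-7
alpha = 0.1                             # l2 regularization hyperparameter
best_loss = np.inf
Theta = rng.normal(size=(n_inputs, n_outputs))

for iteration in range(n_iterations):
    error = softmax(X_train.dot(Theta)) - Y_train_one_hot
    # cross-entropy gradient plus the l2 term (the bias row Theta[0] is not regularized)
    gradients = 1/m * X_train.T.dot(error) + np.r_[np.zeros([1, n_outputs]), alpha * Theta[1:]]
    Theta = Theta - eta * gradients

    # validation loss = cross-entropy + l2 penalty
    Y_proba = softmax(X_valid.dot(Theta))
    xentropy = -np.mean(np.sum(Y_valid_one_hot * np.log(Y_proba + epsilon), axis=1))
    loss = xentropy + alpha * 0.5 * np.sum(np.square(Theta[1:]))
    if loss < best_loss:
        best_loss = loss
    else:
        print(iteration, loss, "early stopping!")
        break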

Hope this helps!