ageron / handson-ml

⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.

Chapter 4 Exercise 12 - Early stopping without l2 regularisation never stops. #518

Open StefanCardnell opened 5 years ago

StefanCardnell commented 5 years ago

Hi. Handson-ml is a fantastic book and I'm enjoying it so far.

I was having a crack at Exercise 12 of Chapter 4 and implemented my own Batch Gradient Descent for Softmax Regression with early stopping. I took the exercise quite literally and applied only early stopping, with no l2 regularisation, in contrast to the online answer, which does use it. I was going a bit crazy, though, because it never seemed to perform an early stop, even after 50000 iterations with an eta of 0.1, which suggests the validation error just keeps getting better.

I checked a hundred times and couldn't see an issue with my own implementation. As a final check, I took the answer from this repo but removed the l2 regularization:

# assumes the notebook's setup: numpy imported as np, the softmax() helper,
# and the Iris train/validation splits with their one-hot targets
eta = 0.1
n_iterations = 50000
m = len(X_train)
epsilon = 1e-7
alpha = 0.1  # regularization hyperparameter (unused here, since the l2 term was removed)
best_loss = np.inf

Theta = np.random.randn(n_inputs, n_outputs)

for iteration in range(n_iterations):
    # one batch gradient descent step on the full training set
    logits = X_train.dot(Theta)
    Y_proba = softmax(logits)
    error = Y_proba - Y_train_one_hot
    gradients = 1/m * X_train.T.dot(error)
    Theta = Theta - eta * gradients

    # measure the cross-entropy loss on the validation set
    logits = X_valid.dot(Theta)
    Y_proba = softmax(logits)
    loss = -np.mean(np.sum(Y_valid_one_hot * np.log(Y_proba + epsilon), axis=1))
    if iteration % 500 == 0:
        print(iteration, loss)
    if loss <= best_loss:
        best_loss = loss
    else:
        # the validation loss just went up: report the previous (best) iteration and stop
        print(iteration - 1, best_loss)
        print(iteration, loss, "early stopping!")
        break

I was thankful to find that this also never performed an early stop! I'm certainly more confused, though. Typically l2 regularization is used to help prevent overfitting to the training data, and I would expect it to help reduce the validation error further.

Without l2, I was expecting that the gradient updates at each iteration would eventually "overfit" the training data, at which point the validation error would rise and early stopping would occur... but that never happened! Is there any insight into why? Or am I slightly confused about something here?
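
For anyone trying to reproduce this, the same loop can be instrumented to print the training loss next to the validation loss, which makes it easy to check whether the training data is ever actually being overfit. This is just a sketch that assumes the same variables and softmax() helper as the snippet above:

for iteration in range(n_iterations):
    logits = X_train.dot(Theta)
    Y_proba = softmax(logits)
    error = Y_proba - Y_train_one_hot
    gradients = 1/m * X_train.T.dot(error)
    Theta = Theta - eta * gradients

    # compare the training loss and the validation loss side by side
    train_loss = -np.mean(np.sum(Y_train_one_hot * np.log(softmax(X_train.dot(Theta)) + epsilon), axis=1))
    valid_loss = -np.mean(np.sum(Y_valid_one_hot * np.log(softmax(X_valid.dot(Theta)) + epsilon), axis=1))
    if iteration % 500 == 0:
        print(iteration, "train:", train_loss, "valid:", valid_loss)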

ageron commented 5 years ago

Hi @StefanCardnell, thanks for your very kind words, I'm really glad you enjoy my book! :) Your code looks good, but there might be a couple of issues:

Regarding l2 regularization, it can make the model generalize better (depending on the regularization hyperparameter). When that is the case, training may take a bit longer (for a fixed learning rate) and reach a lower validation loss, so it should take a bit longer before early stopping kicks in.
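
For reference, here is a minimal, self-contained sketch of what the loop looks like with the l2 term put back in. The toy data, variable names and exact form of the penalty are illustrative assumptions rather than the notebook's exact code; the key differences from the snippet above are the extra term in the gradient and in the validation loss:

# Minimal sketch: softmax BGD with an l2 penalty and the same early-stopping
# test as above. The toy data below is random and purely illustrative; in the
# exercise X_train/X_valid would be the Iris splits with a bias column.
import numpy as np

rng = np.random.default_rng(42)
n_inputs, n_outputs = 3, 3              # bias + 2 features, 3 classes
m, m_valid = 100, 30

X_train = np.c_[np.ones(m), rng.normal(size=(m, n_inputs - 1))]
X_valid = np.c_[np.ones(m_valid), rng.normal(size=(m_valid, n_inputs - 1))]
Y_train_one_hot = np.eye(n_outputs)[rng.integers(0, n_outputs, m)]
Y_valid_one_hot = np.eye(n_outputs)[rng.integers(0, n_outputs, m_valid)]

def softmax(logits):
    exps = np.exp(logits - logits.max(axis=1, keepdims=True))  # numerically stable
    return exps / exps.sum(axis=1, keepdims=True)

eta = 0.1
n_iterations = 50000
epsilon = 1e-7
alpha = 0.1                             # l2 regularization hyperparameter
best_loss = np.inf
Theta = rng.normal(size=(n_inputs, n_outputs))

for iteration in range(n_iterations):
    error = softmax(X_train.dot(Theta)) - Y_train_one_hot
    # cross-entropy gradient plus the l2 term (the bias row Theta[0] is not regularized)
    gradients = 1/m * X_train.T.dot(error) + np.r_[np.zeros([1, n_outputs]), alpha * Theta[1:]]
    Theta = Theta - eta * gradients

    # validation loss = cross-entropy + l2 penalty
    Y_proba = softmax(X_valid.dot(Theta))
    xentropy = -np.mean(np.sum(Y_valid_one_hot * np.log(Y_proba + epsilon), axis=1))
    loss = xentropy + alpha * 0.5 * np.sum(np.square(Theta[1:]))
    if loss < best_loss:
        best_loss = loss
    else:
        print(iteration, loss, "early stopping!")
        break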

Hope this helps!