ageron / handson-ml

⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.

blocked with mini-batch gradient descent #438

Closed benech17 closed 5 years ago

benech17 commented 5 years ago

Hi! I'm a beginner in Machine Learning thanks to your book, and thanks for everything, it's very useful! But I'm blocked with mini-batch gradient descent! In the book there is a function:

def fetch_batch(epoch, batch_index, batch_size):
    [...]  # Load data from the disk (see notebook)
    return X_batch, y_batch

But I didn't find it! Do you have any idea what I should write in this function?

ageron commented 5 years ago

Hi @benech17, thanks for your kind words and for your question. Did you find the answer? Where did you find this code? I can't find it in chapter 4.

The fetch_batch() function should fetch some data from the disk and prepare it for the training algorithm. This is necessary when the data does not fit in memory. If it does fit, then you can simply write something like this:

import numpy as np
import time

# Here, I generate random data, but of course, you should instead load and prepare
# the data you really care about, by following the steps in chapter 2.
X_train, y_train = np.random.rand(2, 1000)

def sample_batch(X, y, batch_size):
    indices = np.random.randint(len(X), size=batch_size)
    return X[indices], y[indices]

n_iterations = 1000
batch_size = 32
for iteration in range(n_iterations):
    X_batch, y_batch = sample_batch(X_train, y_train, batch_size)
    time.sleep(0.01) # instead, perform a training step here
    print("\r{}/{}".format(iteration + 1, n_iterations), end="")

How exactly you load and prepare the data depends on the task.

ageron commented 5 years ago

If the data is huge and does not fit in memory, you probably want to split it into many files, shuffle the files, then read multiple lines from multiple files, shuffle them, preprocess them and batch them. That's not trivial, but fortunately, TensorFlow's Data API makes it fairly easy. Check out chapter 13 in the 2nd edition. The early release is available here (it requires signing up to the Safari platform; they have a free trial if you want). You can also just check out the notebook here.
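
For reference, here is a rough sketch of that kind of Data API pipeline (TF 2.x style). The file pattern, column layout, and parameter values below are placeholder assumptions; adapt them to your own data:

import tensorflow as tf

n_inputs = 8  # assumed: 8 feature columns followed by 1 target column per CSV line

def parse_csv_line(line):
    # Parse one CSV line into a feature vector and a target (all floats assumed).
    defaults = [0.] * (n_inputs + 1)
    fields = tf.io.decode_csv(line, record_defaults=defaults)
    X = tf.stack(fields[:-1])
    y = tf.stack(fields[-1:])
    return X, y

def csv_reader_dataset(file_pattern, batch_size=32, shuffle_buffer=10000):
    filepaths = tf.data.Dataset.list_files(file_pattern, seed=42)  # shuffles the file list
    dataset = filepaths.interleave(
        lambda path: tf.data.TextLineDataset(path).skip(1),  # skip each file's header row
        cycle_length=5)                                       # read 5 files at a time
    dataset = dataset.shuffle(shuffle_buffer).map(parse_csv_line)
    return dataset.batch(batch_size).prefetch(1)

# Hypothetical usage, assuming the training set was split into many CSV files:
# train_set = csv_reader_dataset("datasets/housing/my_train_*.csv")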

Hope this helps.

benech17 commented 5 years ago

This code is in Chapter 2 of the book "Deep Learning with TensorFlow: implementation and concrete cases" (maybe not exactly that title, I'm on the French version, the one with a fish on the cover), page 59 (not the 2nd edition of "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow", obviously). Yes, I closed the issue because I found an answer:

def fetch_batch(epoch, batch_index, batch_size):
    np.random.seed(epoch * n_batches + batch_index)
    indices = np.random.randint(m, size=batch_size)
    x_batch = scaled_housing_data_plus_bias[indices]
    y_batch = housing.target.reshape(-1, 1)[indices]
    return x_batch, y_batch

It worked perfectly: my MSE converges to approximately 0.5244.

Here's my full code for gradient descent using mini-batches (which works far better than the "manual way"), in case you have any suggestions:

import numpy as np
import tensorflow as tf
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import fetch_california_housing

tf.logging.set_verbosity(tf.logging.ERROR)  # silence TensorFlow's info and warning messages

# Loading the data
housing = fetch_california_housing()
m, n = housing.data.shape

# Divide the training set into batches of 100 items
batch_size = 100
n_batches = int(np.ceil(m / batch_size))
n_epochs = 1000
learning_rate = 1e-4

scaler = StandardScaler()
scaled_housing_data = scaler.fit_transform(housing.data)
scaled_housing_data_plus_bias = np.c_[np.ones((m, 1)), scaled_housing_data]

# Function to generate the batches

def fetch_batch(epoch, batch_index, batch_size):
    np.random.seed(epoch * n_batches + batch_index)
    indices = np.random.randint(m, size=batch_size)
    x_batch = scaled_housing_data_plus_bias[indices]
    y_batch = housing.target.reshape(-1, 1)[indices]
    return x_batch, y_batch

# Using Placeholders
X = tf.placeholder(tf.float32, shape=(None, n + 1), name="X")
y = tf.placeholder(tf.float32, shape=(None, 1), name="y")

theta = tf.Variable(tf.random_uniform(
    [n + 1, 1], -1.0, 1.0, seed=42), name="theta")
y_predicted = tf.matmul(X, theta, name="predictions")
error = y_predicted - y

mse = tf.reduce_mean(tf.square(error), name="mse")
# USING THE TF GRADIENT DESCENT OPTIMIZER
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
traninig_op = optimizer.minimize(mse)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        for batch_index in range(n_batches):
            x_batch, y_batch = fetch_batch(epoch, batch_index, batch_size)
            mse_value, _ = sess.run([mse, traninig_op],
                                    feed_dict={X: x_batch, y: y_batch})
        mse_value = sess.run(mse, feed_dict={
            X: scaled_housing_data_plus_bias, y: housing.target.reshape(-1, 1)})
        print(f"Epoch {epoch}, MSE: {mse_value}")
    best_theta = theta.eval()
    print(f"Best theta: {best_theta}")

And my result :

Epoch 991, MSE: 0.5246286988258362
Epoch 992, MSE: 0.5245724320411682
Epoch 993, MSE: 0.5245333909988403
Epoch 994, MSE: 0.5245057344436646
Epoch 995, MSE: 0.5244897603988647
Epoch 996, MSE: 0.5244860053062439
Epoch 997, MSE: 0.5244871973991394
Epoch 998, MSE: 0.5244867205619812
Epoch 999, MSE: 0.5244981646537781
Best theta: [[ 2.0684946 ]
 [ 0.8162446 ]
 [ 0.12004736]
 [-0.23249783]
 [ 0.27134204]
 [-0.00344482]
 [-0.04054145]
 [-0.902139  ]
 [-0.8718968 ]]
vishalML commented 5 years ago

for epoch in range(n_epochs):
    for batch_index in range(n_batches):
        x_batch, y_batch = fetch_batch(epoch, batch_index, batch_size)
        mse_value, _ = sess.run([mse, traninig_op],
                                feed_dict={X: x_batch, y: y_batch})
    mse_value = sess.run(mse, feed_dict={
        X: scaled_housing_data_plus_bias, y: housing.target.reshape(-1, 1)})

Can you tell me why mse_value is calculated twice?

ageron commented 5 years ago

Hi @vishalML, indeed, the line mse_value, _ = sess.run([mse, training_op], ...) could be replaced with _ = sess.run(training_op, ...) since mse_value is not used. However, we could instead add a line just after it to print the training batch error (I also fixed the traninig typo and renamed the mse_value variables to highlight the fact that we're estimating two different errors: the first is the training batch error, and the second is the error on the full training set):

for epoch in range(n_epochs):
    for batch_index in range(n_batches):
        x_batch, y_batch = fetch_batch(epoch, batch_index, batch_size)
        batch_mse_value, _ = sess.run([mse, training_op],
                                      feed_dict={X: x_batch, y: y_batch})
        print("\r", batch_mse_value, end="")
    print()
    full_training_set_mse_value = sess.run(mse, feed_dict={
        X: scaled_housing_data_plus_bias, y: housing.target.reshape(-1, 1)})

vishalML commented 5 years ago

Thanks @ageron. According to the above code, we're optimizing batch_mse_value and not full_training_set_mse_value, so I am getting a little confused here. Also, printing batch_mse_value for every batch will take more and more time. Can you please give a brief explanation of why we're optimizing batch_mse_value and not full_training_set_mse_value, and whether we should print batch_mse_value for every batch? Please excuse me if my doubts seem very basic, but I am really confused.

ageron commented 5 years ago

Hi @vishalML ,

No worries, it's always good to ask, and I'm happy to help. :)

The inner for loop (the one with the batch_index) trains the model one batch at a time for one epoch. We're doing mini-batch gradient descent. At each iteration, we randomly sample a batch of 100 instances from the training set (since batch_size=100) and we run the training_op to perform a gradient descent step, passing TensorFlow the batch of data using the feed_dict. This gradient descent step slightly improves the parameters of the model. The training_op actually depends on computing the gradients of the MSE over the batch, so asking TensorFlow to return the value of this batch MSE does not cost us any measurable performance. However, displaying it at each training step does have a slight cost, as you correctly point out. We could instead display it only every 10 or 100 steps.
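
For example, a minimal sketch of that idea, reusing the names from the snippet above (inside the same tf.Session block) and displaying the batch MSE only every 100 steps, could look like this:

for epoch in range(n_epochs):
    for batch_index in range(n_batches):
        x_batch, y_batch = fetch_batch(epoch, batch_index, batch_size)
        batch_mse_value, _ = sess.run([mse, training_op],
                                      feed_dict={X: x_batch, y: y_batch})
        if batch_index % 100 == 0:  # display the batch MSE only every 100 steps
            print("\r", batch_mse_value, end="")
    print()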

When the for loop is finished, we have gone through roughly the same number of instances as there are in the training set (since batches are sampled randomly, it's possible that some instances were used multiple times, and others not at all, but that's fine). This is the end of one epoch. The outer for loop runs this whole process 1,000 times (since n_epochs=1000). Each time we finish an epoch, we compute the MSE, but this time we compute it over the full training set. Since the model parameters keep improving at each iteration step, the model should get better and better, and we should see the full training set error go down.

In other words: the model only sees 100 instances at a time, at each training step, but this allows it to gradually improve, so when we regularly measure the performance of the model over the full training set, we should see that error go down as well. We're improving the model gradually, so both the batch MSE and the full training set MSE should go down. However, note that the batch error will usually vary much more than the overall training error. It's like measuring the average weight of 5 random people versus the average weight of one million random people: the former will vary much more than the latter.

Oh and I forgot to print full_training_set_mse_value after computing it (there's no point in computing it if we do nothing with it).
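
Concretely, the end of the epoch loop could look like this (just a small sketch based on the snippet above):

    full_training_set_mse_value = sess.run(mse, feed_dict={
        X: scaled_housing_data_plus_bias, y: housing.target.reshape(-1, 1)})
    print("Epoch", epoch, "Full training set MSE:", full_training_set_mse_value)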

Hope this all makes sense.

vishalML commented 5 years ago

Thanks @ageron for the in-depth answer. Your book helped me grasp machine learning algorithms and the scikit-learn library, and I know it will also help me learn deep learning. Waiting for the new edition to be completed, and thank you again.