ageron / handson-ml

⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.
Apache License 2.0

chapter 3: display of digits from MNIST dataset. #374

Open aishwaryashinde6 opened 5 years ago

aishwaryashinde6 commented 5 years ago
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

def plot_digits(instances, images_per_row=10, **options):
    size = 28  # each MNIST image is 28x28 pixels
    images_per_row = min(len(instances), images_per_row)
    # Reshape each flat 784-pixel vector into a 28x28 image
    images = [instance.reshape(size, size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    # Pad the last row with blank space so every row has the same width
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))  # one wide row image
    image = np.concatenate(row_images, axis=0)  # stack the rows vertically
    plt.imshow(image, cmap=mpl.cm.binary, **options)
    plt.axis("off")

I can't understand this code. It would be a great help if you could simplify it.

ageron commented 5 years ago

Hi @aishwaryashinde6 ,

I did not really intend this code to be exposed; it's really just there to generate the figures in the book. What it does is generate and display a single image representing a grid of digit images. If you give it 23 images (indexed from 0 to 22) and ask for 10 images per row, the final image will contain 3 rows of digits (first row = images 0 to 9, second row = images 10 to 19, and third row = images 20 to 22), and since the final row is shorter than the others, the function adds 7 empty images (which is why the function appends np.zeros() at one point). That's about all there is to know; the rest should be fairly self-explanatory.

It would probably be simpler to use a grid of subplots, but I think I ran into some issues with the spacing between the subplots, or something like that.
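For reference, a subplots-based variant might look like the sketch below (a hypothetical helper, not the book's code; the spacing issues mentioned above can usually be tamed with subplots_adjust):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch
import matplotlib.pyplot as plt

def plot_digits_subplots(instances, images_per_row=10):
    """Hypothetical subplots-based variant of plot_digits()."""
    size = 28  # each MNIST image is 28x28 pixels
    images_per_row = min(len(instances), images_per_row)
    n_rows = (len(instances) - 1) // images_per_row + 1
    fig, axes = plt.subplots(n_rows, images_per_row, squeeze=False)
    for ax in axes.ravel():
        ax.axis("off")  # hide axes, even for cells left empty
    for i, instance in enumerate(instances):
        axes[i // images_per_row][i % images_per_row].imshow(
            instance.reshape(size, size), cmap="binary")
    fig.subplots_adjust(wspace=0, hspace=0)  # shrink gaps between digits
    return fig, axes
```

The trade-off is one Axes object per digit (slower for large grids) versus the single concatenated image used in the book's version.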

Hope this helps

Rahul954 commented 5 years ago

Hello Aishwarya,

Thanks for raising this.

@ageron - I do value your point that this code wasn't meant to be exposed. But what I like about your book is that it's a kind of bible: every word of it is something I want to learn, code, and fit in my mind. I really look forward to the day when I finish this book with everything in my mind.

I also got stuck on this code. Could you please explain it in more detail? I want to understand each line of it, and I also got stuck at "example_images = np.r_[X[:12000:600], X[13000:30600:600], X[30600:60000:590]]".

How did you come up with these values?

I have high hopes for this book and for you. Could you please help by explaining each line of the code, and the line above? I don't want to lose my motivation for learning from this book.

I am really sorry if this adds some work for you.

Regards, Rahul

ageron commented 5 years ago

Hi Rahul!

Thanks for your kind words. :)

The MNIST dataset loaded by the old fetch_mldata() function was sorted by label, so since there are 60,000 images in the training set and 10 classes, the first ~6,000 images were 0s, the next ~6,000 images were 1s, and so on. If there were exactly 6,000 images of each digit, I could have used example_images = X[::600] (this means "take one image every 600 images"), and I would have had exactly 10 images of each digit. Unfortunately, there are not exactly 6,000 images per digit, so I had to manually pick ranges of values that worked well to get exactly 10 digits per class: X[:12000:600] means "take one image out of 600, from the first image to the 11,999th". Then np.r_[a, b, c] means "concatenate a, b and c along the vertical axis". The "r" stands for "row". FYI, you can also use np.c_[a, b, c] to concatenate along the horizontal axis ("c" stands for "column").
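The slicing and np.r_ concatenation can be seen with toy data (a small array standing in for X):

```python
import numpy as np

# Toy stand-in for the sorted MNIST array X: 10 "images" of 2 pixels each
X = np.arange(20).reshape(10, 2)
a = X[:6:2]    # one row out of every 2, from row 0 up to (not including) row 6
b = X[6:10:2]  # same, from row 6 up to row 10
combined = np.r_[a, b]  # np.r_ stacks them vertically (along the row axis)
# combined contains rows 0, 2, 4, 6, 8 of X
```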

Hope this helps, Aurélien

ps: fetch_mldata() is gone, since mldata.org is dead, so now we have to use fetch_openml(), but it does not sort the images by label anymore. So to ensure that the notebook kept working like before, I had to sort the images by label myself (just after using fetch_openml()).
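The re-sorting idea can be sketched with toy arrays standing in for mnist.data and mnist.target (a minimal sketch, not the exact notebook code):

```python
import numpy as np

# Toy stand-ins for mnist.data (pixel rows) and mnist.target (labels)
data = np.array([[30.0], [10.0], [20.0], [11.0]])
target = np.array([2, 0, 1, 0])

# Sort both arrays by label; a stable sort keeps same-label images in
# their original relative order
order = np.argsort(target, kind="stable")
data, target = data[order], target[order]
# target is now [0, 0, 1, 2] and data rows follow it
```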

KarthikMamudur commented 5 years ago

Hi Aurélien,

Thank you for the amazing book. I am enjoying every page of it. The best part is that the language makes me feel you are talking to me as I read. Thank you! I have a simple question, though. I was trying to reproduce the results in Chapter 3 and ran into an issue with the "fetch_mldata('MNIST Original')" command. Then I saw your note on GitHub about a workaround and could load the data using "mnist = fetch_openml('mnist_784', version=1, cache=True)". But I get slightly different results. Here are my observations. I am assuming this is because of some kind of update to the dataset, but I am not sure and would appreciate your confirmation. I thought it might help others if they face a similar issue.

  1. x[36000] is now digit 9, unlike digit 5 as mentioned in your book
  2. The cross-validation scores are now array([0.96615169, 0.9655 , 0.96094805]), unlike array([0.9502, 0.9656 , 0.96494]) (not very different, though)
  3. The confusion matrix is array([[54073, 505], [ 1643, 3779]]), unlike array([[53272, 1307], [ 1077, 4344]]). I checked my code to the best of my knowledge; if this is solely due to the update in the dataset, please let me know, and kindly educate me if I am missing something. I am attaching my Jupyter notebook herewith.

Thank you in Advance, Regards, Karthik

Chapter_3_homl.ipynb.zip

ageron commented 5 years ago

Hi @KarthikMamudur ,

Thanks a lot for your kind words! :)

Indeed, the fetch_openml() function returns a slightly different version of the MNIST dataset:

import numpy as np

def sort_by_target(mnist):
    reorder_train = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[:60000])]))[:, 1]
    reorder_test = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[60000:])]))[:, 1]
    mnist.data[:60000] = mnist.data[reorder_train]
    mnist.target[:60000] = mnist.target[reorder_train]
    mnist.data[60000:] = mnist.data[reorder_test + 60000]
    mnist.target[60000:] = mnist.target[reorder_test + 60000]

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, cache=True)
mnist.target = mnist.target.astype(np.int8) # fetch_openml() returns targets as strings
sort_by_target(mnist) # fetch_openml() returns an unsorted dataset

This is shown in the Jupyter notebook. I recommend you follow along using the Jupyter notebooks, as they contain a few comments like this, when the behavior of Scikit-Learn (or other libraries) changed since I wrote the book.

Hope this helps, and I hope you keep enjoying the book! Cheers, Aurélien

qy-yang commented 5 years ago

Hi @ageron ,

Thanks for the amazing book. I enjoy reading it very much. For this piece of code, may I ask why you only append a single array of zeros, images.append(np.zeros((size, size * n_empty))), to the list images, yet it does not throw an index-out-of-range error in the loop rimages = images[row * images_per_row : (row + 1) * images_per_row]?

Thank you in advance, Regards, QY

ageron commented 5 years ago

Hi @qy-yang , good question! There's no out-of-range error when you use a slice instead of a specific index:

>>> a=list(range(10))
>>> a
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> a[5:100]
[5, 6, 7, 8, 9]
>>> a[1000:2000]
[]
>>> a[1000]
[...]
IndexError: list index out of range

Hope this helps!

KarthikMamudur commented 5 years ago

Hi Aurélien,

This book is so interesting that I could not even reply back to you in time to thank you. However, I have another question. This one is about the Gradient Boosting material you presented in Chapter 7; I did not see any GitHub page documenting the Chapter 7 material, hence I am posting my question here.

Intro to the Question:

I have trained a Gradient Boosting Regressor(GBR)

"GradientBoostingRegressor(max_depth=2,n_estimators=18,learning_rate=1.591)"

on a certain training set (4 features and 391 instances), and it gave me the best r2_scores compared to other methods. But this GBR falls slightly short of a neural network model that someone else trained on the same dataset. I want to stick with my GBR model because, for me, visualizing the decision tree in .png format helps, compared to a black-box ANN result. I tried to export the tree from the GBR using export_graphviz.

"export_graphviz(grbt.estimators_[0][0], out_file="tree2.dot", feature_names=x_train.columns, filled=True, rounded=True)"

Question 1: I am not sure which of the 18 trained trees gives me a good visual representation of the DecisionTree model. As mentioned, the r2_scores look very good, at 0.91.

Question 2: If I randomly choose a tree number and export it to .png format, the magnitude of the value attribute in each leaf does not seem to make sense; sometimes it is even negative (e.g., -0.002 in the attached tree image). Are these values scaled somehow? Shouldn't they be on the same scale and of the same magnitude as the y_test or y_train values?

I can send you the code if that helps.

Thank you again, Regards, Karthik

[attached image: tree2.png]

ageron commented 5 years ago

Hi @KarthikMamudur , I'm glad you are enjoying the book! :) Regarding Question 1, the GBR model is an ensemble method. Its prediction is the sum of all 18 trees' predictions, so all 18 trees contribute to the quality of the predictions. However, since each tree is trained on the previous trees' remaining errors, the last trees naturally contribute less to the final performance than the first ones. So you probably want to look at the first trees in priority.

Regarding question 2, since each tree corrects the previous trees, the scale of the corrections naturally goes down as you look at the last trees.

For example, suppose the GBR is supposed to predict 100, but the first tree outputs 96, then the second tree will try to predict 4. But suppose it outputs 5, then the next tree will try to predict -1. Suppose it predicts -1.01, then the next tree will have to predict 0.01. As you can see, the scale quickly goes down. Hope this helps!
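The shrinking scale can be made concrete with a toy sketch (pure NumPy, with a constant-factor predictor standing in for each tree; this illustrates the idea, not sklearn's implementation):

```python
import numpy as np

# Toy illustration: each "tree" captures 90% of the current residual,
# so the corrections shrink by a factor of 10 at every step.
y = np.array([100.0, 50.0, 75.0])
pred = np.zeros_like(y)
residual_scales = []
for step in range(3):
    residual = y - pred                           # what the next "tree" must predict
    residual_scales.append(np.abs(residual).mean())
    pred += 0.9 * residual                        # an imperfect fit: 90% of the residual
# residual_scales shrinks ~10x per step: 75.0, 7.5, 0.75
```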

KarthikMamudur commented 5 years ago

Thank you, sir, that helped. I summed up the predicted values from all the estimators of the GBR and it matches GBR.predict(X_new) exactly. But :-) a question again: this works only if the learning rate is 1.0; for other learning rates, the prediction from the GBR is not equal to the sum of the predictions from its estimators (it looks like some other calculation is happening). Please let me know if my understanding is correct.
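For what it's worth, the reconstruction with a learning rate other than 1.0 can be sketched like this (synthetic data standing in for the real training set; the key point is that each tree's output is scaled by the learning rate, and the ensemble starts from an initial constant prediction):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic regression data standing in for the real training set
X, y = make_regression(n_samples=100, n_features=4, random_state=42)
gbr = GradientBoostingRegressor(max_depth=2, n_estimators=18,
                                learning_rate=0.5, random_state=42)
gbr.fit(X, y)

# Reconstruction: initial constant prediction + learning_rate * each tree's output
manual = gbr.init_.predict(X).ravel() + gbr.learning_rate * sum(
    tree.predict(X) for tree in gbr.estimators_.ravel())
# manual should match gbr.predict(X) for the default squared-error loss
```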

Question-2 Can I get one big tree (.png format) for the GBR model (like a tree from RandomForest models)? This helps me travers down the tree given the X and find out the appropriate y_pred. So I can find the y_pred given a X_new even without a computer.

GursimranSe commented 5 years ago

Hello @ageron, thanks for the great ML book. I have a problem understanding this code:

for row in range(n_rows):
    rimages = images[row * images_per_row : (row + 1) * images_per_row]
    row_images.append(np.concatenate(rimages, axis=1))
image = np.concatenate(row_images, axis=0)
plt.imshow(image, cmap = mpl.cm.binary, **options)

Please explain this part of the plot_digits() function.

Thank you for helping me.

ageron commented 5 years ago

Hi @seonpy , It's really not important code at all, it just plots multiple images in a grid. I probably should have used Matplotlib subplots instead, but if I remember correctly it was really slow and I ran into some issues with the padding between the images. So I concatenated multiple images into a single image:

Say you have 7 images, A, B, C, D, E, F, G, and you want to display them on 2 rows. The goal is to produce a single image that looks like this:

ABCD
EFG0

To do this, I first concatenate all the images on a given row along axis 1 (i.e., horizontally), so for example images A, B, C, D are concatenated into a single image that looks like ABCD. Then I concatenate the resulting row images ABCD and EFG0 along axis 0 (i.e., vertically) into the final image:

ABCD
EFG0

The 0 means an image full of zeros. Hope this helps!
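The assembly can be sketched with tiny toy images (2x2 constant blocks standing in for the digits A..G; a minimal illustration, not the book's code):

```python
import numpy as np

# Seven 2x2 "images" (constant blocks 1..7 standing in for A..G)
images = [np.full((2, 2), float(i)) for i in range(1, 8)]
n_rows, images_per_row, size = 2, 4, 2
n_empty = n_rows * images_per_row - len(images)      # 1 blank cell needed
images.append(np.zeros((size, size * n_empty)))      # the "0" image
rows = [np.concatenate(images[r * images_per_row:(r + 1) * images_per_row],
                       axis=1) for r in range(n_rows)]  # ABCD and EFG0
grid = np.concatenate(rows, axis=0)                  # final 4x8 image
```

Slicing past the end of the images list is harmless here, which is why a single appended zeros block (as wide as all the missing cells combined) is enough to complete the last row.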