ageron / handson-ml

⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.

Chapter 6, exercise 8 #406

Open SSoroushm opened 5 years ago

SSoroushm commented 5 years ago

Hi, I am wondering about exercise 8, cell [24]: should the tree classifier predict on X_mini_test, which has 7,900 elements, instead of X_test, as below?

for tree, (X_mini_train, y_mini_train) in zip(forest, mini_sets):
    tree.fit(X_mini_train, y_mini_train)
    y_pred = tree.predict(X_mini_test)
    accuracy_scores.append(accuracy_score(y_mini_test, y_pred))

If so, then the mini test sets should also be collected in the for loop in cell [23], for example:

mini_sets = []
mini_sets_test = []

for mini_train_index, mini_test_index in rs.split(X_train):
    X_mini_train = X_train[mini_train_index]
    y_mini_train = y_train[mini_train_index]
    X_mini_test = X_train[mini_test_index]
    y_mini_test = y_train[mini_test_index]
    mini_sets.append((X_mini_train, y_mini_train))
    mini_sets_test.append((X_mini_test, y_mini_test))

Otherwise, predicting on X_test in cell [25] is redundant, since y_pred was already calculated in cell [24].

Thanks,

Mehdi-Amine commented 5 years ago

I failed to understand the solution to this exercise as well.
The code seems to split X_train, which contains 8,000 instances (not the 10,000 required by the exercise), into 1,000 subsets, each paired with a test part containing 7,900 instances.

Further explanation is greatly appreciated.

Thank you

ageron commented 5 years ago

Hi @Mehdi-Amine ,

Thanks for your questions. I double-checked the exercise and its solution, and it looks fine to me, so let me explain. We want to train 1,000 trees on various subsets of the training set. The training set contains 8,000 instances, so we will have 1,000 subsets of 100 elements each, randomly sampled from these 8,000 instances. To sample 100 instances from the training set, I proposed using the ShuffleSplit class, but perhaps that was a bad idea because it can be confusing: each split puts 100 instances in the mini training set, and the remaining 7,900 instances end up in a "test" part that is simply ignored, since we don't need it at all.
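
To make that concrete, the split in cell [23] looks roughly like this (a sketch; only the train indices are used, and the test indices are discarded):

from sklearn.model_selection import ShuffleSplit

n_trees = 1000
n_instances = 100

mini_sets = []

# Each of the 1,000 splits puts 100 instances in the "train" part;
# the other 7,900 land in the "test" part, which we never use.
rs = ShuffleSplit(n_splits=n_trees, test_size=len(X_train) - n_instances,
                  random_state=42)
for mini_train_index, mini_test_index in rs.split(X_train):
    X_mini_train = X_train[mini_train_index]
    y_mini_train = y_train[mini_train_index]
    mini_sets.append((X_mini_train, y_mini_train))

Maybe a better idea would have been to use np.random.permutation():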

n_trees = 1000
n_instances = 100

mini_sets = []

for _ in range(n_trees):
    # Pick 100 distinct indices at random from the 8,000 training instances
    mini_train_index = np.random.permutation(len(X_train))[:n_instances]
    X_mini_train = X_train[mini_train_index]
    y_mini_train = y_train[mini_train_index]
    mini_sets.append((X_mini_train, y_mini_train))

Note that np.random.permutation() still samples without replacement within each mini-set, just like ShuffleSplit: mini_train_index cannot contain the same index more than once, so a mini training set never contains duplicates (although the same instance may of course appear in several different mini-sets). If you wanted sampling with replacement instead, you could draw the indices with np.random.randint(); then a mini-set could contain duplicates, but with 100 draws out of 8,000 indices the probability is fairly low, and it doesn't matter much anyway.
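
For completeness, here is a minimal sketch of that with-replacement variant (it reuses n_instances and X_train from above):

# Sampling WITH replacement: each of the 100 indices is drawn
# independently, so the same training instance may appear twice.
mini_train_index = np.random.randint(len(X_train), size=n_instances)
X_mini_train = X_train[mini_train_index]
y_mini_train = y_train[mini_train_index]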

Moreover, as you noted, there is a redundancy: y_pred is computed for each tree both in cell [24] and in cell [25]. It's so fast that it doesn't matter much, but you are correct, and the code could definitely be improved, e.g., by fitting each tree and computing its test-set predictions just once, as sketched below.
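
Here is one possible way to merge cells [24] and [25] (a sketch only; it assumes forest is the list of 1,000 cloned trees and mini_sets is built as above):

import numpy as np
from scipy.stats import mode
from sklearn.metrics import accuracy_score

# Predict on the full test set once per tree, keeping the predictions
# so they can be reused for the majority vote.
Y_pred = np.empty([n_trees, len(X_test)], dtype=np.uint8)
accuracy_scores = []

for tree_index, (tree, (X_mini_train, y_mini_train)) in enumerate(
        zip(forest, mini_sets)):
    tree.fit(X_mini_train, y_mini_train)
    Y_pred[tree_index] = tree.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, Y_pred[tree_index]))

# Majority vote across the 1,000 trees for each test instance
y_pred_majority_votes, n_votes = mode(Y_pred, axis=0)
accuracy_score(y_test, y_pred_majority_votes.reshape(-1))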

Hope this helps.