Open SSoroushm opened 5 years ago
I failed to understand the solution to this exercise as well.
The code seems to split X_train, which contains 8000 instances (and not the 10000 required by the exercise), into 1000 subsets and a test-set containing 7900 instances.
Further explanation is greatly appreciated.
Thank you
Hi @Mehdi-Amine ,
Thanks for your questions. I double checked the exercise and its solution, and it looks fine to me, so let me explain. We want to train 1,000 trees on various subsets of the training set. The training set contains 8,000 instances, so we will have 1,000 subsets of 100 elements each, randomly sampled from these 8,000 instances. To sample 100 instances from the training set, I proposed to use the ShuffleSplit
class, but perhaps that was a bad idea because it can be confusing: it generates a split with 100 instances for the mini-training set, and the remaining 7,900 instances are just ignored, since we don't need them at all. Maybe a better idea would have been to use np.random.permutation()
:
n_trees = 1000
n_instances = 100
mini_sets = []
for _ in range(n_trees):
mini_train_index = np.random.permutation(8000)[:100]
X_mini_train = X_train[mini_train_index]
y_mini_train = y_train[mini_train_index]
mini_sets.append((X_mini_train, y_mini_train))
Although in this case the random sampling would be with replacement, instead of without replacement: i.e., mini_train_index
may contain the same index more than once, so the mini-training set may contain duplicates. The probability is fairly low, and it doesn't matter much anyway.
Moreover, as you noted, there is a redundancy: y_pred
is computed for each tree both in cell [24] and cell [25]. It's so fast that it doesn't matter much, but you are correct about this, and the code could definitely be improved.
Hope this helps.
Hi, I am wondering in exercise 8 ln 24, should the tree classifier predict for
X_mini_test
which has 7900 elements instead ofX_test
as below?If yes, then
mini_sets
for test elements should be also added in the for loop in ln 23 likeOtherwise, predicting X_test in ln 25 is redundant since y_pred was calculated before in ln 24.
Thanks,