According to the Euclidean (L2) distance formula, `d(a, b) = sqrt(sum_p (a_p - b_p)^2)`:
```python
for i in xrange(num_test):
    for j in xrange(num_train):
        #####################################################################
        # TODO:                                                             #
        # Compute the l2 distance between the ith test point and the jth   #
        # training point, and store the result in dists[i, j]. You should  #
        # not use a loop over dimension.                                    #
        #####################################################################
        dists[i, j] = np.sqrt(np.sum((X[i] - self.X_train[j])**2))
        #####################################################################
        #                       END OF YOUR CODE                            #
        #####################################################################
return dists
```
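As a quick sanity check (toy vectors of my own, not the assignment data), the expression inside the loop is just the L2 norm of the difference vector:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
print(np.sqrt(np.sum((a - b)**2)))  # 5.0
print(np.linalg.norm(a - b))        # 5.0, the same L2 distance
```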
Q. What in the data is the cause behind the distinctly bright rows?
Q. What causes the columns?
```python
def predict_labels(self, dists, k=1):
    """
    Given a matrix of distances between test points and training points,
    predict a label for each test point.

    Inputs:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      gives the distance between the ith test point and the jth training point.

    Returns:
    - y: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y[i] is the predicted label for the test point X[i].
    """
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in xrange(num_test):
        # A list of length k storing the labels of the k nearest neighbors to
        # the ith test point.
        closest_y = []
        #########################################################################
        # TODO:                                                                 #
        # Use the distance matrix to find the k nearest neighbors of the ith    #
        # testing point, and use self.y_train to find the labels of these       #
        # neighbors. Store these labels in closest_y.                           #
        # Hint: Look up the function numpy.argsort.                             #
        #########################################################################
        closest_y.append(self.y_train[np.argmin(np.argsort(dists[i]))])
        #########################################################################
        # TODO:                                                                 #
        # Now that you have found the labels of the k nearest neighbors, you    #
        # need to find the most common label in the list closest_y of labels.   #
        # Store this label in y_pred[i]. Break ties by choosing the smaller     #
        # label.                                                                #
        #########################################################################
        y_pred[i] = closest_y[0]
        #########################################################################
        #                           END OF YOUR CODE                            #
        #########################################################################
    return y_pred
```
Got 38 / 500 correct => accuracy: 0.076000
Since something was clearly wrong, I checked my code. I had tried to use argsort because of the hint, but in my approach it didn't seem necessary, so I removed it. (In hindsight, `np.argmin(np.argsort(dists[i]))` doesn't find the nearest neighbor at all: it returns the position of training index 0 in the sorted order, which is why the accuracy was near chance.)
```python
def predict_labels(self, dists, k=1):
    """
    Given a matrix of distances between test points and training points,
    predict a label for each test point.

    Inputs:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      gives the distance between the ith test point and the jth training point.

    Returns:
    - y: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y[i] is the predicted label for the test point X[i].
    """
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in xrange(num_test):
        # A list of length k storing the labels of the k nearest neighbors to
        # the ith test point.
        closest_y = []
        #########################################################################
        # TODO:                                                                 #
        # Use the distance matrix to find the k nearest neighbors of the ith    #
        # testing point, and use self.y_train to find the labels of these       #
        # neighbors. Store these labels in closest_y.                           #
        # Hint: Look up the function numpy.argsort.                             #
        #########################################################################
        closest_y = [self.y_train[np.argmin(dists[i])]]
        #########################################################################
        # TODO:                                                                 #
        # Now that you have found the labels of the k nearest neighbors, you    #
        # need to find the most common label in the list closest_y of labels.   #
        # Store this label in y_pred[i]. Break ties by choosing the smaller     #
        # label.                                                                #
        #########################################################################
        y_pred[i] = np.array(closest_y)
        #########################################################################
        #                           END OF YOUR CODE                            #
        #########################################################################
    return y_pred
```
Got 137 / 500 correct => accuracy: 0.274000
Now the accuracy comes out as expected.
Since `dists[i]` holds the distances between the ith test point and all training points, e.g. `[ 3803.92350081 4210.59603857 5504.0544147 ..., 4007.64756434 4203.28086142 4354.20256764]`, we have to choose the minimum among these values. To this end, I use `argmin`.
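As a tiny illustration (a made-up toy array, not the actual distances), `argmin` returns the index of the smallest entry:

```python
import numpy as np

d = np.array([3803.9, 4210.6, 5504.1])  # toy distances to three training points
print(np.argmin(d))  # 0 -> training point 0 is the nearest neighbor
```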
The next step requires predicting labels with 5-NN, but my approach can't do that. So I googled "how to pick the 5 minimum values with argmin in numpy", and I found this.
Then I realized why the hint suggests using `argsort`. LOL
Instead of argmin,

```python
closest_y = [self.y_train[np.argmin(dists[i])]]
```

I use argsort now:

```python
closest_y = [self.y_train[np.argsort(dists[i])[:k]]]
```
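A toy example (values made up for illustration) of how `np.argsort(...)[:k]` picks the indices of the k nearest training points:

```python
import numpy as np

d = np.array([5.0, 1.0, 4.0, 2.0])  # toy distances to four training points
print(np.argsort(d))                # [1 3 2 0] -> indices sorted by distance
print(np.argsort(d)[:2])            # [1 3] -> the two nearest training points
```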
To turn the k nearest labels into a single k-NN prediction, I googled "most frequent number in numpy array" and found this: https://stackoverflow.com/questions/6252280/find-the-most-frequent-number-in-a-numpy-vector
So I updated my code,
```python
y_pred[i] = np.array(closest_y)
```

to

```python
y_pred[i] = np.argmax(np.bincount(closest_y))
```
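A quick sketch (with made-up neighbor labels) of why `np.bincount` plus `np.argmax` implements the majority vote, and why it breaks ties toward the smaller label as the TODO asks: `argmax` returns the first maximum.

```python
import numpy as np

labels = np.array([2, 1, 2, 1, 3])  # toy labels of the k nearest neighbors
counts = np.bincount(labels)        # [0 2 2 1] -> counts per label value
print(np.argmax(counts))            # 1 -> tie between labels 1 and 2 broken toward the smaller one
```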
But I faced this error:

```
ValueError: object too deep for desired array
```

`bincount` was being handed a list containing an array, i.e. a 2-D object. I removed the outer brackets from `closest_y = [self.y_train[np.argsort(dists[i])[:k]]]`, leaving `closest_y = self.y_train[np.argsort(dists[i])[:k]]`, and it works.
The accuracy increased slightly:
Got 139 / 500 correct => accuracy: 0.278000
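Putting the pieces together, the inner loop now reads like this (my reconstruction of the finished version from the steps above, not a verbatim quote):

```python
for i in xrange(num_test):
    # Labels of the k nearest training points, nearest first.
    closest_y = self.y_train[np.argsort(dists[i])[:k]]
    # Majority vote; argmax breaks ties toward the smaller label.
    y_pred[i] = np.argmax(np.bincount(closest_y))
```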
First try:

```python
dists[i,:] = np.sqrt(np.sum((X[i]-self.X_train)**2))
```

Difference was: 7906696.077041
Uh-oh! The distance matrices are different

Second try:

```python
dists[i,:] = np.sqrt(np.sum((X[i]-self.X_train[:,:])**2))
```

Difference was: 551335735.545133
Uh-oh! The distance matrices are different
Then I started to think. Without an `axis` argument, `np.sum` sums over every element of the difference array, so each `dists[i,:]` gets the same scalar broadcast across the whole row. I googled "l2 distance in one loop" and found this: Numpy Broadcast to perform euclidean distance vectorized.
The only difference from my first try is the `axis` argument:
First try

```python
...
dists[i,:] = np.sqrt(np.sum((X[i]-self.X_train)**2))
...
```

Solution

```python
...
dists[i] = np.sqrt(np.sum((X[i] - self.X_train)**2, axis=1))
...
```
According to the numpy documentation, `axis` is the "Axis or axes along which a sum is performed." I didn't know what that meant, so let's see an example.
```python
>>> np.sum([[0, 1], [0, 5]], axis=0)
array([0, 6])
>>> np.sum([[0, 1], [0, 5]], axis=1)
array([1, 5])
```
Okay, got it! `axis=0` sums down the columns and `axis=1` sums across the rows; if we check the result of each calculation, it makes sense. In our case `axis=1` collapses the pixel dimension of the `(num_train, D)` difference array, leaving one distance per training point.
For `compute_distances_no_loops(self, X)`, I found some blog posts and code but haven't fully understood it yet.
```python
def compute_distances_no_loops(self, X):
    ...
    dists = np.sqrt(np.sum(X**2, axis=1)[:, np.newaxis] + np.sum(self.X_train**2, axis=1) - 2*np.dot(X, self.X_train.T))
    ...
```
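For what it's worth, the identity behind this one-liner is the expansion `||x - y||^2 = ||x||^2 - 2*x.y + ||y||^2` applied to every test/train pair at once; the `[:, np.newaxis]` turns the test norms into a column so broadcasting forms the full `(num_test, num_train)` grid. Here is a small self-contained check of my own (random toy data, my own variable names) that it matches the two-loop computation:

```python
import numpy as np

np.random.seed(0)
X_test = np.random.rand(3, 5)   # 3 toy "test" points with 5 features
X_train = np.random.rand(4, 5)  # 4 toy "training" points

# Vectorized: ||x||^2 + ||y||^2 - 2*x.y for all pairs via broadcasting.
sq = (np.sum(X_test**2, axis=1)[:, np.newaxis]
      + np.sum(X_train**2, axis=1)
      - 2 * np.dot(X_test, X_train.T))
dists_fast = np.sqrt(np.maximum(sq, 0))  # clamp tiny negatives from round-off

# Reference: explicit pairwise computation.
dists_slow = np.array([[np.sqrt(np.sum((t - r)**2)) for r in X_train]
                       for t in X_test])

print(np.allclose(dists_fast, dists_slow))  # True
```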
Moving on to cross-validation, the first TODO is quite simple:
```python
################################################################################
# TODO:                                                                        #
# Split up the training data into folds. After splitting, X_train_folds and   #
# y_train_folds should each be lists of length num_folds, where               #
# y_train_folds[i] is the label vector for the points in X_train_folds[i].    #
# Hint: Look up the numpy array_split function.                               #
################################################################################
X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)
```

`array_split` splits an array into sub-arrays.
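For example (a toy array, not the real training set), `array_split` also handles sizes that don't divide evenly:

```python
>>> import numpy as np
>>> np.array_split(np.arange(10), 3)
[array([0, 1, 2, 3]), array([4, 5, 6]), array([7, 8, 9])]
```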
```python
################################################################################
# TODO:                                                                        #
# Perform k-fold cross validation to find the best value of k. For each       #
# possible value of k, run the k-nearest-neighbor algorithm num_folds times,  #
# where in each case you use all but one of the folds as training data and    #
# the last fold as a validation set. Store the accuracies for all folds and   #
# all values of k in the k_to_accuracies dictionary.                          #
################################################################################
def get_folds_except_one(folds, n):
    # Concatenate every fold except the nth, which is held out for validation.
    return np.concatenate([folds[i] for i in range(len(folds)) if i != n])

for k in k_choices:
    for n in range(num_folds):
        # Train on all folds except fold n.
        current_x_train_set = get_folds_except_one(X_train_folds, n)
        current_y_train_set = get_folds_except_one(y_train_folds, n)
        classifier.train(current_x_train_set, current_y_train_set)
        # Validate on the held-out fold.
        dists = classifier.compute_distances_no_loops(X_train_folds[n])
        y_test_pred = classifier.predict_labels(dists, k=k)
        # Compute the fraction of correctly predicted examples.
        num_correct = np.sum(y_test_pred == y_train_folds[n])
        accuracy = float(num_correct) / y_train_folds[n].shape[0]
        try:
            k_to_accuracies[k].append(accuracy)
        except KeyError:
            k_to_accuracies[k] = [accuracy]
```
I found that I can replace these two lines,

```python
dists = classifier.compute_distances_no_loops(X_train_folds[n])
y_test_pred = classifier.predict_labels(dists, k=k)
```

with this single line:

```python
y_test_pred = classifier.predict(X_train_folds[n], k=k)
```
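After the loop, `k_to_accuracies` maps each k to its per-fold accuracies. One way to pick the best k (a sketch with my own naming and a made-up example dictionary in the shape the loop above produces) is to average over folds:

```python
import numpy as np

# Hypothetical example of what the loop above builds.
k_to_accuracies = {1: [0.26, 0.27, 0.27], 5: [0.28, 0.29, 0.28]}

best_k = max(k_to_accuracies, key=lambda k: np.mean(k_to_accuracies[k]))
print(best_k)  # 5 -> highest mean cross-validation accuracy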
Setting
After running jupyter notebook with knn.ipynb, I faced a matplotlib backend error. I solved this problem by adding `backend: TkAgg` to `~/.matplotlib/matplotlibrc` with the command below, and restarting jupyter notebook:

```
echo "backend: TkAgg" >> ~/.matplotlib/matplotlibrc
```