JinwoongKim / cs231n


assignment1-1 #1

Open JinwoongKim opened 6 years ago

JinwoongKim commented 6 years ago

Setting

After opening knn.ipynb in Jupyter Notebook, I ran into the following error:

[screenshot: matplotlib backend error]

I solved this problem by adding backend: TkAgg to ~/.matplotlib/matplotlibrc with the command below, and restarting Jupyter Notebook.

echo "backend: TkAgg" >> ~/.matplotlib/matplotlibrc
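To confirm the new backend is actually picked up (a quick check of my own; get_backend is standard matplotlib), you can run:

import matplotlib
print(matplotlib.get_backend())  # prints the active backend; 'TkAgg' after the fix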

JinwoongKim commented 6 years ago

def compute_distances_two_loops(self, X):

Following the Euclidean distance formula,

    for i in xrange(num_test):
      for j in xrange(num_train):
        #####################################################################
        # TODO:                                                             #
        # Compute the l2 distance between the ith test point and the jth    #
        # training point, and store the result in dists[i, j]. You should   #
        # not use a loop over dimension.                                    #
        #####################################################################
        dists[i,j] = np.sqrt(np.sum((X[i]-self.X_train[j])**2))
        #####################################################################
        #                       END OF YOUR CODE                            #
        #####################################################################
    return dists
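As a quick sanity check of the formula (toy arrays of my own, not part of the assignment), it matches np.linalg.norm:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# square the element-wise differences, sum them, then take the square root
d = np.sqrt(np.sum((a - b)**2))           # 5.0
print(np.isclose(d, np.linalg.norm(a - b)))  # True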

Q. What in the data is the cause behind the distinctly bright rows?

Q. What causes the columns?

JinwoongKim commented 6 years ago
  def predict_labels(self, dists, k=1):
    """
    Given a matrix of distances between test points and training points,
    predict a label for each test point.

    Inputs:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      gives the distance between the ith test point and the jth training point.

    Returns:
    - y: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y[i] is the predicted label for the test point X[i].
    """
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in xrange(num_test):
      # A list of length k storing the labels of the k nearest neighbors to
      # the ith test point.
      closest_y = []
      #########################################################################
      # TODO:                                                                 #
      # Use the distance matrix to find the k nearest neighbors of the ith    #
      # testing point, and use self.y_train to find the labels of these       #
      # neighbors. Store these labels in closest_y.                           #
      # Hint: Look up the function numpy.argsort.                             #
      closest_y.append(self.y_train[np.argmin(np.argsort(dists[i]))])
      #########################################################################
      #########################################################################
      # TODO:                                                                 #
      # Now that you have found the labels of the k nearest neighbors, you    #
      # need to find the most common label in the list closest_y of labels.   #
      # Store this label in y_pred[i]. Break ties by choosing the smaller     #
      # label.                                                                #
      #########################################################################
      y_pred[i] = closest_y[0]
      #########################################################################
      #                           END OF YOUR CODE                            #
      #########################################################################

    return y_pred

Got 38 / 500 correct => accuracy: 0.076000

Since something was clearly wrong, I checked my code. I had used argsort because of the hint, but it seemed that my approach didn't need it, so I removed it.
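For the record, here is why the first attempt was wrong (toy distances of my own): np.argsort(dists[i]) already gives the neighbor indices sorted by distance, so taking argmin of that array returns the position where training index 0 ranks, which is meaningless.

import numpy as np

d = np.array([3.0, 1.0, 2.0])    # toy distances
print(np.argsort(d))             # [1 2 0]: neighbor indices, nearest first
print(np.argmin(np.argsort(d)))  # 2: position of index 0 in that ranking, not a neighbor
print(np.argmin(d))              # 1: the actual nearest neighbor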

JinwoongKim commented 6 years ago
  def predict_labels(self, dists, k=1):
    """
    Given a matrix of distances between test points and training points,
    predict a label for each test point.

    Inputs:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      gives the distance between the ith test point and the jth training point.

    Returns:
    - y: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y[i] is the predicted label for the test point X[i].
    """
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in xrange(num_test):
      # A list of length k storing the labels of the k nearest neighbors to
      # the ith test point.
      closest_y = []
      #########################################################################
      # TODO:                                                                 #
      # Use the distance matrix to find the k nearest neighbors of the ith    #
      # testing point, and use self.y_train to find the labels of these       #
      # neighbors. Store these labels in closest_y.                           #
      # Hint: Look up the function numpy.argsort.                             #
      closest_y = [self.y_train[np.argmin(dists[i])]]
      #########################################################################
      #########################################################################
      # TODO:                                                                 #
      # Now that you have found the labels of the k nearest neighbors, you    #
      # need to find the most common label in the list closest_y of labels.   #
      # Store this label in y_pred[i]. Break ties by choosing the smaller     #
      # label.                                                                #
      #########################################################################
      y_pred[i] = np.array(closest_y)

      #########################################################################
      #                           END OF YOUR CODE                            #
      #########################################################################

    return y_pred

Got 137 / 500 correct => accuracy: 0.274000

Now the accuracy comes out as expected.

Since dists[i] holds the distances between the ith test point and all the training points, e.g. [ 3803.92350081 4210.59603857 5504.0544147 ..., 4007.64756434 4203.28086142 4354.20256764], we have to choose the minimum among these values. To this end, I used argmin.

JinwoongKim commented 6 years ago

The next step requires me to predict labels with 5-NN, but my approach can't do that. So I googled "how to pick 5 minimum with argmin in numpy", and I found this,

[screenshot: Stack Overflow answer suggesting np.argsort]

Then I realized why they suggest using 'argsort', LOL.

Instead of argmin,

closest_y = [self.y_train[np.argmin(dists[i])]]

I now use argsort:

closest_y = [self.y_train[np.argsort(dists[i])[:k]]]
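A toy example of why this works (my own illustration): argsort returns all indices sorted by distance, and slicing off the first k keeps the k nearest:

import numpy as np

d = np.array([5.0, 1.0, 4.0, 2.0, 3.0])  # toy distances
print(np.argsort(d))      # [1 3 4 2 0]: indices from nearest to farthest
print(np.argsort(d)[:3])  # [1 3 4]: the 3 nearest neighbors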

JinwoongKim commented 6 years ago

To take the majority vote among the k nearest labels, I googled "most frequent number in numpy array" and found https://stackoverflow.com/questions/6252280/find-the-most-frequent-number-in-a-numpy-vector

So, I updated my code from

y_pred[i] = np.array(closest_y)

to

y_pred[i] = np.argmax(np.bincount(closest_y))

But I faced this error:

ValueError: object too deep for desired array

JinwoongKim commented 6 years ago

I removed the outer brackets, turning

closest_y = [self.y_train[np.argsort(dists[i])[:k]]]

into

closest_y = self.y_train[np.argsort(dists[i])[:k]]

and it works. With the extra brackets, closest_y was a list containing a length-k array, so np.bincount received a 2-D input, hence the "object too deep" error.

Accuracy increased slightly:

Got 139 / 500 correct => accuracy: 0.278000
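For reference, here is how the bincount/argmax vote behaves on toy labels (my own example); since np.argmax returns the first maximum, ties already break toward the smaller label, exactly as the TODO asks:

import numpy as np

votes = np.array([2, 5, 2, 5, 1])    # toy labels of the k nearest neighbors
print(np.bincount(votes))            # [0 1 2 0 0 2]: count per label
print(np.argmax(np.bincount(votes))) # 2: most common label, tie broken to the smaller one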

JinwoongKim commented 6 years ago

compute_distances_one_loop

First try:

dists[i,:] = np.sqrt(np.sum((X[i]-self.X_train)**2))

Difference was: 7906696.077041
Uh-oh! The distance matrices are different

Second try:

dists[i,:] = np.sqrt(np.sum((X[i]-self.X_train[:,:])**2))

Difference was: 551335735.545133
Uh-oh! The distance matrices are different

Then I started to think.

JinwoongKim commented 6 years ago

I googled "l2 distance in one loop" and found this, Numpy Broadcast to perform euclidean distance vectorized

The only difference is the axis argument.

First try

...
dists[i,j] = np.sqrt(np.sum((X[i]-self.X_train)**2))
...

Solution

...
dists[i] = np.sqrt(np.sum((X[i] - self.X_train)**2, axis=1))
...

According to the numpy docs, axis is the "Axis or axes along which a sum is performed." I didn't know what they were talking about, so let us see an example.

>>> np.sum([[0, 1], [0, 5]], axis=0)
array([0, 6])
>>> np.sum([[0, 1], [0, 5]], axis=1)
array([1, 5])

Okay, got it !!
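Applied to the one-loop version (shapes are illustrative; CIFAR-10 gives a 5000 x 3072 training matrix here): broadcasting subtracts one test row from every training row, and axis=1 sums over the 3072 features, leaving one squared distance per training point.

import numpy as np

X_train = np.random.randn(5000, 3072)  # stand-in for self.X_train
x = np.random.randn(3072)              # stand-in for X[i]

diff = x - X_train                      # broadcast: (3072,) vs (5000, 3072) -> (5000, 3072)
row = np.sqrt(np.sum(diff**2, axis=1))  # axis=1 sums over features -> shape (5000,)
print(row.shape)                        # (5000,)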

JinwoongKim commented 6 years ago

If we check the result of each calculation, it makes sense.

[screenshots: results of each calculation]

For def compute_distances_no_loops(self, X):, I found some blog posts and code, but I haven't fully understood it yet.

def compute_distances_no_loops(self, X):
...
dists = np.sqrt(np.sum(X**2,axis=1)[:,np.newaxis] + np.sum(self.X_train**2,axis=1) -2*np.dot(X, self.X_train.T))
...
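My own attempt at unpacking it (not from the blogs, just the standard identity): ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a.b. Broadcasting the column of test norms against the row of training norms, then subtracting twice the inner-product matrix np.dot(X, self.X_train.T), computes every pairwise squared distance at once. A small check against an explicit loop:

import numpy as np

X = np.random.randn(4, 6)        # toy test set
X_train = np.random.randn(5, 6)  # toy training set

# ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, broadcast over all test/train pairs
d_fast = np.sqrt(np.sum(X**2, axis=1)[:, np.newaxis]
                 + np.sum(X_train**2, axis=1)
                 - 2 * np.dot(X, X_train.T))

# reference: explicit pairwise distances
d_slow = np.array([[np.linalg.norm(a - b) for b in X_train] for a in X])

print(np.allclose(d_fast, d_slow))  # True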
JinwoongKim commented 6 years ago

Cross-validation

The first one is quite simple:

################################################################################
# TODO:                                                                        #
# Split up the training data into folds. After splitting, X_train_folds and    #
# y_train_folds should each be lists of length num_folds, where                #
# y_train_folds[i] is the label vector for the points in X_train_folds[i].     #
# Hint: Look up the numpy array_split function.                                #
################################################################################

X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)

array_split splits an array into multiple sub-arrays of (nearly) equal size.
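A toy illustration (my own): splitting 10 elements into 3 folds gives sub-arrays of size 4, 3, 3:

import numpy as np

folds = np.array_split(np.arange(10), 3)
print(folds)  # [array([0, 1, 2, 3]), array([4, 5, 6]), array([7, 8, 9])]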
JinwoongKim commented 6 years ago

################################################################################
# TODO:                                                                        #
# Perform k-fold cross validation to find the best value of k. For each        #
# possible value of k, run the k-nearest-neighbor algorithm num_folds times,   #
# where in each case you use all but one of the folds as training data and the #
# last fold as a validation set. Store the accuracies for all fold and all     #
# values of k in the k_to_accuracies dictionary.                               #
################################################################################

def get_folds_except_one(folds, n):
    # concatenate every fold except fold n
    return np.concatenate([folds[i] for i in range(len(folds)) if i != n])

for k in k_choices:
    for n in range(num_folds):
        # train on all folds except fold n
        current_x_train_set = get_folds_except_one(X_train_folds,n)
        current_y_train_set = get_folds_except_one(y_train_folds,n)

        classifier.train(current_x_train_set, current_y_train_set)

        dists = classifier.compute_distances_no_loops(X_train_folds[n])
        y_test_pred = classifier.predict_labels(dists, k=k)

        # Compute and print the fraction of correctly predicted examples
        num_correct = np.sum(y_test_pred == y_train_folds[n])
        accuracy = float(num_correct) / y_train_folds[n].shape[0]

        try:
            k_to_accuracies[k].append(accuracy)
        except KeyError:
            k_to_accuracies[k] = [accuracy]
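As a side note, the try/except at the end can be replaced with dict.setdefault (same behavior, just more compact; my preference, not part of the assignment):

k_to_accuracies.setdefault(k, []).append(accuracy)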
JinwoongKim commented 6 years ago

I found that I can replace these two lines,

dists = classifier.compute_distances_no_loops(X_train_folds[n])
y_test_pred = classifier.predict_labels(dists, k=k)

with this single line

y_test_pred = classifier.predict(X_train_folds[n], k=k)