Pin-Jiun / Machine-Learning-MIT


CH2-Linear classifiers #2

Open Pin-Jiun opened 1 year ago

Pin-Jiun commented 1 year ago

In supervised learning we are given a training data set of the form Dn = {(x^(1), y^(1)), …, (x^(n), y^(n))}, where each x^(i) ∈ R^d and each y^(i) ∈ {−1, +1}.

We’ll often use the letter h (for hypothesis) to stand for a classifier. A hypothesis class H is a set (finite or infinite) of possible classifiers, each of which represents a mapping from R^d → {−1, +1}. A learning algorithm is a procedure that takes a data set Dn as input and returns an element h of H; it looks like

Dn → learning algorithm → h ∈ H

Given a training set Dn and a classifier h, we can define the training error of h to be En(h) = (1/n) · Σ_{i=1}^{n} [1 if h(x^(i)) ≠ y^(i), else 0]

For now, we will try to find a classifier with small training error (later, with some added criteria) and hope it generalizes well to new data, and has a small test error E(h) = (1/n') · Σ_{i=n+1}^{n+n'} [1 if h(x^(i)) ≠ y^(i), else 0], measured on n' new examples that were not used for training.
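
As a concrete illustration of the training error En(h) for the linear classifiers introduced below, here is a minimal numpy sketch; training_error is not part of the course code, and the d by n data / 1 by n labels layout just follows the conventions used later in this thread.

import numpy as np

# 0/1 training error of a linear classifier (th, th0).
# data is d by n, labels is 1 by n with entries in {+1, -1}.
def training_error(data, labels, th, th0):
    predictions = np.sign(th.T @ data + th0)   # 1 by n array of predicted labels
    return np.mean(predictions != labels)      # fraction of points misclassified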

A linear classifier in d dimensions is defined by a vector of parameters θ ∈ R^d and a scalar θ0 ∈ R (a real number). So the hypothesis class H of linear classifiers in d dimensions is the set of all vectors in R^(d+1), since it includes both θ and θ0. We’ll assume that θ is a d × 1 column vector. Given particular values for θ and θ0, the classifier is defined by

h(x; θ, θ0) = sign(θ^T x + θ0) = +1 if θ^T x + θ0 > 0, and −1 otherwise

We can think of θ, θ0 as specifying a hyperplane {x : θ^T x + θ0 = 0}. It divides R^d, the space our x^(i) points live in, into two half-spaces:

the positive half-space {x : θ^T x + θ0 > 0} and the negative half-space {x : θ^T x + θ0 < 0}
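
For example (hypothetical values for θ, θ0 and a point x):

import numpy as np

th = np.array([[1], [1]])         # theta as a 2 by 1 column vector
th0 = -2                          # theta_0
x = np.array([[2], [1]])          # a point in R^2

print(np.sign(th.T @ x + th0))    # [[1]] -> x lies in the positive half-space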

How do we find such a line?

A first idea is simply to pick a separator at random. [figure: a randomly chosen separator]


Test error

How should we evaluate the performance of a classifier h? The best method is to measure test error on data that was not used to train it.

Test error matters more than training error, because our goal is to predict well on future data (the test data), not the past (the training data).

However, evaluating the performance of a learning algorithm is tricky. The measured test error depends on several sources of randomness:

- which particular training examples occurred in Dn
- which particular testing examples occurred in Dn'
- randomization inside the learning algorithm itself (algorithmic differences are discussed later; ignore this for now)

So we usually run the following procedure many times: train on a new training set → evaluate the resulting h on a testing set that does not overlap the training set → train on a new training set → ...

One concern is that we might need a lot of data to do this, and in many applications data is expensive or difficult to acquire. We can re-use data with cross validation (but it’s harder to do theoretical analysis).

[figure: the cross-validation procedure from the course notes]

In machine learning, a data set is usually split into three parts: a training set, a validation set, and a test set. The training set is used to train the model and the test set is used to evaluate its performance, so why do we also need a validation set?

The validation set is used to tune the model's hyperparameters. These are set by hand before training, for example the learning rate, regularization strength, number of layers, and so on. The choice of hyperparameters has a significant effect on performance, so we need to tune them and pick the best combination.

From the procedure above, note that the validation set evaluates the whole learning algorithm (averaging over all the resulting h), not just a single hypothesis h.

Hyperparameter tuning is usually done with cross-validation. Concretely, split the training set into k subsets, train the model on k−1 of them, and use the remaining subset as the validation set for evaluation. Repeat this k times, each time holding out a different subset for validation, and take the average of the k evaluation results as the performance estimate. This helps avoid overfitting (some call it selection bias) and improves the model's ability to generalize.

In short, the validation set is for tuning hyperparameters: with cross-validation we select the best hyperparameter combination, which improves the model's performance and generalization.
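
A minimal sketch of k-fold cross-validation in numpy, assuming data is d by n, labels is 1 by n, and learn is any placeholder procedure that returns a separator (th, th0) from a training split (none of these names come from the course code):

import numpy as np

def cross_validate(data, labels, learn, k=5):
    d, n = data.shape
    chunks = np.array_split(np.arange(n), k)              # k blocks of column indices
    accuracies = []
    for i in range(k):
        val_idx = chunks[i]                               # held-out validation block
        train_idx = np.concatenate([chunks[j] for j in range(k) if j != i])
        th, th0 = learn(data[:, train_idx], labels[:, train_idx])
        preds = np.sign(th.T @ data[:, val_idx] + th0)
        accuracies.append(np.mean(preds == labels[:, val_idx]))
    return np.mean(accuracies)                            # average validation accuracy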

Pin-Jiun commented 1 year ago

A hyperplane is a surface that divides the space into two parts. In two dimensions a hyperplane is a line that splits the plane into two parts; in three dimensions it is a plane that splits the space into two parts.

[figure: a hyperplane through the origin, defined by θ^T x = 0]

Now, we'll consider hyperplanes which do not necessarily go through the origin: {x : θ^T x + θ0 = 0}.

How do we compute the distance from a point to a line? Hint: use the length of a projection. https://www.youtube.com/watch?v=QNbM8kxMvnc&ab_channel=ntsh2102

Distance from a point to a line = (the line's equation evaluated at the point) / √(sum of squared coefficients). The distance should be positive if the origin is on the positive side of the hyperplane, 0 on the hyperplane, and negative otherwise.

So the signed distance from the origin to the hyperplane can be written as θ0 / ||θ||.
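
For a quick hypothetical check: with θ = (3, 4) and θ0 = 5, the signed distance from the origin is 5 / √(3² + 4²) = 5/5 = 1; with θ0 = −5 it would be −1, meaning the origin lies on the negative side of the hyperplane.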

Pin-Jiun commented 1 year ago

Write a procedure that takes a 2D array and returns the final column as a two dimensional array. You may not use a for loop.

import numpy as np
# Takes a 2D matrix; returns its last column as a 2D matrix
def index_final_col(A):
    return A[:, -1:]
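
A quick check with a hypothetical 2 by 3 array:

B = np.array([[1, 2, 3],
              [4, 5, 6]])
print(index_final_col(B))   # [[3]
                            #  [6]]  -- shape (2, 1), still a 2D array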
Pin-Jiun commented 1 year ago

Code for signed distance

import numpy as np
# Signed perpendicular distance from point(s) x to the hyperplane (th, th0)
def signed_dist(x, th, th0):
    return (th.T @ x + th0) / np.linalg.norm(th)
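
A quick check, reusing the hyperplane th = [1, 1], th0 = -2 that appears later in this thread:

th = np.array([[1], [1]])
th0 = -2
x = np.array([[2], [2]])
print(signed_dist(x, th, th0))   # [[1.41421356]] -- x is sqrt(2) away, on the positive side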
Pin-Jiun commented 1 year ago

Write a Python function that takes as input

- a column vector x
- a column vector th that is of the same dimension as x
- a scalar th0

and returns

- +1 if x is on the positive side of the hyperplane encoded by (th, th0)
- 0 if on the hyperplane
- -1 otherwise

import numpy as np
def positive(x, th, th0):
    return np.sign(th.T@x+th0)
Pin-Jiun commented 1 year ago

We define data to be a 2 by 5 array (two rows, five columns) of scalars. It represents 5 data points in two dimensions. We also define labels to be a 1 by 5 array (1 row, five columns) of 1 and -1 values.

data = np.transpose(np.array([[1, 2], [1, 3], [2, 1], [1, -1], [2, -1]]))
labels = rv([-1, -1, +1, +1, +1])
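
Here rv (and cv, used below) are small helpers from the course exercises that build a 1 by n row vector and a d by 1 column vector from a Python list; a minimal sketch consistent with how they are used in this thread:

import numpy as np

def rv(value_list):
    return np.array([value_list])           # list -> 1 by n row vector

def cv(value_list):
    return np.transpose(rv(value_list))     # list -> d by 1 column vector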

For each subproblem, provide a Python expression that sets A to the quantity specified. Note that A should always be a 2D numpy array. Only one relatively short expression is needed for each one. No loops!

A should be a 1 by 5 array of values, either +1, 0 or -1, indicating, for each point in data, whether it is on the positive side of the hyperplane defined by th, th0. Use data, th, th0 as variables in your submission

import numpy as np
A = positive(data, th, th0)

A should be a 1 by 5 array of boolean values, either True or False, indicating for each point in data and corresponding label in labels whether it is correctly classified by hyperplane th = [1, 1], th0 = -2 . That is, return True when the side of the hyperplane that the point is on agrees with the specified label.

import numpy as np
A = (labels == positive(data, cv([1, 1]), -2))

A is a boolean array.
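
For this particular data and labels, the expression evaluates to [[False, False, True, False, False]]: only the third point, (2, 1), lands on the side of the hyperplane that matches its label.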

Pin-Jiun commented 1 year ago

Write a procedure that takes as input

- data: a d by n array of floats (representing n data points in d dimensions)
- labels: a 1 by n array of elements in (+1, -1), representing target labels
- th: a d by 1 array of floats that, together with
- th0: a single scalar or 1 by 1 array, represents a hyperplane

and returns the number of points for which the label is equal to the output of the positive function on the point.

Since numpy treats False as 0 and True as 1, you can take the sum of a collection of Boolean values directly.

import numpy as np
# data is dimension d by n
# labels is dimension 1 by n
# th is dimension d by 1
# th0 is a scalar or a 1 by 1 array
# returns the number of data points correctly classified by the separator (th, th0)
def score(data, labels, th, th0):
    return np.sum(positive(data, th, th0) == labels)
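
For instance, with the data, labels and the hyperplane th = [1, 1], th0 = -2 from the previous comments:

print(score(data, labels, cv([1, 1]), -2))   # 1 -- only the point (2, 1) is classified correctly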
Pin-Jiun commented 1 year ago

Best separator

Now assume that we have some "candidate" classifiers that we want to pick the best one out of.

Write a procedure that takes as input

- data: a d by n array of floats (representing n data points in d dimensions)
- labels: a 1 by n array of elements in (+1, -1), representing target labels
- ths: a d by m array of floats representing m candidate θ's (one candidate per column)
- th0s: a 1 by m array of the corresponding θ0 offsets

and finds the hyperplane with the highest score on the data and labels. In case of a tie, return the first hyperplane with the highest score, in the form of a tuple of a d by 1 array and an offset in the form of 1 by 1 array.

import numpy as np
# data is dimension d by n
# labels is dimension 1 by n
# ths is dimension d by m
# th0s is dimension 1 by m
# return matrix of integers indicating number of data points correct for
# each separator:  dimension m x 1
def score_mat(data, labels, ths, th0s):
    pos = np.sign(np.dot(np.transpose(ths), data) + np.transpose(th0s))
    return np.sum(pos == labels, axis = 1, keepdims = True)

def best_separator(data, labels, ths, th0s):
    best_index = np.argmax(score_mat(data, labels, ths, th0s))
    return cv(ths[:, best_index]), th0s[:, best_index:best_index+1]
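
A quick usage check with the data and labels from above and two hypothetical candidate separators:

ths = np.array([[1, -1],
                [1, -1]])          # candidate thetas as columns: (1, 1) and (-1, -1)
th0s = np.array([[-2, 2]])         # the corresponding offsets
th_best, th0_best = best_separator(data, labels, ths, th0s)
print(th_best.T, th0_best)         # [[-1 -1]] [[2]] -- the second candidate classifies 4 of the 5 points correctly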