Recall that the margin of a given point (x, y) with respect to a separator (θ, θ0) is defined as γ(x, y, θ, θ0) = y(θ·x + θ0) / ∥θ∥.
Based on the previous part, we've decided to try to find a linear separator that maximizes the minimum margin (the distance between the separator and the points that come closest to it).
We define the margin of a data set (X, Y) with respect to a separator as the minimum of the per-point margins: γ(X, Y, θ, θ0) = min_i γ(x^(i), y^(i), θ, θ0).
An approach to this problem is to specify a value γ_ref for the margin of the data set, and then seek to find a linear separator that maximizes γ_ref.
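As a minimal sketch of these two definitions (my own illustration; the points and the separator below are made up):

import numpy as np

def point_margin(x, y, th, th0):
    # Signed distance from x to the separator, signed by the label y.
    return y * (x @ th + th0) / np.linalg.norm(th)

def dataset_margin(X, y, th, th0):
    # Margin of the data set: the smallest per-point margin.
    return min(point_margin(X[i], y[i], th, th0) for i in range(len(y)))

X = np.array([[1.1, 3.1], [1.0, 1.0], [4.0, 2.0]])
y = np.array([1, -1, -1])
print(dataset_margin(X, y, np.array([-1.0, 1.0]), -0.5))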
2B) Suppose for our data set we find that the maximum γ_ref across all linear separators is 0. Is our data linearly separable? Solution: No. Explanation: Intuitively, a maximum γ_ref of 0 says that our best separator is distance 0 from at least two data points having opposite labels. (Otherwise, we could move the separator off both of these points and achieve a positive margin.) Thus these data points, and our data set as a whole, cannot be separated by a hyperplane.
2C) For this subproblem, assume that γ_ref > 0 (i.e., the data is linearly separable). Note that in this case, the Perceptron algorithm is guaranteed to find a separator that correctly classifies all of the data points. What is the largest minimum margin guaranteed by running the Perceptron algorithm on a data set that has a maximum margin equal to γ_ref > 0?
Solution: some ϵ where ϵ > 0. Explanation: Since the maximum minimum margin is equal to some γ_ref > 0, we know that the data is linearly separable. Thus, Perceptron is guaranteed to correctly classify all data points, implying the resulting margin will be > 0. We can also upper-bound the resulting margin by γ_ref, since this is given as the largest possible margin for a linear separator on this data set. However, Perceptron does not necessarily find the separator with the largest margin. Beyond whether the points are classified correctly, Perceptron pays no attention to the margin; that is, Perceptron could converge to a separator that is arbitrarily close to any of the points as long as the points are classified correctly.
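To make that last point concrete, here is a minimal Perceptron sketch (my own illustration, not code from the assignment): it runs passes over the data until a full pass makes no mistakes, so it stops at whatever correct separator it reaches first, whose minimum margin is positive but generally well below the best achievable γ_ref.

import numpy as np

def perceptron(X, y, max_epochs=1000):
    # X: (n, d) points, y: (n,) labels in {+1, -1}.
    n, d = X.shape
    th, th0 = np.zeros(d), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (X[i] @ th + th0) <= 0:   # mistake (or on the boundary)
                th += y[i] * X[i]
                th0 += y[i]
                mistakes += 1
        if mistakes == 0:                        # a full mistake-free pass: stop here
            break
    return th, th0

X = np.array([[1.1, 3.1], [1.0, 1.0], [4.0, 2.0]])
y = np.array([1, -1, -1])
th, th0 = perceptron(X, y)
margins = y * (X @ th + th0) / np.linalg.norm(th)
print(th, th0, margins.min())   # minimum margin is > 0, but not maximized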
A typical form of the optimization problem is to minimize an objective of the form J(θ, θ0) = (1/n) Σ_i L(γ(x^(i), y^(i), θ, θ0) / γ_ref) + λ R(θ, θ0), where L is a loss applied to each point's margin relative to γ_ref and R is a regularizer.
We first consider the objective of finding a maximum-margin separator using the format above, using the so-called "zero-infinity" loss: L_{0,∞}(γ / γ_ref) = 0 if γ / γ_ref ≥ 1, and ∞ otherwise.
For a linearly separable data set, positive λ, and positive R given nonzero θ, which of the following is true about the minimal value of J_{0,∞}?
Solution: It is always finite and positive. Explanation: If the data is linearly separable, there exists a separator that classifies perfectly, so the only term in the objective function is the regularization term, which must be positive and finite, given finite and positive R.
For a non-linearly-separable data set, and positive λ and γ_ref, what is true about the minimal value of J_{0,∞}? Solution: It is infinite. Explanation: Any separator misclassifies at least one point, so that point's margin falls below γ_ref, its zero-infinity loss is infinite, and hence the whole objective is infinite.
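A small sketch of how J_{0,∞} behaves, assuming the definitions filled in above; the helper names and the example separator are mine:

import numpy as np

def margin(x, y, th, th0):
    # Per-point margin: label times signed distance to the separator.
    return y * (x @ th + th0) / np.linalg.norm(th)

def zero_infinity_loss(gamma, gamma_ref):
    # 0 if the point meets the reference margin, infinite otherwise.
    return 0.0 if gamma / gamma_ref >= 1 else np.inf

def J_zero_infinity(X, y, th, th0, gamma_ref, lam):
    losses = [zero_infinity_loss(margin(X[i], y[i], th, th0), gamma_ref)
              for i in range(len(y))]
    return np.mean(losses) + lam * np.linalg.norm(th) ** 2

X = np.array([[1.1, 3.1], [1.0, 1.0], [4.0, 2.0]])
y = np.array([1, -1, -1])
th, th0 = np.array([-1.0, 1.0]), -0.5
# Every point meets gamma_ref for this separator: J is finite and positive.
print(J_zero_infinity(X, y, th, th0, gamma_ref=0.1, lam=0.01))
# Flip one label: the separator now misses a point, so its loss (and J) is infinite.
print(J_zero_infinity(X, np.array([1, -1, 1]), th, th0, 0.1, 0.01))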
In real data sets it is relatively rare that the data is linearly separable, so our algorithm should be able to handle this case as well and still work toward an optimal, though imperfect, linear separator.
Instead of using (0, ∞) loss, we should design a loss function that will let us "relax" the constraint that all of the points have margin bigger than γ_ref, while still encouraging large margins.
The hinge loss is one such more relaxed loss function; we will define it in a way that makes a connection to the problem we are facing: L_h(γ / γ_ref) = 0 if γ / γ_ref ≥ 1, and 1 − γ / γ_ref otherwise.
When the margin of the point is greater than or equal to γ_ref, we are happy and the loss is 0; when the margin is less than γ_ref, we incur a positive loss that increases the further the margin falls below γ_ref.
3A) Given this definition, if γ_ref is positive, what can we say about L_h, no matter what finite values θ and θ0 take on? Solution: It is always >= 0
Here is a separator and three points. The dotted lines represent the margins determined by γ_ref.
import numpy as np

data = np.array([[1.1, 1, 4], [3.1, 1, 2]])   # each column is a point (x1, x2)
labels = np.array([[1, -1, -1]])              # corresponding labels
th = np.array([[1, 1]]).T                     # separator normal θ
th0 = -4                                      # offset θ0
What is L_h for each point, where γ_ref = √2/2? Enter the values in the same order as the respective points are listed in data. You can do this computationally or by hand. Solution: [0.8, 0.0, 3.0]
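A short computational check of this solution, reusing data, th, and th0 from above and the hinge loss as defined relative to γ_ref:

import numpy as np

def hinge_loss(gamma, gamma_ref):
    # 0 when the margin meets gamma_ref; otherwise grows linearly as it falls short.
    return np.where(gamma / gamma_ref >= 1, 0.0, 1 - gamma / gamma_ref)

data = np.array([[1.1, 1, 4], [3.1, 1, 2]])
labels = np.array([[1, -1, -1]])
th = np.array([[1, 1]]).T
th0 = -4
gamma_ref = np.sqrt(2) / 2

margins = labels * (th.T @ data + th0) / np.linalg.norm(th)
print(hinge_loss(margins, gamma_ref))   # ≈ [0.8, 0.0, 3.0]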
Putting hinge loss and regularization together, we can look at the regularized average hinge loss: J(θ, θ0) = (1/n) Σ_i L_h(y^(i)(θ·x^(i) + θ0)) + λ∥θ∥², i.e., the hinge loss above evaluated with γ_ref = 1/∥θ∥ (so that γ / γ_ref = y(θ·x + θ0)), plus the regularizer λ∥θ∥².
4A) If the data is linearly separable and we use the SVM objective, and we now let λ = 0 and find the minimizing values of θ, θ0, what will happen? Solution: The minimal objective value will be 0. With λ = 0 only the average hinge loss remains, and for separable data we can scale a separating (θ, θ0) up until every point has y(θ·x + θ0) ≥ 1, driving the hinge loss, and hence the objective, to 0.
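A quick numeric illustration of this (the separator θ = (-1, 1), θ0 = -0.5 is just one hypothetical choice that separates the three points above): with λ = 0 the objective is the average hinge loss alone, and scaling the separator up drives it to 0.

import numpy as np

def avg_hinge(X, y, th, th0):
    # Hinge loss in the gamma_ref = 1/||theta|| convention:
    # L_h(v) = max(0, 1 - v) applied to the agreement v = y * (theta . x + theta_0).
    agreements = y * (X @ th + th0)
    return np.mean(np.maximum(0.0, 1 - agreements))

X = np.array([[1.1, 3.1], [1.0, 1.0], [4.0, 2.0]])
y = np.array([1, -1, -1])
th, th0 = np.array([-1.0, 1.0]), -0.5   # hypothetical separator for these points

# Scaling the separator keeps it correct but pushes every agreement past 1,
# so with lambda = 0 the objective can be driven to 0.
for c in [1, 2, 10, 100]:
    print(c, avg_hinge(X, y, c * th, c * th0))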
Hint: the separators in the plots below are all results of the optimization, so the optimal J(θ, θ0) is the same in each plot.
4B) Consider the following plots of separators. They are for λ values of 0 and 0.001. Match each λ to a plot.
∥θ∥_a = ((-0.0737901)**2 + 2.40847205**2)**0.5 = 2.409602165190182; similarly, ∥θ∥_b = 2.5677.
So ∥θ∥: a < b. Since γ_ref = 1/∥θ∥, it follows that γ_ref: a > b.
But the difference in γ is small, so the hinge losses L_h can be treated as essentially the same (both close to 0).
Since J(θ, θ0) is the same for both, λ∥θ∥**2 must also be the same, and therefore λ_a > λ_b.
[lambdaA, lambdaB] Solution: [0.001, 0.0]
4C) Consider the following three plots of separators. They are for λ values of 0, 0.001, and 0.03. Match to the plot.
[lambdaA, lambdaB, lambdaC] Solution: [0.03, 0.0, 0.001]. A larger λ means less overfitting (heavier regularization), which can cause misclassifications on the training data, so plot A must be λ = 0.03. For the remaining plots B and C, the reasoning is the same as in 4B:
Both separate the data set perfectly, so their optimal J(θ, θ0) can be regarded as the same.
Since ∥θ∥: B > C,
it follows that λ_B < λ_C.
The training objective for the Support Vector Machine (with slack) can be seen as optimizing a balance between the average hinge loss over the examples and a regularization term that tries to keep θ small (or equivalently, increase the margin).
This balance is set by the regularization parameter λ. Here we only consider the case without the offset parameter θ0 (setting it to zero) and rewrite the training objective as an average, so that it is given by J(θ) = (1/n) Σ_i L_h(y^(i) θ·x^(i)) + λ∥θ∥².
In this problem we will optimize the training objective using a single training example, so that we can gain a better understanding of how the regularization parameter, λ, affects the result. To this end, we refer to the single training example as the feature vector and label pair (x, y). We will then try to find a θ that minimizes J(θ) = L_h(y θ·x) + λ∥θ∥².
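A minimal numeric sketch of this single-example problem, assuming the objective form above (with λ∥θ∥²; some statements use (λ/2)∥θ∥², which only rescales λ) and a made-up example (x, y); it just locates the minimizer numerically for a few values of λ.

import numpy as np
from scipy.optimize import minimize

def J_single(th, x, y, lam):
    # One-example objective: hinge loss on the agreement plus lambda * ||theta||^2.
    return max(0.0, 1 - y * (x @ th)) + lam * (th @ th)

x = np.array([1.0, 2.0])   # hypothetical single training example
y = 1

for lam in [0.1, 1.0, 10.0]:
    res = minimize(lambda th: J_single(th, x, y, lam), x0=np.zeros(2), method="Nelder-Mead")
    th_star = res.x
    print(lam, th_star, y * (x @ th_star))
# For small lambda the minimizer reaches agreement 1 (margin exactly at the hinge);
# for large lambda, keeping theta small wins and the agreement stays below 1.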
margin: the distance from the line to the data point closest to it.
Consider the following data, and two potential separators (red and blue).
The situation is illustrated in the figure below.
What are the values of each score (S_sum, S_min, S_max) on the red separator? Solution: [31.5, -1.5, 8.2]
Which of these separators maximizes S_sum? Solution: red
Which separator maximizes S_min? (This is the one we mainly care about!) Solution: blue
1F) Which score function should we prefer if our goal is to find a separator that generalizes better to new data? Solution: S_min
S_min is the score function that picks the blue separator, which is preferred in this case because it does not make any mistakes on the training data while having the same complexity as the red separator.
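The score definitions are not restated in this excerpt; assuming S_sum, S_min, and S_max are the sum, minimum, and maximum of the per-point signed margins, a sketch like the following computes them (the data and separator here are hypothetical, just to show the calculation):

import numpy as np

def scores(X, y, th, th0):
    # Assumed definitions: sum, min, and max of the signed margins.
    margins = y * (X @ th + th0) / np.linalg.norm(th)
    return margins.sum(), margins.min(), margins.max()

X = np.array([[1.0, 1.0], [2.0, 0.5], [3.0, 3.0]])
y = np.array([1, -1, 1])
th, th0 = np.array([1.0, 1.0]), -3.0
s_sum, s_min, s_max = scores(X, y, th, th0)
print(s_sum, s_min, s_max)   # a negative S_min flags a misclassified point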