Calemsy / Machine-Learning-2017-Fall

Hung-yi Lee

Part 6. Logistic Regression #6

Open Calemsy opened 5 years ago

Calemsy commented 5 years ago

Step 1: Function Set

For a binary classification problem, we want to find the posterior probability $P(C_1|x)$: if $P(C_1|x) \gt 0.5$ the class is $C_1$, and if $P(C_1|x) \lt 0.5$ the class is $C_2$. By computing $\mu_1,\mu_2,\Sigma$ we can obtain:

$$P(C_1|x) = \sigma(z)$$

where: $$\sigma(z) = \frac{1}{1 + \exp(-z)}, \quad z = w \cdot x + b$$

Function Set: $$f_{w, b}(x) = P_{w, b}(C_1|x)$$ including all different $w$ and $b$.
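A minimal numpy sketch of this function set (the names `sigmoid` and `f_wb` are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def f_wb(x, w, b):
    """P(C_1 | x) for one choice of (w, b); x and w are 1-D feature vectors."""
    z = np.dot(w, x) + b
    return sigmoid(z)

# Predict class C_1 if the posterior is above 0.5, otherwise C_2.
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.3])
b = 0.1
print(f_wb(x, w, b) > 0.5)
```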


Step 2: Goodness of a Function

For $N$ training examples $\{(x^1, C_1),(x^2, C_1),(x^3, C_2),\cdots,(x^N, C_1)\}$, assume the data was generated by the posterior probability we defined. The likelihood of a particular $(w, b)$ generating the training data is then:

$$L(w, b) = f_{w, b}(x^1)f_{w, b}(x^2)\big(1 - f_{w, b}(x^3)\big)\cdots f_{w, b}(x^N)$$

$$w^{*}, b^{*} = \arg\max_{w, b} L(w, b)$$

which is equivalent to

$$w^{*}, b^{*} = \arg\min_{w, b} -\ln L(w, b)$$

$$ \begin{align} -\ln L(w, b) &= -\ln f_{w, b}(x^1)f_{w, b}(x^2)\big(1 - f_{w, b}(x^3)\big)\cdots f_{w, b}(x^N)\\ & = -\ln f_{w, b}(x^1) - \ln f_{w, b}(x^2) - \ln\big(1 - f_{w, b}(x^3)\big) - \cdots - \ln f_{w, b}(x^N) \end{align} $$

Set the label of $C_1$ to 1, i.e. $\hat{y} = 1$, and the label of $C_2$ to 0, i.e. $\hat{y} = 0$. Every term above, whether it is $\ln f_{w, b}(x^n)$ or $\ln\big(1 - f_{w, b}(x^n)\big)$, can then be written uniformly as $$\hat{y}^n \ln f(x^n) + (1 - \hat{y}^n)\ln(1 - f(x^n))$$

$$w^{*}, b^{*} = \arg\min_{w, b} \sum_{n}^{N} -[\hat{y}^n \ln f(x^n) + (1 - \hat{y}^n)\ln(1 - f(x^n))]$$

where $-[\hat{y}^n \ln f(x^n) + (1 - \hat{y}^n)\ln(1 - f(x^n))]$ is the cross entropy between two Bernoulli distributions.

Distribution $p$: $p(x=1) = \hat{y}^n,\ p(x=0) = 1 - \hat{y}^n$; distribution $q$: $q(x=1) = f(x^n),\ q(x=0) = 1 - f(x^n)$. The cross entropy of $p$ and $q$ is $H(p, q) = -\sum_x p(x)\ln q(x)$.

$$J(w, b) = \sum_{n}^{N} -[\hat{y}^n \ln f(x^n) + (1 - \hat{y}^n)\ln(1 - f(x^n))] = \sum_{n}^{N}C(f(x^n), \hat{y}^n)$$
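As a sketch, the loss $J(w, b)$ above can be computed directly; the names below (`f` for the predicted posteriors, `y_hat` for the 0/1 labels) are illustrative:

```python
import numpy as np

def cross_entropy_loss(f, y_hat, eps=1e-12):
    """J(w, b) = sum_n -[ y_hat^n ln f(x^n) + (1 - y_hat^n) ln(1 - f(x^n)) ].

    f     : predicted posteriors f_{w,b}(x^n), shape (N,)
    y_hat : labels, 1 for C_1 and 0 for C_2, shape (N,)
    eps   : small constant to avoid log(0)
    """
    f = np.clip(f, eps, 1.0 - eps)
    return -np.sum(y_hat * np.log(f) + (1.0 - y_hat) * np.log(1.0 - f))

# Example matching the data pattern above: x^1, x^2 in C_1, x^3 in C_2, ...
f = np.array([0.9, 0.8, 0.2, 0.7])
y_hat = np.array([1, 1, 0, 1])
print(cross_entropy_loss(f, y_hat))
```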

Step 3: Find the best function

$$\frac{\partial J(w, b)}{\partial w_i} = \sum_{n}^{N}-\Big[\hat{y}^n\frac{1}{\sigma(z^n)}\frac{\partial \sigma(z^n)}{\partial w_i} - (1 - \hat{y}^n)\frac{1}{1 - \sigma(z^n)}\frac{\partial \sigma(z^n)}{\partial w_i}\Big]$$

$$\frac{\partial\sigma(z^n)}{\partial w_i} = \frac{\partial\sigma(z^n)}{\partial z^n}\frac{\partial z^n}{\partial w_i} = \sigma(z^n)(1 - \sigma(z^n))x_i$$

$$ \begin{align} \frac{\partial J(w, b)}{\partial w_i} &= \sum_{n}^{N}-[\hat{y}^n(1 - \sigma(z^n))x_i - (1 - \hat{y}^n)\sigma(z^n)x_i]\\ & = \sum_{n}^{N}-[\hat{y}^n - \hat{y}^n\sigma(z^n) - \sigma(z^n) + \hat{y}^n\sigma(z^n)]x_i\\ & = \sum_{n}^{N}-(\hat{y}^n - \sigma(z^n))x_i \\ & = \sum_{n}^{N}-(\hat{y}^n - f_{w, b}(x^n))x_i \end{align} $$

$$w_i \leftarrow w_i - \eta\sum_{n}^{N}-(\hat{y}^n - f_{w, b}(x^n))x_i$$
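A minimal gradient-descent sketch following this update rule, assuming the training examples are stacked row-wise in a matrix `X` (names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(w, b, X, y_hat, eta):
    """One update: w_i <- w_i - eta * sum_n -(y_hat^n - f_{w,b}(x^n)) x_i^n."""
    f = sigmoid(X @ w + b)          # f_{w,b}(x^n) for every example
    error = y_hat - f               # y_hat^n - f_{w,b}(x^n)
    grad_w = -X.T @ error           # sum over n of -(error) * x_i^n
    grad_b = -np.sum(error)
    return w - eta * grad_w, b - eta * grad_b

# Toy usage
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y_hat = np.array([1.0, 1.0, 1.0, 0.0])
w, b = np.zeros(2), 0.0
for _ in range(100):
    w, b = gradient_step(w, b, X, y_hat, eta=0.1)
print(w, b)
```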


Why not logistic regression + square error?

If logistic regression were trained with mean square error, the cost function would be:

$$J(w, b) = \frac{1}{2}\sum_{n}^{N}\big(f_{w, b}(x^n) - \hat{y}^n\big)^2$$

In this case: $$\frac{\partial \big(f_{w, b}(x^n) - \hat{y}^n\big)^2}{\partial w_i} = 2\big(f_{w, b}(x^n) - \hat{y}^n\big)\frac{\partial f_{w, b}(x^n)}{\partial z}\frac{\partial z}{\partial w_i} = 2\big(f_{w, b}(x^n) - \hat{y}^n\big)f_{w, b}(x^n)\big(1 - f_{w, b}(x^n)\big)x_i$$

When $\hat{y}^n = 1$: if $f_{w, b}(x^n) = 1$ (close to the target) the gradient is 0, which is fine; but if $f_{w, b}(x^n) = 0$ (far from the target) the gradient is also 0.

The same problem exists when $\hat{y}^n = 0$. So if square error is used with logistic regression, the gradient is small near the target, which is reasonable; but far from the target the gradient is also small, which makes learning very difficult. With cross entropy, the gradient is large far from the target (since it is proportional to $\hat{y}^n - f_{w, b}(x^n)$).
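A quick numerical check of this point (a sketch; the $x_i$ factor is dropped since it is common to both gradients):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y_hat = 1.0
for z in [-10.0, 0.0, 10.0]:                   # far from target, undecided, close to target
    f = sigmoid(z)
    grad_mse = 2 * (f - y_hat) * f * (1 - f)   # square-error gradient (x_i factor dropped)
    grad_ce = -(y_hat - f)                     # cross-entropy gradient (x_i factor dropped)
    print(f"z={z:6.1f}  f={f:.4f}  square-error grad={grad_mse:+.6f}  cross-entropy grad={grad_ce:+.6f}")
```

Far from the target ($z = -10$) the square-error gradient is nearly zero, while the cross-entropy gradient is close to 1.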


Calemsy commented 5 years ago

Discriminative vs. Generative

Logistic regression is a discriminative method, while the approach from the previous post, which models the posterior probability with Gaussian distributions, is a generative method. In fact, whether we use logistic regression or the probabilistic model, the function set is the same (provided the generative model shares a covariance matrix between the two classes): $$P(C_1|x) = \sigma(w^T x + b)$$

Logistic regression solves for the optimal $w, b$ directly from the training data, whereas the probabilistic model starts from the assumption that the class-conditional distributions are Gaussian and obtains $w, b$ by computing $\mu_1, \mu_2, \Sigma$. (The same function set, but a different function is selected from the same training data.)

Generally speaking, the discriminative model usually performs better than the generative model.

Toy Example

(Toy dataset: each example has two binary features; class 1 contains a single example $(1, 1)$ and class 2 contains 12 examples.)

$$P(C_1) = \frac{1}{13}, P(C_2) = \frac{12}{13}$$

$$P(x_1 = 1|C_1) = 1, P(x_1 = 1|C_2) = \frac13, P(x_2 = 1|C_1) = 1, P(x_2 = 1|C_2) = \frac13$$

$$P(C_1|x) = \frac{P(C_1)P(x|C_1)}{P(C_1)P(x|C_1) + P(C_2)P(x|C_2)} = \frac{\frac{1}{13}\times 1\times 1}{\frac{1}{13}\times 1\times 1 + \frac{12}{13} \times \frac13 \times \frac13} < 0.5$$

So the result given by the generative model is that the input $(1, 1)$ belongs to class 2.

This is because the generative model assumes the two features are independent given the class, so it believes class 2 could also generate $(1, 1)$ even though no such example appears in the training data.

If we use a discriminative model (logistic regression) instead, the classification result is class 1.
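A sketch reproducing the generative calculation above; the counts are reconstructed from the stated probabilities (one $(1,1)$ example in class 1, twelve class-2 examples with each feature equal to 1 a third of the time, features treated as independent given the class):

```python
# Generative (naive-Bayes-style) posterior for the toy example above.
p_c1, p_c2 = 1 / 13, 12 / 13
p_x1_c1, p_x2_c1 = 1.0, 1.0          # P(x_1=1|C_1), P(x_2=1|C_1)
p_x1_c2, p_x2_c2 = 1 / 3, 1 / 3      # P(x_1=1|C_2), P(x_2=1|C_2)

# P(C_1 | x=(1,1)) with features assumed independent given the class
num = p_c1 * p_x1_c1 * p_x2_c1
den = num + p_c2 * p_x1_c2 * p_x2_c2
posterior = num / den
print(posterior)   # ~0.43 < 0.5, so the generative model assigns (1, 1) to class 2
```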

Summary:

Multi-class Classification


Loss function: Cross entropy

$$L(w, b) = -\sum_{n=1}^{N}\sum_{c=1}^{C}\hat{y}^n_c \ln(y^n_c)$$ where $C$ is the number of classes and $N$ is the number of examples.
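A minimal sketch of this multi-class cross entropy, assuming the outputs $y^n_c$ come from a softmax (names are illustrative):

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; z has shape (N, C)."""
    z = z - z.max(axis=1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def multiclass_cross_entropy(y, y_hat, eps=1e-12):
    """L = -sum_n sum_c y_hat^n_c ln(y^n_c); y and y_hat have shape (N, C)."""
    return -np.sum(y_hat * np.log(np.clip(y, eps, 1.0)))

# Example with N=2 examples and C=3 classes
z = np.array([[2.0, 1.0, 0.1],
              [0.5, 2.5, 0.3]])
y = softmax(z)
y_hat = np.array([[1, 0, 0],
                  [0, 1, 0]])              # one-hot labels
print(multiclass_cross_entropy(y, y_hat))
```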

Calemsy commented 5 years ago

Limitation of Logistic Regression

(Figure: a two-class dataset that is not linearly separable.)

Clearly, logistic regression cannot correctly classify the data above, but this can be solved with a feature transform: $$x_1^{'} = \text{distance to } [0, 0]^{T}, \quad x_2^{'} = \text{distance to } [1, 1]^{T}$$
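As a sketch of what this transform does: $(0,1)$ and $(1,0)$ land on the same point in the new space, while $(0,0)$ and $(1,1)$ move away from it, so a single linear boundary becomes possible:

```python
import numpy as np

def feature_transform(x):
    """x' = (distance to [0,0], distance to [1,1])."""
    x = np.asarray(x, dtype=float)
    return np.array([np.linalg.norm(x - np.array([0.0, 0.0])),
                     np.linalg.norm(x - np.array([1.0, 1.0]))])

for point in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(point, "->", np.round(feature_transform(point), 3))
# (0,0) -> [0, 1.414], (1,1) -> [1.414, 0], while (0,1) and (1,0) both map to [1, 1]
```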


However, such a feature transform requires domain knowledge. Instead, it can be learned automatically by stacking several logistic regressions: the feature transform can be achieved by cascading logistic regression models.


The blue and green logistic regressions are responsible for feature extraction, and the red one for classification. The sketch below illustrates how this works.

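A small sketch of the cascading idea, with hand-picked weights and an assumed XOR-like labeling of the four corner points (class 1 for $(0,1)$ and $(1,0)$, class 2 for $(0,0)$ and $(1,1)$); two logistic regressions transform the features and a third classifies on top of them:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cascade(x1, x2):
    """Two 'feature' logistic regressions feed one 'classifier' logistic regression."""
    h1 = sigmoid(20 * x1 + 20 * x2 - 10)    # roughly: x1 OR x2
    h2 = sigmoid(20 * x1 + 20 * x2 - 30)    # roughly: x1 AND x2
    return sigmoid(20 * h1 - 20 * h2 - 10)  # high only when exactly one of x1, x2 is 1

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), round(cascade(x1, x2), 3))
# Output is close to 1 for (0,1) and (1,0) and close to 0 for (0,0) and (1,1),
# a separation no single logistic regression on the raw features can produce.
```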

Each of these logistic regressions is called a neuron; connecting these neurons together gives a neural network.

This is Deep Learning!