getElementsByName commented 4 years ago

https://mathworld.wolfram.com/L2-Norm.html https://developers.google.com/machine-learning/crash-course

getElementsByName commented 4 years ago

L2 vs. L1 comparison: https://dailyheumsi.tistory.com/57

getElementsByName commented 4 years ago

Xavier initialization

http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
Initialize so that each layer's activation output keeps mean 0 and variance 1

getElementsByName commented 4 years ago

Key concepts

Recall that different types of initializations lead to different results
Recognize the importance of initialization in complex neural networks.
Recognize the difference between train/dev/test sets
Diagnose the bias and variance issues in your model
Learn when and how to use regularization methods such as dropout or L2 regularization.
Understand experimental issues in deep learning such as Vanishing or Exploding gradients and learn how to deal with them
Use gradient checking to verify the correctness of your backpropagation implementation

getElementsByName commented 4 years ago

Setting up your Machine Learning Application

Train / Dev / Test sets

Bias / Variance

bias-variance tradeoff

Basic Recipe for Machine Learning

getElementsByName commented 4 years ago

Regularizing your neural network

Regularization

https://mathworld.wolfram.com/L2-Norm.html

Why regularization reduces overfitting?

Dropout Regularization

Understanding Dropout

Other regularization methods

getElementsByName commented 4 years ago

Setting up your optimization problem

Normalizing inputs

Vanishing / Exploding gradients

Weight Initialization for Deep Networks

Numerical approximation of gradients

Gradient checking

Gradient Checking Implementation Notes

getElementsByName commented 4 years ago

Programming assignments

Initialization

Regularization

Gradient Checking

getElementsByName commented 4 years ago

A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data.
In statistics, the bias (or bias function) of an estimator is the difference between this estimator's expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called unbiased.
https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff

training/dev/test set

bias / overfitting

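Interpreting the prediction model's cost
bias
- the training set is not predicted well
  - the labels (the true values) are estimated poorly -> high error
- fixes
  - more units per layer
  - more layers
  - more iterations
overfitting
- the model has learned to be too sensitive to particular inputs (an error from sensitivity to small fluctuations in the training set)
- the error changes a lot when the input changes
  - can also be read as a high variance of the cost across input sets
- fix: regularization
  - use the weights sparingly
  - randomly drop neurons to make the net more stable
    - lets the model focus on features shared by many examples (general characteristics)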

Fixing overfitting

Detecting overfitting
- is the variance high?
- validation/dev/development set
- generalizing from a training set to the dev set
regularization
- L2 regularization
- dropout
- data augmentation + more data
- early stopping (training set error vs dev set error)

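How to speed up training

Adjust inputs/outputs so that they follow a consistent distribution (mean 0, variance 1)
normalizing input
weight initialization
- because of the vanishing/exploding gradients problem, setting suitable initial values makes training faster
- tanh -> Xavier initialization
- ReLU -> He initialization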

(gradient checking)

getElementsByName commented 4 years ago

(week1) Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization

Key concepts

Recall that different types of initializations lead to different results
Recognize the importance of initialization in complex neural networks.
Recognize the difference between train/dev/test sets
Diagnose the bias and variance issues in your model
Learn when and how to use regularization methods such as dropout or L2 regularization.
Understand experimental issues in deep learning such as Vanishing or Exploding gradients and learn how to deal with them
Use gradient checking to verify the correctness of your backpropagation implementation

Terminology before we begin

A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data.
In statistics, the bias (or bias function) of an estimator is the difference between this estimator's expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called unbiased.

Setting up your Machine Learning Application

Train / Dev / Test sets

Iterative Process

Even very experienced deep learning people find it almost impossible to correctly guess the best choice of hyperparameters the very first time.
- Intuitions from one domain or from one application area often do not transfer to other application areas.
- And the best choices may depend on the amount of data you have, the number of input features you have through your computer configuration and whether you're training on GPUs or CPUs.

Data Partitioning

You can greatly reduce your chances of overfitting by partitioning the data set into the three subsets shown in the following figure:
- https://developers.google.com/machine-learning/crash-course/validation/another-partition
Subsets
- train set
- development (dev) set, also called validation set
- test set
Ratios
- small data sets (100, 1,000, 10,000 examples)
  - train : test = 70% : 30%
  - train : dev : test = 60% : 20% : 20%
- large data sets (1,000,000+ examples)
  - train : dev : test = the rest : 10,000 examples : 10,000 examples
  - e.g. 1,000,000 examples -> 98% : 1% : 1%
Not having a test set might be okay. (Only dev set.)
- when an unbiased estimate (from a test set) is not needed
- (when there is no test set, the dev set is sometimes mislabeled as the "test set", which causes confusion)
  - if you are overfitting to the "test set", it should be called a dev set
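
The large-data split above can be sketched with NumPy as follows. This is only an illustrative sketch: `split_indices` and its parameters are hypothetical names, not something from the course, and it assumes the examples can simply be shuffled at random.

```python
import numpy as np

def split_indices(m, dev_size, test_size, seed=0):
    """Shuffle m example indices and split them into train/dev/test index arrays.

    Hypothetical helper: dev and test get fixed counts, train gets the rest.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)
    test_idx = perm[:test_size]
    dev_idx = perm[test_size:test_size + dev_size]
    train_idx = perm[test_size + dev_size:]
    return train_idx, dev_idx, test_idx

# 1,000,000 examples with 10,000 each for dev and test
train_idx, dev_idx, test_idx = split_indices(1_000_000, 10_000, 10_000)
print(len(train_idx) / 1_000_000)  # fraction of the data left for training
```

Using fixed counts rather than percentages for dev/test mirrors the "the rest : 10,000 : 10,000" rule in the notes.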

Mismatched train/test distribution

Make sure dev and test come from same distribution
- example of mismatched distributions
  - training set: cat pictures taken from web pages
  - dev/test sets: cat pictures taken by users with the app

Other references

https://tensorflow.blog/%EB%A8%B8%EC%8B%A0-%EB%9F%AC%EB%8B%9D%EC%9D%98-%EB%AA%A8%EB%8D%B8-%ED%8F%89%EA%B0%80%EC%99%80-%EB%AA%A8%EB%8D%B8-%EC%84%A0%ED%83%9D-%EC%95%8C%EA%B3%A0%EB%A6%AC%EC%A6%98-%EC%84%A0%ED%83%9D-1/

Bias / Variance

Basic Recipe for Machine Learning

Interpreting the prediction model's cost

bias

The training set is not predicted well
- the labels (the true values) are estimated poorly
- large gap from the optimal (Bayes) error (the training error is high)

Fixing bias

bigger network
- more hidden units
- more hidden layers
train longer (more iterations)
(search for a different neural network architecture)

overfitting

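The model is fitted too closely to the fine details of the training set, so its error varies widely on other data sets
The model has learned to be too sensitive to particular inputs (an error from sensitivity to small fluctuations in the training set)
The error changes a lot when the input changes
- can also be read as a high variance of the cost across input sets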

Fixing overfitting

Detecting overfitting
- validation/dev/development set
- compare training set error with dev set error
regularization
- L2 regularization
  - use the weights sparingly
- dropout
  - randomly drop neurons to make the net more stable
    - lets the model focus on features shared by many examples (general characteristics)
- data augmentation + more data
- early stopping
- (search for a different neural network architecture)
bias-variance tradeoff
- In deep learning today, bigger networks, more data, and many other tools make it easier to reduce bias and variance together without one hurting the other, so the tradeoff is discussed less than it used to be.
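
The diagnosis described above (compare training error with the Bayes error, and dev error with training error) could be sketched like this. The function name `diagnose` and the threshold `tol` are assumptions made for illustration; the notes do not fix a specific threshold.

```python
def diagnose(train_error, dev_error, bayes_error=0.0, tol=0.02):
    """Label a model from its error rates (given as fractions).

    tol is a hypothetical threshold for calling a gap "large".
    """
    issues = []
    if train_error - bayes_error > tol:
        issues.append("high bias")       # training set itself is fit poorly
    if dev_error - train_error > tol:
        issues.append("high variance")   # does not generalize to the dev set
    return issues or ["ok"]

print(diagnose(0.15, 0.16))  # poor fit on train, small train/dev gap
print(diagnose(0.01, 0.11))  # good fit on train, large train/dev gap
```

A "high bias" result points at the bigger-network / train-longer fixes, while "high variance" points at regularization or more data.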

Regularizing your neural network

Regularization

Why regularization reduces overfitting?

Applied to w rather than b
- (w accounts for most of the parameters)
L2 regularization is the most common type of regularization.
L1 regularization to make your model sparse, helps only a little bit. So I don't think it's used that much, at least not for the purpose of compressing your model.
- And what that means is that the w vector will have a lot of zeros in it. And some people say that this can help with compressing the model, because the set of parameters are zero, and you need less memory to store the model.
- https://developers.google.com/machine-learning/crash-course/regularization-for-sparsity/l1-regularization
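Frobenius norm (a matrix norm)
- the matrix counterpart of the L2 norm (a vector norm)
- https://mathworld.wolfram.com/L2-Norm.html
weight decay (shrinks the weights)
- on each parameter update, w is multiplied by a factor slightly smaller than 1, namely (1 - alpha * lambda / m)
- gradually reduces the influence of hidden units (-> a somewhat simpler network)
Uses a limited weight budget sparingly
- penalizes the weights for being too large.
- lets the model focus on features shared by many examples (general characteristics)
Implementation
- applied to the cost function and to backward propagation (the weight derivatives)
- lambda: regularization parameter
  - lambd: Python variable name (lambda is a reserved keyword)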
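
As a rough sketch of the two places where the L2 term enters (cost and weight gradient), assuming weight matrices stored as NumPy arrays and `m` training examples. The function names are hypothetical; `lambd` follows the naming note above.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """Cost term: (lambd / (2m)) * sum of squared Frobenius norms of all W."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

def update_with_weight_decay(W, dW_unreg, lambd, m, learning_rate):
    """One gradient step; the L2 term adds (lambd / m) * W to the gradient."""
    dW = dW_unreg + (lambd / m) * W
    return W - learning_rate * dW

W = np.ones((2, 2))
# With a zero data gradient, the step only shrinks W by (1 - learning_rate * lambd / m).
W_new = update_with_weight_decay(W, np.zeros((2, 2)), lambd=1.0, m=10, learning_rate=0.5)
print(W_new)
```

Expanding the update gives `W * (1 - learning_rate * lambd / m) - learning_rate * dW_unreg`, which is why the method is called weight decay.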

Dropout Regularization

Understanding Dropout

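keep_prob: probability that a hidden unit is kept
- e.g. keep_prob = 0.8 (each hidden unit is randomly dropped with probability 0.2)
- mostly applied to hidden layers
Inverted dropout: rescale the activations to restore their expected value
- dropping units shrinks the expected activation by a factor of keep_prob, so the surviving activations are divided by keep_prob
Makes the network more stable
Widely used in computer vision
- there is usually not enough data, so overfitting occurs
Implementation
- applied in forward / backward propagation
- not applied at test time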
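
A minimal sketch of inverted dropout for a single layer's activations, assuming NumPy arrays. The function names are illustrative, not from the assignment; the same mask is reused in the backward pass.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Drop each unit with probability 1 - keep_prob, then rescale."""
    mask = rng.random(A.shape) < keep_prob   # True where the unit is kept
    A_dropped = A * mask / keep_prob         # dividing keeps the expected value of A
    return A_dropped, mask

def dropout_backward(dA, mask, keep_prob):
    """Apply the same mask and scaling to the upstream gradient."""
    return dA * mask / keep_prob

rng = np.random.default_rng(0)
A = np.ones((100, 1000))
A_dropped, mask = dropout_forward(A, keep_prob=0.8, rng=rng)
print(A_dropped.mean())  # stays close to the original mean of 1.0
```

At test time neither function is called: the activations are used as they are, which works because the training-time scaling already preserved their expected value.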

Other regularization methods

data augmentation

e.g. flipping, rotating, cropping, or distorting images
- this isn't as good as if you had collected an additional set of brand new independent examples.
- but it is a cheap way to add data
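
Two of the augmentations listed above (horizontal flip and random crop) can be sketched with plain NumPy slicing. The `augment` helper, the 0.5 flip probability, and the crop size are assumptions chosen for illustration only.

```python
import numpy as np

def augment(image, rng, crop=24):
    """Randomly flip an (H, W, C) image horizontally, then take a random crop."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]                # mirror left-right
    h, w, _ = image.shape
    top = rng.integers(0, h - crop + 1)          # upper-left corner of the crop
    left = rng.integers(0, w - crop + 1)
    return image[top:top + crop, left:left + crop, :]

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                  # stand-in for a real picture
sample = augment(image, rng)
print(sample.shape)
```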

early stopping

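Stop once the dev set error starts to rise while the training set error keeps falling
- has the advantage of reducing computation time
Not ideal from an orthogonalization point of view
- each step should focus on a single problem
- during training, address only bias (optimizing the cost) and deal with overfitting separately afterwards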
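
One common way to implement the stopping rule is a "patience" counter over the recorded dev errors. This is a sketch under that assumption; the notes do not prescribe a specific criterion, and `early_stopping_epoch` and `patience` are hypothetical names.

```python
def early_stopping_epoch(dev_errors, patience=3):
    """Return the epoch with the best dev error seen before stopping.

    Scanning stops once the dev error has not improved for `patience` epochs.
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(dev_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err
        elif epoch - best_epoch >= patience:
            break                                # dev error has stalled or risen
    return best_epoch

history = [0.5, 0.4, 0.3, 0.35, 0.4, 0.45, 0.2]
print(early_stopping_epoch(history, patience=3))
```

With `patience=3` the scan stops before reaching the last entry, which illustrates the tradeoff: computation is saved, but a later improvement can be missed.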

Setting up your optimization problem

Normalizing inputs

Feature scaling to mean 0, variance 1
The optimum can be reached faster
- with the same learning rate, the optimal weight for each feature can be found at a similar speed
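
A small sketch of the normalization step, assuming inputs shaped `(n_features, m)` as in the course notation. The helper names and the small `eps` guard against division by zero are assumptions.

```python
import numpy as np

def fit_normalizer(X_train, eps=1e-8):
    """Compute per-feature mean and standard deviation on the training set."""
    mu = X_train.mean(axis=1, keepdims=True)
    sigma = X_train.std(axis=1, keepdims=True) + eps
    return mu, sigma

def normalize(X, mu, sigma):
    """Shift and scale X using statistics computed on the training set."""
    return (X - mu) / sigma

rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 3.0, size=(2, 1000))
mu, sigma = fit_normalizer(X_train)
X_norm = normalize(X_train, mu, sigma)
print(X_norm.mean(axis=1), X_norm.std(axis=1))
```

The same `mu` and `sigma` from the training set should be reused for the dev and test sets, so that all data goes through an identical transformation.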

Vanishing / Exploding gradients

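Consider a simplified model of a deep network (one with many layers)
- because the weights are multiplied together layer after layer, the result shrinks toward 0 if the weights' absolute values are below 1 and blows up if they are above 1
- the activations become very large/small -> the gradients become very large/small -> training becomes difficult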
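
The simplified model can be tried numerically. The sketch below assumes a linear network with no activation function and the same weight matrix `scale * I` in every layer; `deep_linear_output` is a hypothetical name.

```python
import numpy as np

def deep_linear_output(scale, num_layers, x):
    """Pass x through num_layers identical linear layers W = scale * I."""
    W = scale * np.eye(x.shape[0])
    a = x
    for _ in range(num_layers):
        a = W @ a                     # each layer multiplies by scale again
    return a

x = np.ones((2, 1))
exploded = deep_linear_output(1.5, 50, x)   # grows like 1.5 ** 50
vanished = deep_linear_output(0.5, 50, x)   # shrinks like 0.5 ** 50
print(exploded.ravel(), vanished.ravel())
```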

Weight Initialization for Deep Networks

Because of the vanishing/exploding gradients problem, choosing suitable initial parameters makes training faster
Initialize so that each layer's activation output keeps mean 0 and variance 1
- http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
The more inputs a unit has, the smaller each w must be to keep the output at the same scale
- to reduce the variance of w, Gaussian random variables are multiplied by a scaling factor
https://reniew.github.io/13/
- tanh -> LeCun or Xavier initialization
- ReLU -> He initialization
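
A sketch of the scaling described above, assuming the fan-in based variants: variance `2 / n_in` for ReLU (He) and `1 / n_in` for tanh (the LeCun-style form that is often also called Xavier). Note that the linked Glorot and Bengio paper proposes a variant based on `n_in + n_out` instead. `init_weights` is a hypothetical helper.

```python
import numpy as np

def init_weights(n_in, n_out, activation, rng):
    """Draw Gaussian weights and scale them according to the activation."""
    if activation == "relu":
        scale = np.sqrt(2.0 / n_in)      # He initialization
    elif activation == "tanh":
        scale = np.sqrt(1.0 / n_in)      # LeCun / Xavier-style scaling
    else:
        raise ValueError(f"unsupported activation: {activation}")
    return rng.standard_normal((n_out, n_in)) * scale

rng = np.random.default_rng(0)
W1 = init_weights(500, 400, "relu", rng)
print(W1.shape, W1.std())  # std should be near sqrt(2 / 500)
```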

Numerical approximation of gradients

Gradient checking

Gradient Checking Implementation Notes

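The gradient can be approximated from the difference between the results at +epsilon and -epsilon (the two-sided difference, i.e. the two triangles)
Verifying backward propagation
- verified using forward propagation (assumed to be correct because it is simpler to implement)
Numerator: Euclidean distance between the approximated gradient and the backprop gradient
Denominator: normalize by the lengths (the sum of the two gradients' norms)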
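
The check can be sketched for a generic scalar function of a flat parameter vector. Here `f` stands in for forward propagation plus cost and `analytic_grad` for the output of backpropagation; both names, and the toy function used in the example, are assumptions for illustration.

```python
import numpy as np

def numerical_gradient(f, theta, epsilon=1e-7):
    """Two-sided difference approximation of df/dtheta for a 1-D float array."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = epsilon
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * epsilon)
    return grad

def gradient_check(f, analytic_grad, theta, epsilon=1e-7):
    """Relative difference: ||approx - grad|| / (||approx|| + ||grad||)."""
    approx = numerical_gradient(f, theta, epsilon)
    numerator = np.linalg.norm(approx - analytic_grad)
    denominator = np.linalg.norm(approx) + np.linalg.norm(analytic_grad)
    return numerator / denominator

theta = np.array([1.0, -2.0, 3.0])
f = lambda t: np.sum(t ** 2)                     # toy cost with known gradient 2 * t
good = gradient_check(f, 2 * theta, theta)       # correct gradient: tiny difference
bad = gradient_check(f, 3 * theta, theta)        # wrong gradient: large difference
print(good, bad)
```

A very small relative difference suggests the analytic gradient is consistent with the numerical one, while a large value points to a bug in the gradient code.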

getElementsByName / tmp-doc

deeplearning - w2/1 #1
