Attention is all you need

chullhwan-song commented 6 years ago

https://arxiv.org/abs/1706.03762
Worth reading together with the BERT review (the two are almost the same).

chullhwan-song commented 6 years ago

Reference posts
- Attention is all you need paper 뽀개기
- http://nlp.seas.harvard.edu/2018/04/03/attention.html
- https://www.slideshare.net/WhiKwon/attention-mechanism
- https://mchromiak.github.io/articles/2017/Sep/12/Transformer-Attention-is-all-you-need/#.W7XULXtLjmH
- http://www.modulabs.co.kr/DeepLAB_Paper/20167
- https://github.com/YBIGTA/DeepNLP-Study/wiki/Attention-Is-All-You-Need-%EB%85%BC%EB%AC%B8%EB%A6%AC%EB%B7%B0
- https://github.com/strutive07/transformer-tensorflow2.0/blob/master/Attention%20is%20all%20you%20need.pdf

chullhwan-song commented 6 years ago

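The notes below are compiled from the references above. > Still being updated > It is hard going.

why ?

In NLU, attention was first applied successfully to translation.
Earlier uses of attention were built on seq2seq models.
So the motivation for self-attention here is almost the same as the motivation for attention in seq2seq.
- It is therefore worth reading the seq2seq attention papers first.
In short, RNNs are weak on long-term (very long) sequences. = long-term dependency problem
The attention mechanism addresses this by modeling dependencies regardless of how far apart two positions are in the input sequence.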

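A clever idea

At this point someone came up with a strange, but groundbreaking, idea.
Why were RNNs used for sequence data in the first place?
To process the sequence while reflecting the relations between steps, using information from earlier steps.
But a self-attention mechanism can likewise pull in and combine information from other steps of the sequence.
If so, is an RNN really needed? > Transformer (Fig. 1 of the paper)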

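The Transformer: an attention mechanism that, unlike the earlier RNNs, can be processed in parallel
- The big contribution is that it handles parallel processing, a lower operation count, and dependencies across long sentences more effectively than earlier work.
- It later became a major inspiration for work such as BERT, which raised its impact even further.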

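Introduction (self attention)

Compared with earlier work, the Transformer cuts the number of operations needed to relate two positions to a constant. The price is lower effective resolution, because information is averaged over attention-weighted positions; the paper counters this with multi-head attention.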

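Basic structure

Encoder-decoder structure
- No RNN is used.
- It stacks self-attention and position-wise fully connected feed-forward layers. > See Fig. 1 of the paper.
  - In Fig. 1, the left side is the encoder and the right side is the decoder (excluding the linear and softmax layers at the top).
Each side is a stack of N (= 6) identical layers.
encoder
- multi-head attention
- feed forward
- a residual connection around each of the two (followed by layer normalization in the paper)
- treating these as one module, the module is repeated N times
decoder
- almost the same as the encoder, plus:
- masked multi-head attention (self-attention over the decoder's own inputs)
- a multi-head attention over the encoder output
- likewise, treating these as one module, the module is repeated N times
The important piece here is multi-head attention.

attention

- Attention can be read as follows: given an input sequence, if one element of that sequence is strongly related to one element of the output sequence, the model should pay attention to that pair.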
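
Scaled Dot-Product Attention

- Takes Q, K, and V as input.
- The paper itself does not explain these in much detail. > See the BERT review.
- Q: query, K: key, V: value
- As a formula: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
- Reading the formula against the paper's figure:
  - $QK^T$ is a kind of similarity between Q and K. > dot product operation
  - $\frac{1}{\sqrt{d_k}}$ is the scaling factor, a normalization that keeps the dot products from growing too large. It seems intended to stop the softmax gradients from becoming so small that training stalls.
  - Taking the softmax over the Q-K scores tells which positions matter, and multiplying that by V gives the actual attention output.
  - It assumes K and V have some relation to the query: compute the similarity between Q and K, then reflect it onto V.
  - The idea is that this recovers which parts of V the query Q should attend to.
- Redrawn as a diagram: [figure not included]
- In self-attention, Q, K, and V all come from the same input (the output of the previous layer), each through its own learned projection.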
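
To make the steps above concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is only an illustration, not the paper's implementation: the function names and the toy numbers are assumptions made up for this example, and real code would use batched matrix operations.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v, all as lists of lists."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Softmax turns the scores into importance weights that sum to 1.
        w = softmax(scores)
        # The output is the weighted average of the value vectors.
        out = [sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Toy example (made-up numbers): 2 queries, 3 key/value pairs, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, attn = scaled_dot_product_attention(Q, K, V)
```

In the toy example the first query `[1, 0]` has the same dot product with the first and third keys, so those two positions get equal weight and the second key gets less.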
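
multi-head attention

- The Scaled Dot-Product Attention above runs inside this block, once for each of the h heads in the paper's figure.
- As a formula: $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O$, where $\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$
- The decoder's masked multi-head attention
  - The i-th query attends only to positions up to and including i.
  - For a target sequence a, b, c: when predicting a, b and c are masked; when predicting b, c is masked, so attention is given only to earlier positions.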
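
The masking rule and the per-head split can be sketched as follows. This is a simplified illustration under loud assumptions: the learned projections W_i^Q, W_i^K, W_i^V and W^O of the paper are replaced by plain slicing of the model dimension, and the names `masked_attention` and `multi_head_self_attention` are hypothetical helpers, not anything from the paper or a library.

```python
import math

def softmax(xs):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention(Q, K, V):
    """Causal scaled dot-product attention: query i sees keys 0..i only."""
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            if j > i:
                scores.append(float("-inf"))  # future position: masked out
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k))
        w = softmax(scores)
        out = [sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

def multi_head_self_attention(X, h):
    """Assumption: each head just takes a slice of the model dimension
    (standing in for the learned projections), then heads are concatenated."""
    d_model = len(X[0])
    assert d_model % h == 0, "d_model must be divisible by the number of heads"
    d_head = d_model // h
    heads = []
    for i in range(h):
        Xi = [row[i * d_head:(i + 1) * d_head] for row in X]
        head_out, _ = masked_attention(Xi, Xi, Xi)
        heads.append(head_out)
    # Concatenate the h head outputs position by position.
    return [sum((head[t] for head in heads), []) for t in range(len(X))]

# Toy sequence "a, b, c" with made-up 4-dimensional vectors.
X = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]
_, causal_weights = masked_attention(X, X, X)
multi_head_out = multi_head_self_attention(X, h=2)
```

Each row of `causal_weights` is zero to the right of the diagonal, which is the "b and c are masked when predicting a" behaviour described above.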

Position-wise Feed-Forward Networks

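It is called position-wise because the same network is applied separately to each word (= each position).
It consists of two linear transformations with a ReLU between them: $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$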
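
A small sketch of what "position-wise" means in practice, assuming tiny made-up weights (d_model = 2, d_ff = 3; the paper uses 512 and 2048). The helper names are hypothetical and the weights are not learned values.

```python
def linear(x, W, b):
    """x: vector of length d_in, W: d_in x d_out matrix, b: vector of length d_out."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]

def position_wise_ffn(X, W1, b1, W2, b2):
    """Apply FFN(x) = max(0, x W1 + b1) W2 + b2 to every position independently."""
    outputs = []
    for x in X:  # same weights for every position, no mixing between positions
        hidden = [max(0.0, h) for h in linear(x, W1, b1)]  # ReLU
        outputs.append(linear(hidden, W2, b2))
    return outputs

# Made-up weights for illustration only.
W1 = [[1.0, -1.0, 0.0],
      [0.0, 1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]
b2 = [0.0, 0.0]

X = [[1.0, 2.0], [-1.0, 0.0]]          # two positions, d_model = 2
Y = position_wise_ffn(X, W1, b1, W2, b2)  # [[3.0, 3.0], [0.0, 1.0]]
```

Because positions never interact inside this layer, reordering the input rows simply reorders the output rows; all mixing between positions happens in the attention sub-layers.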

Positional Encoding

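In the input part (bottom) of Figure 1, this is applied right after the input/output embedding.
The Transformer in Figure 1 is neither a CNN nor an RNN. > With no recurrence or convolution, nothing in the model itself tells it the order of the words.
So, to make use of the order of the sequence, position information is added for each word.
It has the same dimension as the input embedding, so it can be summed with the input embedding vector.
However, I don't quite understand why cos/sin functions are used to build the positional encoding vector.
- Why inject frequency characteristics? (The paper's own hypothesis: for any fixed offset k, PE(pos+k) is a linear function of PE(pos), which should make it easy for the model to attend by relative position.)
- For this part, the blogs https://pozalabs.github.io/transformer/ and http://nlp.seas.harvard.edu/2018/04/03/attention.html look worth reading.
- Item 2 under "Thoughts on the idea" in the blog https://mchromiak.github.io/articles/2017/Sep/12/Transformer-Attention-is-all-you-need/#.W78H9ntLjmF asks: "How actually the positional encoding work? Why they have chosen the sin/cos functions and why the position and dimension are in this relation? Finally how sinusoidal helps translate long sentences?"
- The positional encoding has one value for each dimension i < dim at every word position, which seems to mean that each dimension follows its own sine curve across positions..
- ??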
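
As a partial answer to the questions above, here is a sketch of the sinusoidal encoding from the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), in plain Python. The function names and the sizes used are assumptions for illustration only.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):  # i is the even dimension index (2i in the paper)
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)          # even dimension: sine
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd dimension: cosine
    return pe

def add_positional_encoding(embeddings):
    """Sum the encoding with the embeddings; both have shape seq_len x d_model."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(e_row, p_row)]
            for e_row, p_row in zip(embeddings, pe)]

pe = positional_encoding(max_len=50, d_model=8)
embedded = add_positional_encoding([[0.5] * 8 for _ in range(4)])
```

Each (sin, cos) pair shares one frequency, so moving from position pos to pos + k rotates that pair by a fixed angle that depends only on k. That is the sense in which PE(pos + k) is a linear function of PE(pos), and it may be why each dimension looks like its own sine curve over positions.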

chullhwan-song / Reading-Paper

Attention is all you need #54
