
6.1-Self-Attention #40

Pin-Jiun opened this issue 1 year ago

Pin-Jiun commented 1 year ago

The Self-Attention Mechanism

So far, every input our models have taken can be viewed as a single vector. But what if the input is more complex, such as a sequence, or a set of vectors whose length varies from example to example?

image

For example, if the input is a sentence and the output is its translation, the input/output lengths are not fixed.

image

If each word is represented as a one-hot vector, no relationship exists between any two words, and the input becomes huge. Word embeddings solve the problem of words having no relationship to each other.

Here are a few examples where the input length varies:

For example, taking speech as the input image

For example, a graph, such as a social network image

A molecule is also a very complex input image

Next, consider how the outputs can also differ:

Each input vector has a label

image

e.g. part-of-speech (POS) tagging, or predicting whether each node in a social network will buy a product

A whole sequence of vectors outputs only one label

image

In the last case, the model itself decides how many labels to output, e.g. translation

image

This lecture first focuses on the case where each input vector has a label.


Sequence to Label: each input vector has its own corresponding output label

number of inputs = number of outputs

The first idea is to tackle each vector separately.

image

Run a fully connected network on each word individually. Consider POS tagging on "I saw a saw" (i.e. "I saw a (noun) saw"): if the model is trained on one word at a time, the two occurrences of "saw" must be assigned the same class, which contradicts the expected tags! Clearly the model has no way to decide which label "saw" should get.

image

We are currently using a fully connected (FC) neural network,

so the model is trained on one word at a time and cannot tell nouns from verbs. It should take the relationships between words into account. To let the FC network see some context, we can feed it a window covering part of the sequence.

But this approach has its limits: if the window has to cover the entire sequence, the number of parameters blows up, and the model easily overfits!


The solution: Self-attention

Self-attention takes in the information of an entire sequence and outputs the same number of vectors, taking the whole sequence into account when computing each one. image

For example, given 4 input vectors it outputs 4 vectors, each of which is computed with the entire sequence taken into account.

Self-attention and FC layers can be stacked and alternated many times: self-attention aggregates information across the whole sequence, while the FC layers process the information at each individual position.

Attention Is All You Need! (Google's famous paper) image

Pin-Jiun commented 1 year ago

What is self-attention?

The essence of attention: when you look at a picture, you only pay attention to the parts you are interested in.

For example, when you are looking for a hotel image

If you think that self-attention is similar, the answer is yes! They fundamentally share the same concept and many common mathematical operations.

A self-attention module takes in n inputs and returns n outputs. What happens in this module? In layman’s terms, the self-attention mechanism allows the inputs to interact with each other (“self”) and find out who they should pay more attention to (“attention”). The outputs are aggregates of these interactions and attention scores.

Take the following sentence as an example:

The animal didn’t cross the street because it was too tired

What does "it" refer to: "street" or "animal"? A human can easily tell it is "animal", but for an algorithm it is not that simple.

When the model processes the word "it", attention allows it to associate "it" with "animal". As the model processes each position, attention assigns different weights to the other positions, producing a better encoding of the word at the current position. If you are familiar with RNNs, you know how an RNN encodes the current word from the previous hidden state.

When encoding "it", part of the attention concentrates on "the animal" and merges its representation into the encoding of "it".

Embedding vectors

An important preliminary: the vectors fed into the self-attention module are embedding vectors. image

Words that are similar, belong to the same category, or are otherwise related sit close together in the embedding space, so their relatedness can be measured mathematically (cosine similarity, dot products, and so on).
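As a minimal sketch of that idea (the words and 3-dimensional vectors below are invented for illustration; real embeddings are learned and much higher-dimensional):

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, invented for illustration.
cat = np.array([0.9, 0.1, 0.3])
dog = np.array([0.8, 0.2, 0.4])
car = np.array([0.1, 0.9, 0.7])

def cosine(u, v):
    # Cosine similarity: the dot product of the two normalised vectors.
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cat @ dog, cosine(cat, dog))  # related words score high (~0.86, ~0.98)
print(cat @ car, cosine(cat, car))  # unrelated words score lower (~0.39, ~0.36)
```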


How self-attention works

image

How do we find b1? First, based on a1, we find the vectors in the sequence that are related to a1; the degree of relatedness is represented by α.

But how is this relevance α computed? The illustrations are divided into the following steps:

  1. Prepare inputs
  2. Initialise weights
  3. Derive key, query and value
  4. Calculate attention scores for Input 1
  5. Calculate softmax
  6. Multiply scores with values
  7. Sum weighted values to get Output 1
  8. Repeat steps 4–7 for Input 2 & Input 3

In practice, the mathematical operations are vectorised, i.e. all the inputs undergo the mathematical operations together. We’ll see this later in the Code section.


Q K V

image A library (source) contains many books (values); to make them easy to find, each book is given a catalogue label (key). When we want to learn about Marvel (query), we look at the books related to animation, movies, or even World War II (Captain America).

To be efficient, we do not read every book carefully: for the Marvel query, the animation and movie books are read carefully (high weight), while the WWII ones only get a quick skim (low weight).

Once we have gone through all of them, we have a comprehensive understanding of Marvel.

The key/value/query concept is analogous to retrieval systems. For example, when you search for videos on YouTube, the search engine will map your query (text in the search bar) against a set of keys (video title, description, etc.) associated with candidate videos in their database, then present you the best matched videos (values).

https://stats.stackexchange.com/questions/421935/what-exactly-are-keys-queries-and-values-in-attention-mechanisms


Step 1: Prepare inputs

We start with 3 inputs for this tutorial, each with dimension 4.

image

Input 1: [1, 0, 1, 0] 
Input 2: [0, 2, 0, 2]
Input 3: [1, 1, 1, 1]
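As a running sketch in NumPy (the array name `x` is my own choice), the three inputs can be stacked into one matrix:

```python
import numpy as np

# Three inputs of dimension 4, stacked as the rows of a 3x4 matrix.
x = np.array([
    [1., 0., 1., 0.],  # Input 1
    [0., 2., 0., 2.],  # Input 2
    [1., 1., 1., 1.],  # Input 3
])
```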

Step 2: Initialise weights

Every input must have three representations (see diagram below).

image

These representations are called key (orange), query (red), and value (purple). For this example, let's say we want these representations to have a dimension of 3. Because every input has a dimension of 4, each set of weights must have a shape of 4×3.

Note We’ll see later that the dimension of value is also the output dimension.

To obtain these representations, every input (green) is multiplied with a set of weights for keys, a set of weights for querys (I know that’s not the correct spelling), and a set of weights for values. In our example, we initialise the three sets of weights as follows.

Weights for key:
[[0, 0, 1],
 [1, 1, 0],
 [0, 1, 0],
 [1, 1, 0]]

Weights for query:
[[1, 0, 1],
 [1, 0, 0],
 [0, 0, 1],
 [0, 1, 1]]

Weights for value:
[[0, 2, 0],
 [0, 3, 0],
 [1, 0, 3],
 [1, 1, 0]]

Notes In a neural network setting, these weights are usually small numbers, initialised randomly using an appropriate random distribution like Gaussian, Xavier and Kaiming distributions. This initialisation is done once before training.

So we first randomly initialise the Q/K/V weights, then match queries against keys to find the corresponding values, and train repeatedly to obtain the best Q/K/V?
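In the running sketch, these are just three 4×3 matrices (the names `w_key`, `w_query` and `w_value` are my own choice; in a real network they would be randomly initialised trainable parameters rather than hand-picked):

```python
# The three sets of weights from above, hard-coded for the walkthrough.
w_key = np.array([
    [0., 0., 1.],
    [1., 1., 0.],
    [0., 1., 0.],
    [1., 1., 0.],
])
w_query = np.array([
    [1., 0., 1.],
    [1., 0., 0.],
    [0., 0., 1.],
    [0., 1., 1.],
])
w_value = np.array([
    [0., 2., 0.],
    [0., 3., 0.],
    [1., 0., 3.],
    [1., 1., 0.],
])
```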

Step 3: Derive key, query and value

Now that we have the three sets of weights, let’s obtain the key, query and value representations for every input.

Key representation for Input 1: image

Use the same set of weights to get the key representation for Input 2: image

A faster way is to vectorise the above operations: image

Let’s do the same to obtain the value representations for every input: image

and finally the query representations: image

image

Notes In practice, a bias vector may be added to the product of matrix multiplication. image
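In the sketch, the vectorised form of this whole step is one matrix multiplication per representation (no bias term here, matching the example above):

```python
keys    = x @ w_key      # [[0, 1, 1], [4, 4, 0], [2, 3, 1]]
queries = x @ w_query    # [[1, 0, 2], [2, 2, 2], [2, 1, 3]]
values  = x @ w_value    # [[1, 2, 3], [2, 8, 0], [2, 6, 3]]
```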

Step 4: Calculate attention scores for Input 1

image

To obtain attention scores, we start with taking a dot product between Input 1’s query (red) with all keys (orange), including itself. Since there are 3 key representations (because we have 3 inputs), we obtain 3 attention scores (blue).

image

Notice that we only use the query from Input 1. Later we’ll work on repeating this same step for the other querys.

Note The above operation is known as dot product attention, one of the several score functions. Other score functions include scaled dot product and additive/concat.

image
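In the sketch, Input 1's scores are the dot products of its query with all three keys; doing every query at once is a single matrix product. The scaled dot product variant mentioned in the note simply divides these scores by the square root of the key dimension before the softmax.

```python
scores_1 = queries[0] @ keys.T   # [2., 4., 4.]: Input 1's three scores

# Vectorised: all queries against all keys at once;
# row i holds the attention scores for Input i.
scores = queries @ keys.T
# Scaled dot product variant: scores / np.sqrt(keys.shape[-1])
```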

Step 5: Calculate softmax

image

image

Note that we round off to 1 decimal place here for readability.
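A numerically stable row-wise softmax for the sketch might look like this:

```python
def softmax(s):
    # Subtracting each row's max before exponentiating avoids overflow
    # without changing the result.
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

attn = softmax(scores)
# attn[0] is about [0.06, 0.47, 0.47], i.e. [0.0, 0.5, 0.5]
# rounded to 1 decimal place as in the figures.
```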

Step 6: Multiply scores with values

image

The softmaxed attention score for each input (blue) is multiplied by that input's value (purple). This results in 3 alignment vectors (yellow). In this tutorial, we'll refer to them as weighted values.

1: 0.0 * [1, 2, 3] = [0.0, 0.0, 0.0]
2: 0.5 * [2, 8, 0] = [1.0, 4.0, 0.0]
3: 0.5 * [2, 6, 3] = [1.0, 3.0, 1.5]
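In the sketch this is a broadcasted element-wise product: each value vector is scaled by Input 1's softmaxed score for it. The lines above use the rounded scores [0.0, 0.5, 0.5]; the unrounded scores give slightly different numbers.

```python
# Scale each of the 3 value vectors by Input 1's attention weight for it.
weighted_values_1 = attn[0][:, None] * values   # shape (3, 3), the yellow rows
```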

Step 7: Sum weighted values to get Output 1

image

Take all the weighted values (yellow) and sum them element-wise:

  [0.0, 0.0, 0.0]
+ [1.0, 4.0, 0.0]
+ [1.0, 3.0, 1.5]
-----------------
= [2.0, 7.0, 1.5]

The resulting vector [2.0, 7.0, 1.5] (dark green) is Output 1, which is based on the query representation from Input 1 interacting with all other keys, including itself.
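In the sketch, this is a sum over the rows of the weighted values; notice that Steps 6 and 7 together collapse into a single matrix product, which is how it is done in practice:

```python
output_1 = weighted_values_1.sum(axis=0)
# About [1.94, 6.68, 1.60] with unrounded scores;
# [2.0, 7.0, 1.5] when the scores are rounded to 1 decimal place as above.

# Steps 6 and 7 for all inputs at once: a single matrix product.
outputs = attn @ values   # row i is Output i
```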

Step 8: Repeat for Input 2 & Input 3

Now that we're done with Output 1, we repeat Steps 4 to 7 for Output 2 and Output 3. I trust that I can leave you to work out the operations yourself.

image

Notes The dimension of query and key must always be the same because of the dot product score function. However, the dimension of value may be different from query and key. The resulting output will consequently follow the dimension of value.
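Putting the whole walkthrough together, here is the complete sketch vectorised over all inputs, under the same assumptions as above (plain dot product scores, no bias, no scaling):

```python
import numpy as np

def self_attention(x, w_key, w_query, w_value):
    """Plain dot product self-attention over a whole sequence at once."""
    keys, queries, values = x @ w_key, x @ w_query, x @ w_value
    scores = queries @ keys.T                         # (n, n) attention scores
    e = np.exp(scores - scores.max(-1, keepdims=True))
    attn = e / e.sum(-1, keepdims=True)               # row-wise softmax
    return attn @ values                              # (n, d_value) outputs

outputs = self_attention(x, w_key, w_query, w_value)
# Row 0 reproduces Output 1; rows 1 and 2 are Output 2 and Output 3.
```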

https://medium.com/%E7%A8%8B%E5%BC%8F%E5%B7%A5%E4%BD%9C%E7%B4%A1/autoencoder-%E4%BA%8C-rnn-lstm-seq2seq-attention-226bc239dfea

https://towardsdatascience.com/transformers-explained-visually-part-1-overview-of-functionality-95a6dd460452

https://medium.com/deeper-learning/glossary-of-deep-learning-word-embedding-f90c3cec34ca

https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a

Pin-Jiun commented 1 year ago

image