Concept of Semantic Embedding

Abstract

This paper introduces two methods for computing continuous vector representations of words and compares them with the previously best-performing techniques.

1. Introduction

Many existing NLP systems and techniques treat words as atomic units, represented as indices in a vocabulary. The N-gram model is one example. As a result, there is no notion of similarity between words.

However, this approach depends heavily on the size of the high-quality training data available. Existing corpora for many languages contained only around a billion words, so progress was limited.

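Recently (as of 2013, when the paper was written), advances in machine learning techniques have made it possible to train more complex models on much larger data sets, and these models outperform the simple models mentioned above. Probably the most successful concept is 'distributed representations of words'. Neural network based language models are one example.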

1.1 Goals of the Paper

The main goal of the paper is to introduce techniques for learning high-quality word vectors from huge data sets. As far as the authors know, none of the previously proposed architectures had been successfully trained on more than a few hundred million words with word vectors of dimensionality 50-100.

We use recently proposed techniques that measure not only how similar two words are, but also multiple kinds of similarity between them.

Using a word offset technique where simple algebraic operations are performed on the word vectors, it was shown for example that vector("King") - vector("Man") + vector("Woman") results in a vector that is closest to the vector representation of the word Queen.
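The word offset idea can be sketched with a few hand-made vectors. The three-dimensional vectors below are invented for illustration and are not learned embeddings. The `analogy` helper is a hypothetical name. Excluding the three question words from the candidates is an assumption borrowed from common practice.

```python
import math

# Toy 3-dimensional vectors, hand-made for illustration only (not learned).
# The axes loosely stand for: royalty, maleness, femaleness.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vector(b) - vector(a) + vector(c).

    The question words a, b, c are excluded from the candidates
    (an assumption, not something stated in these notes).
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("man", "king", "woman", vectors)
print(result)
```

With real embeddings the same arithmetic is applied to vectors with hundreds of dimensions and a vocabulary of many thousands of words.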

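In summary, the paper will:

1) Improve the accuracy of these vector operations by developing new model architectures that preserve the linear regularities among words.
2) Design a new test set for measuring both syntactic and semantic regularities.
3) Show how training time and accuracy depend on the dimensionality of the word vectors and on the amount of training data.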

1.2 Previous Work

1) NNLM. 2) Word vectors trained on different corpora were evaluated.

2. Model Architectures

Many different models have been proposed for estimating continuous representations of words, such as LSA and LDA. This paper focuses on distributed representations of words learned by neural networks, because they preserve linear regularities better than LSA and are computationally cheaper than LDA.

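To compare the different models, we proceed in two steps.

1) Measure computational complexity as the number of parameters that must be accessed to fully train the model.
2) For each model, try to maximize accuracy while minimizing computational complexity.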

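All the models we present have a training complexity proportional to O = E × T × Q, where E is the number of training epochs, T is the number of words in the training set, and Q is defined separately for each model architecture.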

2.1 Feedforward Neural Net Language Model (NNLM)

This model consists of four layers: input, projection, hidden, and output. The N previous words are fed in using 1-of-V coding, where V is the size of the vocabulary.

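The projection layer P has dimensionality N × D, where N is the number of words fed in as input at each step and D is the dimensionality of the word representation. The hidden layer has dimensionality 1 × H, and the output layer has dimensionality 1 × V.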
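A minimal sketch of how 1-of-V inputs fill the N × D projection layer, assuming a tiny hypothetical vocabulary and a made-up projection matrix. The names `one_hot` and `project` are illustrative helpers. Multiplying a one-hot vector by the matrix is equivalent to looking up a single row.

```python
# Hypothetical tiny sizes for illustration: V = 5 words, D = 3 dimensions.
vocab = ["the", "cat", "sat", "on", "mat"]
V, D = len(vocab), 3

# Projection matrix with one D-dimensional row per word (made-up values).
projection = [[float(row * D + col) for col in range(D)] for row in range(V)]

def one_hot(index, size):
    """1-of-V coding: a vector of zeros with a single 1 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def project(one_hot_vec, matrix):
    """Multiply a 1 x V one-hot row vector by the V x D projection matrix."""
    return [sum(x * matrix[i][j] for i, x in enumerate(one_hot_vec))
            for j in range(len(matrix[0]))]

context = ["the", "cat"]  # N = 2 previous words
P = []
for word in context:
    idx = vocab.index(word)
    P.extend(project(one_hot(idx, V), projection))

# The product with a one-hot vector is just a row lookup.
assert P == projection[0] + projection[1]
print(len(P))  # length is N * D
```

This is why the projection step costs only N × D: no real matrix multiplication is needed, only N row lookups.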

The computational complexity per training example is Q = N × D + N × D × H + H × V. The dominating term here is H × V. In practical solutions, however, hierarchical versions of the softmax reduce this term to H × log2(V), so the dominating term becomes the second one, N × D × H.
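The shift in the dominating term can be checked numerically. The sizes below are illustrative assumptions, not values taken from these notes, and `nnlm_complexity` is a hypothetical helper that simply evaluates the three terms of Q.

```python
import math

def nnlm_complexity(N, D, H, V, hierarchical=False):
    """Return the three terms of Q = N*D + N*D*H + H*V.

    With hierarchical=True the output term becomes H * log2(V).
    """
    output_term = H * math.log2(V) if hierarchical else H * V
    return {"projection": N * D, "hidden": N * D * H, "output": output_term}

# Illustrative sizes (assumptions chosen only to show the effect).
full = nnlm_complexity(N=10, D=100, H=500, V=1_000_000)
hier = nnlm_complexity(N=10, D=100, H=500, V=1_000_000, hierarchical=True)

print(max(full, key=full.get))  # largest term with a full softmax
print(max(hier, key=hier.get))  # largest term with a hierarchical softmax
```

With these sizes the output term drops from 5 × 10^8 to roughly 10^4, leaving the N × D × H term as the bottleneck.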

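Because the models in the paper use a hierarchical softmax in which the vocabulary is represented as a Huffman binary tree, the output cost can in fact be brought down to roughly log2(Unigram_perplexity(V)). This does not reduce the N × D × H term, but that does not matter because the proposed models have no hidden layer anyway.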
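To see why a Huffman tree helps, the sketch below builds Huffman code lengths from word counts. The counts are made up, and `huffman_code_lengths` is a hypothetical helper, not code from the paper. Frequent words end up near the root, so the average path length weighted by frequency falls below that of a balanced tree.

```python
import heapq
from itertools import count

def huffman_code_lengths(frequencies):
    """Return {word: code length in bits} for a Huffman tree built from counts."""
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(freq, next(tie), {word: 0}) for word, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every word in them one level deeper.
        merged = {w: depth + 1 for w, depth in {**left, **right}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

# Made-up word counts for illustration.
counts = {"the": 50, "of": 25, "cat": 12, "sat": 8, "zebra": 5}
lengths = huffman_code_lengths(counts)
average = sum(counts[w] * lengths[w] for w in counts) / sum(counts.values())
print(lengths, average)
```

In a hierarchical softmax the code length of a word is the number of binary decisions evaluated for it, so shorter codes for frequent words mean fewer output units evaluated on average.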

2.2 Recurrent Neural Net Language Model (RNNLM)

The RNN based language model addresses limitations of the NNLM, which can only take a context of fixed length as input. In addition, RNNs can efficiently represent more complex patterns than shallow neural networks.

The RNN model has no projection layer and only three layers: input, hidden, and output. One characteristic of this model is that the recurrent matrix acts as a time-delayed connection, so information from the past can be represented in the hidden layer state.
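The role of the recurrent matrix can be sketched with a single update step. The weights are tiny made-up values, the sigmoid activation is an assumption, and the output layer is omitted. `rnn_step` and `run` are hypothetical helper names.

```python
import math

def rnn_step(word_index, prev_hidden, W_in, W_rec):
    """Compute the new hidden state from the current word and the previous state."""
    H = len(prev_hidden)
    new_hidden = []
    for j in range(H):
        # A 1-of-V input selects row `word_index` of W_in;
        # W_rec feeds the previous hidden state back in (time-delayed connection).
        total = W_in[word_index][j] + sum(
            prev_hidden[k] * W_rec[k][j] for k in range(H))
        new_hidden.append(1.0 / (1.0 + math.exp(-total)))  # sigmoid (assumed)
    return new_hidden

# Tiny made-up weights: V = 3 words, H = 2 hidden units.
W_in = [[0.5, -0.5], [1.0, 0.0], [-1.0, 1.0]]
W_rec = [[0.8, 0.1], [-0.2, 0.6]]

def run(sequence):
    """Feed a sequence of word indices through the recurrent layer."""
    hidden = [0.0, 0.0]
    for idx in sequence:
        hidden = rnn_step(idx, hidden, W_in, W_rec)
    return hidden

# Same final word, different history: the hidden states differ.
h_a = run([0, 2])
h_b = run([1, 2])
print(h_a, h_b)
```

Because the hidden state depends on the whole history, the model is not restricted to a fixed context length.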

The complexity per training example is Q = H × H + H × V, where H is the dimensionality of the hidden layer and the word representation dimensionality D is equal to H.

Here too, H × V can be reduced to H × log2(V), so the dominating term becomes the first one, H × H.

2.3 Parallel Training of Neural Networks

We used a large-scale distributed framework called DistBelief, with mini-batch asynchronous gradient descent for parallel training and an adaptive learning rate procedure (Adagrad).
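As a rough illustration of the Adagrad part only, here is a single-process sketch. It does not model DistBelief or asynchronous updates. The quadratic objective, the learning rate, and the helper name `adagrad_update` are assumptions made for the example.

```python
import math

def adagrad_update(params, grads, accum, lr=0.1, eps=1e-8):
    """Apply one Adagrad step in place.

    Each parameter is scaled by the square root of its own accumulated
    squared gradients, giving it an individual effective learning rate.
    """
    for i, g in enumerate(grads):
        accum[i] += g * g  # running sum of squared gradients
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)
    return params

# Minimise f(x, y) = x^2 + 10*y^2 as a stand-in objective (an assumption).
params = [1.0, 1.0]
accum = [0.0, 0.0]
start_loss = params[0] ** 2 + 10 * params[1] ** 2
for _ in range(200):
    grads = [2 * params[0], 20 * params[1]]
    adagrad_update(params, grads, accum)
end_loss = params[0] ** 2 + 10 * params[1] ** 2
print(start_loss, end_loss)
```

Even though the gradient in y is ten times larger than in x, both coordinates move at a similar pace, which is the per-parameter scaling that makes Adagrad attractive for sparse word updates.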

3. New Log-linear Models

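This section describes two model architectures. In the previous section, the dominating term of the complexity was caused by the non-linear hidden layer. We introduce simpler models that may not represent the data as precisely as neural networks, but that can be trained efficiently on much more data.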

The earlier models this work builds on are trained in two steps. 1) Continuous word vectors are learned using a simple model. 2) The N-gram NNLM is then trained on top of these vectors.

3.1 Continuous Bag-of-Words Model

[Paper Figure 1, left panel]

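One characteristic of this model is that the order of the words does not matter. It also uses words from the future. Using four history words and four future words as input gave the best performance. The training complexity is as follows.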

Q = N × D + D × log2(V)
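A minimal sketch of a CBOW forward pass, under a few assumptions: the weight matrices `W_in` and `W_out` are random toy values, the sizes are arbitrary, and a full softmax is used for clarity where the paper's models use a hierarchical softmax. The helper name `cbow_forward` is hypothetical.

```python
import math
import random

def cbow_forward(context_ids, W_in, W_out):
    """Average the context word vectors, then score every vocabulary word."""
    D = len(W_in[0])
    # Shared projection: all context words are averaged into one
    # D-dimensional vector, so their order cannot affect the result.
    hidden = [sum(W_in[i][d] for i in context_ids) / len(context_ids)
              for d in range(D)]
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in W_out]
    # Full softmax for clarity (swapped in for the hierarchical softmax).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
V, D = 6, 4  # toy sizes chosen for illustration
W_in = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]

probs = cbow_forward([0, 1, 3, 4], W_in, W_out)
shuffled = cbow_forward([4, 3, 1, 0], W_in, W_out)
print(probs)
```

Feeding the same context words in a different order gives the same distribution up to floating-point rounding, which is the "bag-of-words" property.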

3.2 Continuous Skip-gram Model

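[Paper Figure 1, right panel] Increasing the range improves the quality of the word vectors but also increases the computational cost. Because distant words are usually less related to the current word, they are sampled less often.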
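One way to sample distant words less often is to draw a random window size R between 1 and a maximum distance C for each position, and use only the R words on each side. The sketch below follows that scheme. The function name `skipgram_pairs`, the sample sentence, and the choice C = 3 are illustrative assumptions.

```python
import random

def skipgram_pairs(tokens, C, rng):
    """Yield (current word, context word) pairs with a randomly shrunk window."""
    for pos, center in enumerate(tokens):
        R = rng.randint(1, C)  # window size for this position, 1 <= R <= C
        for offset in range(-R, R + 1):
            ctx = pos + offset
            if offset != 0 and 0 <= ctx < len(tokens):
                yield center, tokens[ctx]

tokens = "the quick brown fox jumps over the lazy dog".split()
rng = random.Random(0)
pairs = list(skipgram_pairs(tokens, C=3, rng=rng))
print(pairs[:5])
```

Under this scheme a word at distance d from the current word is used with probability (C - d + 1) / C, so adjacent words are always used and the farthest ones only once in C draws on average.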

The training complexity of this architecture is proportional to Q = C × (D + D × log2(V)), where C is the maximum distance of the words.
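The four per-example complexities given in these notes can be compared side by side. The sizes below are illustrative assumptions rather than the paper's experimental settings, and the hierarchical softmax form H × log2(V) is used for the NNLM and RNNLM output terms.

```python
import math

def q_nnlm(N, D, H, V):
    return N * D + N * D * H + H * math.log2(V)

def q_rnnlm(H, V):
    return H * H + H * math.log2(V)

def q_cbow(N, D, V):
    return N * D + D * math.log2(V)

def q_skipgram(C, D, V):
    return C * (D + D * math.log2(V))

# Illustrative sizes only (assumptions chosen for the comparison).
N, D, H, V, C = 8, 300, 500, 1_000_000, 5
costs = {
    "NNLM": q_nnlm(N, D, H, V),
    "RNNLM": q_rnnlm(H, V),
    "CBOW": q_cbow(N, D, V),
    "Skip-gram": q_skipgram(C, D, V),
}
for name, q in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{name:10s} Q ~ {q:,.0f}")
```

With these sizes the two log-linear models are one to two orders of magnitude cheaper per training example than the models with a hidden layer, which is what allows them to be trained on far more data.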

4. Results

We can ask: "What is the word that is similar to small in the same sense as biggest is similar to big?" Such questions can be answered by performing simple algebraic operations with the vector representation of words: we can simply compute vector X = vector("biggest") − vector("big") + vector("small"). The word closest to X, measured by cosine distance, is then taken as the answer.

4.1 Task Description

We define a comprehensive test set that contains five types of semantic questions and nine types of syntactic questions. Two examples from each category are shown in Table 1. Overall, there are 8869 semantic and 10675 syntactic questions.

[Paper Table 1]

A question is assumed to be correctly answered only if the closest word to the vector computed using the above method is exactly the same as the correct word in the question; synonyms are thus counted as mistakes.
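The exact-match rule can be sketched as a small scoring function. The `predict` function below is a hypothetical stand-in that returns canned answers in place of a real nearest-neighbour search over word vectors, and the three questions are made up for illustration.

```python
def analogy_accuracy(questions, predict):
    """Fraction of questions whose predicted word exactly equals the expected word."""
    correct = sum(1 for a, b, c, expected in questions
                  if predict(a, b, c) == expected)
    return correct / len(questions)

# Canned answers standing in for a nearest-neighbour search (hypothetical).
canned = {
    ("big", "biggest", "small"): "smallest",
    ("man", "king", "woman"): "queen",
    ("good", "better", "bad"): "poorer",  # near-synonym, still counted as wrong
}

def predict(a, b, c):
    return canned[(a, b, c)]

questions = [
    ("big", "biggest", "small", "smallest"),
    ("man", "king", "woman", "queen"),
    ("good", "better", "bad", "worse"),
]
accuracy = analogy_accuracy(questions, predict)
print(accuracy)
```

Because only exact matches count, the reported accuracy is a conservative measure: a model that returns a reasonable synonym gets no credit.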

4.2 Maximization of Accuracy

[Paper Table 2] It can be seen that after some point, adding more dimensions or adding more training data provides diminishing improvements.

4.3 Comparison of Model Architectures

[Paper Table 3]

[Paper Table 5]

Training a model on twice as much data using one epoch gives comparable or better results than iterating over the same data for three epochs, as is shown in Table 5, and provides additional small speedup.

4.4 Large Scale Parallel Training of Models

[Paper Table 6] Note that due to the overhead of the distributed framework, the CPU usage of the CBOW model and the Skip-gram model are much closer to each other than their single-machine implementations.

4.5 Microsoft Research Sentence Completion Challenge

[Paper Table 7]

6. Conclusion

1) We studied the quality of vector representations of words derived by various models on a collection of syntactic and semantic language tasks.

2) We observed that it is possible to train high quality word vectors using very simple model architectures, compared to the popular neural network models (both feedforward and recurrent).

3) Using the DistBelief distributed framework, it should be possible to train the CBOW and Skip-gram models even on corpora with one trillion words.

4) These word vectors are expected to be useful in many application areas.
