JudePark96 commented 4 years ago

Paper	Exclusive Hierarchical Decoding for Deep Keyphrase Generation
Authors	Wang Chen, Hou Pong Chan, Piji Li, Irwin King
Link	https://arxiv.org/abs/2004.08511
Venue	ACL 2020

1. What does the abstract say?

Recent approaches require the model not only to predict keyphrases but also to decide how many keyphrases to generate. These approaches rely on a sequential decoding process. However, such a process ignores the intrinsic hierarchical compositionality of a keyphrase set. In addition, previous approaches tend to generate duplicated keyphrases, which wastes computing resources and time.

To overcome these problems, the paper proposes an exclusive hierarchical decoding framework that includes a hierarchical decoding process and either a soft or a hard exclusion mechanism.

2. What are the main contributions?

As noted above, generating keyphrases with a sequential decoding method has two problems.

Ignoring intrinsic hierarchical compositionality -> a keyphrase set consists of multiple keyphrases, and each keyphrase consists of multiple words.
Generating duplicated keyphrases -> duplicates can be removed with post-processing rules, but doing so wastes time and computing resources.

To overcome these problems, the paper proposes a novel exclusive hierarchical decoding framework.
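The two problems can be made concrete with a small sketch. The example phrases, the `;` separator token, and the helper names `flatten` and `deduplicate` are assumptions for illustration, not taken from the paper.

```python
# Hypothetical keyphrase set: a list of keyphrases, each a list of words.
keyphrase_set = [
    ["keyphrase", "generation"],
    ["hierarchical", "decoding"],
    ["keyphrase", "generation"],  # duplicate produced by the decoder
]


def flatten(phrases, sep=";"):
    """Flatten the two-level structure into one token sequence,
    as a sequential decoder would emit it (separator is an assumption)."""
    tokens = []
    for k, phrase in enumerate(phrases):
        if k > 0:
            tokens.append(sep)
        tokens.extend(phrase)
    return tokens


def deduplicate(phrases):
    """Post-processing rule: keep only the first occurrence of each phrase."""
    seen = set()
    unique = []
    for phrase in phrases:
        key = tuple(phrase)
        if key not in seen:
            seen.add(key)
            unique.append(phrase)
    return unique


flat = flatten(keyphrase_set)
unique = deduplicate(keyphrase_set)
```

The flattened sequence loses the explicit phrase/word hierarchy, and the duplicate is only removed after the decoder has already spent steps generating it.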

Methodology

Figure 2: Illustration of exclusive hierarchical decoding, where PD is phrase-level decoding and WD is word-level decoding. $h_i$ is the hidden state of the $i$-th PD step. $h_{i,j}$ is the hidden state of the $j$-th WD step. The [neopd] token means that PD has not ended. The [eowd] token means that WD terminates. The [eopd] token means that PD has ended and the whole decoding process is finished. $[m_1, \ldots, m_{l_x}]$ are the hidden states encoded from the document. PD-Attention and WD-Attention are the attention mechanisms used in PD and WD, respectively. $\beta_i$ is the PD attention score at the $i$-th step. $\hat{h}_{i,j}$ is the WD attentional vector. EL/ES indicates that either the exclusive loss or the exclusive search is used.

Sequential Encoder

To obtain context-aware representations, a two-layer bidirectional GRU is used as the encoder.

Many keyphrase generation papers use a GRU as the encoder.

Phrase-level Decoder

The phrase-level decoder uses a unidirectional GRU.

[Image: equation for the phrase-level decoder GRU update]

$\widetilde{h}_{i-1,\text{end}}$ is the attentional vector produced by the WD steps of the $(i-1)$-th PD step. According to the equation, this WD output is fed into the phrase-level decoder to compute its next state, and the process repeats recursively. From this representation and the encoder representations, the PD attention score is computed with the equations below.

[Image: equations (2) and (3), PD-Attention]

In eq. (3), $h_i$ is the PD hidden state, $W_1$ is a parameter matrix, and $m_n$ is an encoder representation. The score is a bilinear transformation, and the attention score is obtained by applying a softmax as in eq. (2).
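The description above (bilinear score followed by a softmax over encoder positions) can be sketched in plain Python. This is a minimal sketch under assumptions: the function names, the toy vectors, and the identity matrix used for $W_1$ are all made up for illustration, and the paper's exact equations may differ in detail.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bilinear_score(h, W, m):
    """Compute h^T W m for vectors h, m and matrix W (lists of floats)."""
    Wm = [sum(W[r][c] * m[c] for c in range(len(m))) for r in range(len(W))]
    return sum(h[r] * Wm[r] for r in range(len(h)))


def pd_attention(h_i, W1, memory):
    """PD attention: bilinear score against each encoder state, then softmax."""
    scores = [bilinear_score(h_i, W1, m_n) for m_n in memory]
    return softmax(scores)


# Toy values (assumptions): 2-dimensional states, identity W1, 3 encoder positions.
h_i = [1.0, 0.0]
W1 = [[1.0, 0.0], [0.0, 1.0]]
memory = [[2.0, 0.0], [0.0, 2.0], [2.0, 1.0]]
beta_i = pd_attention(h_i, W1, memory)
```

With these toy values the first and third encoder positions receive equal weight and the second receives less, because only the first component of each $m_n$ matches $h_i$.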

Word-level Decoder

[Image: equation for the word-level decoder GRU update]

$i$ denotes the PD step and $j-1$ the previous WD step. The GRU takes $h_{i,j-1}$ as its previous state and computes $h_{i,j}$.

[Image: equations for WD-Attention]

The notable point is that the WD attention scores are scaled by the PD attention scores.
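One way to picture this scaling is the sketch below. It assumes the scaling is an element-wise product of the PD weights and the softmaxed WD scores followed by renormalization; the exact form in the paper may differ, and the function name `rescaled_wd_attention` and the toy numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rescaled_wd_attention(wd_scores, beta_i):
    """Scale word-level attention by phrase-level attention, then renormalize.

    Assumption: element-wise product followed by normalization to sum to 1.
    """
    alpha = softmax(wd_scores)
    weighted = [b * a for b, a in zip(beta_i, alpha)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Toy values (assumptions): PD attention prefers position 0,
# while the raw WD scores are uniform.
beta_i = [0.7, 0.2, 0.1]
wd_scores = [0.0, 0.0, 0.0]
alpha_ij = rescaled_wd_attention(wd_scores, beta_i)
```

When the raw WD scores are uniform, the rescaled distribution simply follows the PD attention, which shows how the phrase-level focus steers each word-level step.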

[Image: equation for the WD attentional vector]

Decoding is then performed from the resulting hidden state, using a copy mechanism.

The WD process terminates when the [eowd] token is produced. Hierarchical decoding terminates when the [eopd] token is produced.
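The termination rules can be sketched as a two-level loop. This is only a control-flow illustration under assumptions: the word decoder is replaced by a hypothetical scripted stub, [neopd] is assumed to be emitted as the first token of each non-final phrase, and the limits `max_phrases` and `max_words` are made up.

```python
NEOPD, EOWD, EOPD = "[neopd]", "[eowd]", "[eopd]"


def make_scripted_word_decoder(script):
    """Hypothetical stand-in for the word-level decoder: replays fixed tokens."""
    stream = iter(script)

    def next_token(pd_step, wd_step):
        return next(stream)

    return next_token


def hierarchical_decode(next_token, max_phrases=10, max_words=6):
    """Outer loop = PD steps (one per keyphrase); inner loop = WD steps (words)."""
    keyphrases = []
    for i in range(max_phrases):
        words = []
        finished = False
        for j in range(max_words):
            token = next_token(i, j)
            if token == EOPD:      # whole decoding process is finished
                finished = True
                break
            if token == EOWD:      # current keyphrase is complete
                break
            if token != NEOPD:     # [neopd] only signals that PD continues
                words.append(token)
        if finished:
            break
        if words:
            keyphrases.append(words)
    return keyphrases


script = [
    NEOPD, "keyphrase", "generation", EOWD,
    NEOPD, "hierarchical", "decoding", EOWD,
    EOPD,
]
result = hierarchical_decode(make_scripted_word_decoder(script))
```

In a real model the stub would be replaced by the GRU-based word-level decoder with attention and copying; only the stopping logic is shown here.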

3. How does it differ from previous approaches?

Previous approaches, such as One Size Does Not Fit All: Generating and Evaluating Variable Number of Keyphrases, mostly contributed to the decoding process. The orthogonal regularization and semantic coverage contributions of that paper also concern the decoding process. Even so, they share the limitation that decoding still proceeds as a sequential process.

I think the main contribution of this paper is that, to overcome this limitation, it carries out decoding as a hierarchical process.

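4. What could be proposed?

Looking at this paper and the previous approaches, the contributions are mainly about the decoding process. I think the following proposals would offer some differentiation:

Replacing the encoder layer (e.g., with BERT)
Evaluation metric
Enriching features with the document title
Multi-task learning across datasets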

5. Which paper should I read next?

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
