cl-tohoku / showcase_miyawaki


Multi-Task Semantic Dependency Parsing with Policy Gradient for Learning Easy-First Strategies #5

Closed smiyawaki0820 closed 3 years ago

smiyawaki0820 commented 4 years ago

1. What is it?

(Task)

(Proposal)

(Results)

2. What's novel compared to prior work?

IPS algorithm

graph-based

transition-based

(Proposed method) graph-based + transition-based

3. What is the key to the technique?

IPS algorithm

IPS model 🤔

Rewards of Policy Gradient

multi-task learning

Several linguistic formalisms exist for SDP, with points of overlap / synergy between them

4. How was effectiveness validated?

Ablations validate the effectiveness of multi-task learning and reinforcement learning

main results

Arc length distributions for RL

5. Any discussion?

6. Which papers to read next?

smiyawaki0820 commented 4 years ago

paper

abstract

In Semantic Dependency Parsing (SDP), semantic relations form directed acyclic graphs, rather than trees. We propose a new iterative predicate selection (IPS) algorithm for SDP. Our IPS algorithm combines the graph-based and transition-based parsing approaches in order to handle multiple semantic head words. We train the IPS model using a combination of multi-task learning and task-specific policy gradient training. Trained this way, IPS achieves a new state of the art on the SemEval 2015 Task 18 datasets. Furthermore, we observe that policy gradient training learns an easy-first strategy.

task: SDP ... semantic relations form directed acyclic graphs (rather than trees)

bib

@inproceedings{kurita-sogaard-2019-multi,
    author = {Kurita, Shuhei and S{\o}gaard, Anders},
    title = {Multi-Task Semantic Dependency Parsing with Policy Gradient for Learning Easy-First Strategies},
    booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
    year = {2019},
    publisher = {Association for Computational Linguistics},
    url = {https://www.aclweb.org/anthology/P19-1232},
    doi = {10.18653/v1/P19-1232},
    pages = {2420--2430},
}
smiyawaki0820 commented 4 years ago

paper

abstract

task: Semantic Dependency Parsing(SDP)

propose: Iterative Predicate Selection (IPS) algorithm

train IPS model

Result & Analysis

smiyawaki0820 commented 4 years ago

1. Introduction

SDP

IPS algorithms

| parser | scoring | notes | characteristics |
| --- | --- | --- | --- |
| transition-based | over transitions between states | builds the dep-graph incrementally | error propagation (unsuitable for long dependencies) |
| graph-based | over all edges | adopts a tree decoding algorithm | |

Contributions

  1. A new SDP parsing algorithm that integrates the transition-based and graph-based approaches
  2. Multi-task learning of this parsing algorithm outperforms single-task learning
  3. Task-specific policy gradient fine-tuning further improves the model
  4. SOTA on three formalisms
  5. Policy gradient fine-tuning learns along an easy-first strategy
smiyawaki0820 commented 4 years ago

Related Work

transition-based parsing algorithms

graph-based parsing algorithms

transition-based parsers w/ reinforcement learning

Zhang and Chan, 2009

Fried and Klein, 2018

Lee et al., 2018

smiyawaki0820 commented 4 years ago

Model

Iterative Predicate Selection (IPS)

(Proposal) a new SDP algorithm based on the head-selection algorithm (Zhang et al., 2017)

Proposed algorithm

how to create semantic dependency arcs

  1. For each word w_i, select a head arc from the candidate transitions T_i^τ
  2. Update the partial semantic dep-graph
  3. If all words selected NULL, terminate; otherwise go to step 1
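The three-step loop above can be sketched as follows (a minimal sketch; `ips_parse` and `score_transitions` are hypothetical names, and the toy scorer stands in for the paper's neural scoring model):

```python
NULL = None  # the "select no further head" transition

def ips_parse(words, score_transitions):
    """Minimal Iterative Predicate Selection loop: each word repeatedly
    selects a head arc (or NULL) until every word selects NULL."""
    graph = {i: set() for i in range(len(words))}  # word index -> set of heads
    while True:
        chosen = []
        for i in range(len(words)):
            # Candidate transitions T_i: NULL plus heads not yet attached to w_i.
            candidates = [NULL] + [j for j in range(len(words))
                                   if j != i and j not in graph[i]]
            t = score_transitions(i, candidates, graph)  # step 1: pick a head arc
            chosen.append(t)
            if t is not NULL:
                graph[i].add(t)                          # step 2: update partial graph
        if all(t is NULL for t in chosen):               # step 3: stop when all pick NULL
            return graph

# Toy scorer: attach word 0 as a head of every other word once, then stop.
def toy_scorer(i, candidates, graph):
    return 0 if (i != 0 and 0 in candidates) else NULL
```

For example, `ips_parse(["root", "a", "b"], toy_scorer)` attaches word 0 as the head of words 1 and 2 in the first iteration, then terminates when every word selects NULL.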

non-deterministic oracle problem: there are several paths, depending on the order in which the arcs are created

In IPS parsing, the difficulty differs from path to path

Note: for sequence taggers, its effectiveness has been proven

In this paper ...

smiyawaki0820 commented 4 years ago

Neural Model


Sentence Encoder

Encoder of partial SDP graphs

dep-flags: F'

Predicate Selection Model

transition score:


For supervised learning ... cross entropy loss
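The supervised objective can be sketched as a softmax over a word's candidate-transition scores, followed by the negative log-probability of the gold transition (a NumPy sketch with assumed function names; the paper's actual scorer is a neural network):

```python
import numpy as np

def transition_probs(scores):
    """p_i(t_j): softmax over one word's candidate-transition scores."""
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()

def cross_entropy_loss(scores, gold):
    """Supervised loss: -log p of the gold transition (always non-negative)."""
    return -np.log(transition_probs(scores)[gold])
```

The loss is smallest when the gold transition already has the highest score, and it can never go below zero.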

Labeling model

also develop a semantic dep-labeling NN

score of label l for the arc from predicate j to word i

smiyawaki0820 commented 4 years ago

Reinforcement Learning

Policy gradient

Williams, 1992

a method for learning to iteratively act according to a dynamic environment in order to optimize future rewards

  • the agent ~ NN model predicting the transition probabilities p_i(t_j^τ)
  • the environment ~ include the partial SDP graph y^τ
  • the rewards ~ computed by comparing the predicted parse graph to the gold parse graph y^g

objective function: maximize the expected rewards


the transition policy for the w_i

PG learning algorithm for SDP

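A REINFORCE-style update (Williams, 1992) can be sketched as sampling a transition from the policy and weighting its negative log-probability by the reward (a toy sketch for a single decision; the paper applies this per word across the whole parse):

```python
import numpy as np

rng = np.random.default_rng(0)

def policy(scores):
    """Transition policy p_i(t_j) as a softmax over scores."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def pg_loss(scores, reward):
    """Policy gradient loss for one sampled transition: -reward * log p(t).
    Unlike cross entropy, this is negative whenever the reward is negative."""
    p = policy(scores)
    t = rng.choice(len(scores), p=p)  # sampling lets the model explore paths
    return -reward * np.log(p[t])
```

Because transitions are sampled rather than taken from a gold oracle, the model can explore transition paths that supervised training would never follow.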

Are the cross entropy loss and the policy gradient loss similar?

No.

| | reinforcement learning (policy gradient loss) | supervised learning (cross entropy loss) |
| --- | --- | --- |
| sampling of transitions | allows the model to explore transition paths | never follows sampled paths |
| decisions | dependent | independent (θ updated after parsing finishes) |
| loss | can be negative | non-negative |

Rewards for SDP

intermediate rewards (r_i^τ): given during parsing, at different τ

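A toy version of an intermediate reward at one parsing step (an assumed ±1 scheme for illustration only; the paper defines its own reward values):

```python
def intermediate_reward(word, transition, gold_heads):
    """Toy intermediate reward r_i^tau (assumed scheme, not the paper's
    exact values): +1 for creating a gold arc, -1 for a wrong arc,
    0 for selecting NULL (no arc created)."""
    if transition is None:  # NULL transition: no arc is created
        return 0.0
    return 1.0 if transition in gold_heads.get(word, set()) else -1.0
```

The key point is that rewards are computed by comparing each created arc against the gold parse graph y^g during parsing, not only at the end.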
smiyawaki0820 commented 4 years ago

Implementation Details

smiyawaki0820 commented 4 years ago

Experiments

(Comparative experiments) IPS + ML + RL

(At inference time)

smiyawaki0820 commented 4 years ago

Results


Evaluating Our Parser w/o Lemma


Effect of Reinforcement Learning
