nltk arc-eager notes

IKKO-Ohta commented 7 years ago

Resources

Natural Language Toolkit http://www.nltk.org
Universal Dependency v2 http://universaldependencies.org
CoNLL format

I split the downloaded data into one file per sentence, stored as "../auto/univ_dep_train/*.txt". Training appears to work with train[:1000], but not with train[1000:2000]. With train[2000:2150] it appears to work again.
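The per-sentence split described above could look roughly like this. It is a minimal sketch, assuming the sentences in the downloaded file are separated by blank lines (as in CoNLL-style files); the function names and the numbered file naming are hypothetical, not what was actually used here.

```python
from pathlib import Path


def split_conll_sentences(text):
    """Split CoNLL-style text into sentence blocks separated by blank lines."""
    blocks = []
    current = []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # A blank line closes the sentence collected so far.
            blocks.append("\n".join(current))
            current = []
    if current:
        # The last sentence may not be followed by a blank line.
        blocks.append("\n".join(current))
    return blocks


def write_sentence_files(blocks, out_dir):
    """Write each block to its own numbered .txt file (hypothetical naming)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for i, block in enumerate(blocks):
        path = out_dir / "{:06d}.txt".format(i)
        path.write_text(block + "\n", encoding="utf-8")
```

A slice such as train[1000:2000] then corresponds to a range of these blocks, so a failing range can be bisected down to individual sentences.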

IKKO-Ohta commented 7 years ago

Keyboard interrupt issued when the train command never returns:

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-12-2106eb9fd1ea> in <module>()
      8 # => holds the feature index internally [but not the weights]
      9 # => calls out to libsvm
---> 10 parser.train(graphs,"../model/full.md")
     11 with open("../model/full_parser.pkl","wb") as f:
     12         pickle.dump(parser,f)

/Users/OtaIkko/.pyenv/versions/anaconda3-4.0.0/lib/python3.5/site-packages/nltk/parse/transitionparser.py in train(self, depgraphs, modelfile)
    508                 self._create_training_examples_arc_std(depgraphs, input_file)
    509             else:
--> 510                 self._create_training_examples_arc_eager(depgraphs, input_file)
    511 
    512             input_file.close()

/Users/OtaIkko/.pyenv/versions/anaconda3-4.0.0/lib/python3.5/site-packages/nltk/parse/transitionparser.py in _create_training_examples_arc_eager(self, depgraphs, input_file)
    445                 b0 = conf.buffer[0]
    446                 features = conf.extract_features()
--> 447                 binary_features = self._convert_to_binary_features(features)
    448 
    449                 if len(conf.stack) > 0:

/Users/OtaIkko/.pyenv/versions/anaconda3-4.0.0/lib/python3.5/site-packages/nltk/parse/transitionparser.py in _convert_to_binary_features(self, features)
    320         unsorted_result = []
    321         for feature in features:
--> 322             self._dictionary.setdefault(feature, len(self._dictionary))
    323             unsorted_result.append(self._dictionary[feature])
    324 

KeyboardInterrupt:

IKKO-Ohta commented 7 years ago

It is a mystery that some ranges work and others do not. It is not a matter of data size: train[:1000] works, while train[1000:2000] fails. To do: re-read the UD documentation.
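One thing worth checking per range: the CoNLL-U format used by UD v2 allows comment lines starting with "#", multiword token lines with range IDs such as "1-2", and empty nodes with decimal IDs such as "2.1". A reader written for older CoNLL formats may not expect these. Whether they are the cause of the hang here is unverified; the sketch below only reports and filters such lines, and its function names are hypothetical.

```python
import re

# CoNLL-U ID column: plain integers for words, "3-4" for multiword tokens,
# "3.1" for empty nodes. Lines starting with "#" are sentence-level comments.
RANGE_ID = re.compile(r"^\d+-\d+$")
EMPTY_NODE_ID = re.compile(r"^\d+\.\d+$")


def conllu_specific_lines(block):
    """Return (kind, line) pairs for lines a plain CoNLL reader may not expect."""
    found = []
    for line in block.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            found.append(("comment", line))
            continue
        token_id = line.split("\t", 1)[0]
        if RANGE_ID.match(token_id):
            found.append(("multiword", line))
        elif EMPTY_NODE_ID.match(token_id):
            found.append(("empty_node", line))
    return found


def strip_conllu_specific_lines(block):
    """Keep only the lines whose ID column is a plain integer."""
    kept = [
        line for line in block.splitlines()
        if line.strip() and line.split("\t", 1)[0].isdigit()
    ]
    return "\n".join(kept)
```

Counting the flagged lines in train[:1000] versus train[1000:2000] would show whether the failing range differs in this respect.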

IKKO-Ohta commented 7 years ago

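Support for Universal Dependencies is currently under construction and not stable. https://github.com/nltk/nltk/wiki/Dependency-Parsing I could implement it myself, but that does not seem to be the main point of this work.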

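Judging from https://github.com/nltk/nltk/issues/694, the CoNLL format is definitely supported, and given the timing I suspect that "CoNLL format" refers to the CoNLL 2009 shared task. Based on the shared task home page http://ufal.mff.cuni.cz/conll2009-st/, I take https://catalog.ldc.upenn.edu/LDC2012T03 and https://catalog.ldc.upenn.edu/LDC2012T04 to be the corresponding data. Downloading requires some form of authentication or membership registration, but I would expect that this data, at least, is usable.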

IKKO-Ohta commented 7 years ago

Feed the PTB to https://github.com/ninjin/pennconverter and let it generate the CoNLL files automatically.

IKKO-Ohta commented 7 years ago

Confirmed that it runs with Pennconverter + Penn Treebank. The training program is now running.

IKKO-Ohta commented 7 years ago

Dataset

Penn Treebank, just under 2,400 sentences, train:test = 9:1
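A 9:1 split like this can be sketched as follows. Whether the sentences were shuffled before splitting is not stated here, so the seeded shuffle is an assumption, and the function name is hypothetical.

```python
import random


def split_train_test(items, train_ratio=0.9, seed=0):
    """Shuffle a copy of items and split it into (train, test) lists."""
    shuffled = list(items)
    # A fixed seed keeps the split reproducible between runs (assumption:
    # the original split may have been a plain, unshuffled slice instead).
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

Fixing the seed means that training subsets of different sizes can be drawn from the same train list and scored against the same test list.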

Evaluation method

The Labeled Attachment Score (LAS) / the Unlabeled Attachment Score (UAS), using the nltk scoring as is

Results

Training sentences	LAS	UAS
700	0.3979	0.3337
1400	0.4720	0.4032
2080	0.4900	0.4179

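The results are terrible, and since the gain from 1400 => 2080 is small, it is unclear whether adding more data would even help. However, I recall EDA reaching just under 90% with around 2,000 sentences. These numbers look wrong, so I suspect there is a mistake somewhere. Debugging is left as future work.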
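As a starting point for that debugging, the two scores can be recomputed independently. UAS counts tokens whose predicted head is correct, and LAS counts tokens whose head and label are both correct, so LAS can never exceed UAS. In the table the LAS column is larger than the UAS column in every row, which suggests the columns may be swapped or the returned values mislabeled. The sketch below is a plain reimplementation under that definition, not the nltk scorer; the function name and input layout are hypothetical.

```python
def attachment_scores(gold, predicted):
    """Compute (UAS, LAS) from parallel lists of (head, label) pairs per token."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("no tokens to score")
    head_correct = 0
    both_correct = 0
    for (g_head, g_label), (p_head, p_label) in zip(gold, predicted):
        if g_head == p_head:
            head_correct += 1
            # The label only counts when the head is also correct.
            if g_label == p_label:
                both_correct += 1
    total = len(gold)
    return head_correct / total, both_correct / total
```

Running this on a handful of parsed test sentences and comparing against the nltk output would show which of the two reported numbers is which.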

IKKO-Ohta / e2l