WorksApplications / SudachiPy

Python version of Sudachi, a Japanese tokenizer.
Apache License 2.0

JoinKatakana plugin behaves differently from the Java version #162

Closed kazuma-t closed 3 years ago

kazuma-t commented 3 years ago

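In SudachiPy, the JoinKatakana plugin always creates a new OOV node when it concatenates nodes in concatenate_oov(). The Java version instead calls Lattice#getMinimumNode(), which returns the lowest-cost node already in the lattice if any node covers the same range. As the dumps below show, for the input オバケ the Java version returns the dictionary entry (normalized form お化け), while SudachiPy returns an OOV node with word ID 0 and zero costs (normalized form オバケ).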
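The difference described above can be sketched as follows. This is a minimal illustration, not the code of either implementation: the `Node` class, the list-based lattice, and the bodies of `get_minimum_node` and `concatenate_oov` are hypothetical stand-ins that only borrow the names mentioned in this issue. The offsets, word IDs, and costs are copied from the Java dump below, assuming the last column of each dump line is the word cost.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical, simplified lattice node (not SudachiPy's LatticeNode)."""
    begin: int
    end: int
    surface: str
    word_id: int  # 0 stands in for "no dictionary entry" in this sketch
    cost: int
    is_oov: bool = False


def get_minimum_node(lattice: List[Node], begin: int, end: int) -> Optional[Node]:
    """Return the lowest-cost node spanning exactly [begin, end), or None."""
    candidates = [n for n in lattice if n.begin == begin and n.end == end]
    return min(candidates, key=lambda n: n.cost, default=None)


def concatenate_oov(path: List[Node], lattice: List[Node]) -> Node:
    """Join the nodes in `path`, preferring a node already in the lattice."""
    begin, end = path[0].begin, path[-1].end
    existing = get_minimum_node(lattice, begin, end)
    if existing is not None:
        # Behavior described for the Java version: reuse the dictionary node.
        return existing
    # Fallback (and what SudachiPy is reported to do unconditionally):
    # build a fresh OOV node that carries no dictionary information.
    surface = "".join(n.surface for n in path)
    return Node(begin, end, surface, word_id=0, cost=0, is_oov=True)


# Values taken from the dumps in this issue.
lattice = [Node(0, 9, "オバケ", 816334, 10000)]
path = [Node(0, 3, "オ", 185851, 5621), Node(3, 9, "バケ", 233719, 3446)]

joined = concatenate_oov(path, lattice)
print(joined.surface, joined.word_id, joined.is_oov)
```

With the lattice populated, the sketch returns the existing オバケ node, so dictionary fields such as the normalized form remain available. With an empty lattice it falls back to an OOV node, which corresponds to the `オバケ(0) 0 0 0` line in the SudachiPy dump.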

Sudachi (Java version)

=== Input dump:
オバケ
=== Lattice dump:
0: 9 9 (null)(0) BOS/EOS 0 0 0: 50 50 -739 -286 -944 211 -250 -163 -205 -852 -852 50 -739 -286 -944 211 -250 -852 -852 -955 50 -739 -286 -944 211 -250
1: 0 9 オバケ(816334) 名詞,普通名詞,一般,*,*,* 5139 5139 10000: 893
...
51: 0 3 オ(0) 感動詞,一般,*,*,*,* 5687 5687 15272: -640
52: 0 0 (null)(0) BOS/EOS 0 0 0: 0
=== Before rewriting:
0: 0 3 オ(185851) 67 5946 5946 5621
1: 3 9 バケ(233719) 3 5142 5142 3446
=== After rewriting:
0: 0 9 オバケ(816334) 3 5139 5139 10000
===
オバケ  名詞,普通名詞,一般,*,*,*        お化け
EOS

SudachiPy

=== Inupt dump:
オバケ
=== Lattice dump:
1: 9 9 (null)(0) BOS/EOS 0 0 0: 50 50 -739 -286 -944 211 -250 -163 -205 -852 -852 50 -739 -286 -944 211 -250 -852 -852 -955 50 -739 -286 -944 211 -250
2: 0 9 オバケ(816309) 名詞,普通名詞,一般,*,*,* 5139 5139 10000: 893
...
41: 0 0 (null)(0) BOS/EOS 0 0 0: 0
=== Before Rewriting:
0: 0 3 オ(185851) 5946 5946 5621
1: 3 9 バケ(233719) 5142 5142 3446
=== After Rewriting:
0: 0 9 オバケ(0) 0 0 0
===
オバケ  名詞,普通名詞,一般,*,*,*        オバケ
EOS

kazuma-t commented 3 years ago

Fixed in #163