WorksApplications / SudachiPy

Python version of Sudachi, a Japanese tokenizer.
Apache License 2.0

spaCy model accuracy significantly degraded from SudachiPy v0.4.6 #129

hiroshi-matsuda-rit closed this issue 4 years ago

hiroshi-matsuda-rit commented 4 years ago

@sorami @polm Could you investigate the cause of this difference between v0.4.5 and v0.4.6?

$ pip install -U sudachipy==0.4.7
$ python -m spacy evaluate ja_core_news_lg-2.3.1/ja_core_news_lg/ja_core_news_lg-2.3.1/ ja_gsd-ud-dev.ne.json

Time      1.24 s
Words     11887
Words/s   9596
TOK       91.53
POS       82.62
UAS       75.75
LAS       74.45
NER P     68.31
NER R     65.52
NER F     66.88
Textcat   0.00

$ pip install -U sudachipy==0.4.6
$ python -m spacy evaluate ja_core_news_lg-2.3.1/ja_core_news_lg/ja_core_news_lg-2.3.1/ ja_gsd-ud-dev.ne.json

Time      1.40 s
Words     11887
Words/s   8492
TOK       91.53
POS       82.62
UAS       75.75
LAS       74.45
NER P     68.31
NER R     65.52
NER F     66.88
Textcat   0.00

$ pip install -U sudachipy==0.4.5
$ python -m spacy evaluate ja_core_news_lg-2.3.1/ja_core_news_lg/ja_core_news_lg-2.3.1/ ja_gsd-ud-dev.ne.json

Time      1.35 s
Words     12121
Words/s   8990
TOK       97.67
POS       97.30
UAS       88.94
LAS       87.55
NER P     71.79
NER R     69.22
NER F     70.48
Textcat   0.00

hiroshi-matsuda-rit commented 4 years ago

This problem was reproduced on macOS 10.14.6, Windows 10 (version 1909), and WSL with Python 3.8.

sorami commented 4 years ago

For v0.4.6 and v0.4.7 the only major updates were about Cythonization, so if the problem is within SudachiPy, I guess it has something to do with Cython.

Let us have a look.

polm commented 4 years ago

Thanks for the report. I found the cause: I screwed up in the Cythonization, and the connect costs were wrong. I just opened a PR with a fix.
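For anyone wondering why wrong connect costs hurt tokenization this much: a lattice-based tokenizer picks the segmentation that minimizes the sum of word costs and connection costs between adjacent words. The toy sketch below is not SudachiPy's code. The lexicon, the IDs, and every cost value are invented for illustration, and the end-of-sentence connection is ignored.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: surface -> (left_id, right_id, word_cost).
LEXICON: Dict[str, Tuple[int, int, int]] = {
    "東京": (1, 1, 300),
    "都": (2, 2, 200),
    "東": (3, 3, 250),
    "京都": (4, 4, 300),
}

# Hypothetical connection costs keyed by (right_id of previous, left_id of next).
GOOD_COSTS = {(1, 2): -100, (3, 4): 200}
# The same table with wrong values, standing in for a lookup bug.
BAD_COSTS = {(1, 2): 500, (3, 4): -200}


def best_path(text: str, connect_cost: Dict[Tuple[int, int], int]) -> List[str]:
    """Return the minimum-cost segmentation of `text` (Viterbi over a lattice)."""
    n = len(text)
    # best[i] maps the right_id of the last word ending at i -> (cost, tokens).
    best: List[Dict[int, Tuple[int, List[str]]]] = [dict() for _ in range(n + 1)]
    best[0][0] = (0, [])  # id 0 stands for beginning of sentence
    for start in range(n):
        for prev_right, (cost, tokens) in best[start].items():
            for surface, (left, right, word_cost) in LEXICON.items():
                if not text.startswith(surface, start):
                    continue
                end = start + len(surface)
                # Missing pairs default to a connection cost of 0.
                total = cost + connect_cost.get((prev_right, left), 0) + word_cost
                if right not in best[end] or total < best[end][right][0]:
                    best[end][right] = (total, tokens + [surface])
    if not best[n]:
        raise ValueError("no segmentation covers the whole text")
    return min(best[n].values(), key=lambda entry: entry[0])[1]


if __name__ == "__main__":
    print(best_path("東京都", GOOD_COSTS))
    print(best_path("東京都", BAD_COSTS))
```

With the first table the cheapest path is 東京 + 都. With the wrong table the same lexicon yields 東 + 京都, so every downstream score that depends on token boundaries suffers.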

hiroshi-matsuda-rit commented 4 years ago

I strongly recommend adding a spaCy evaluation step to the CI tests. With the spaCy CLI and UD_Japanese-GSD v2.6-NE, you can run the evaluation like this:

# prepare sudachipy module before executing below steps
$ pip install -U spacy sudachidict-core
$ python -m spacy download ja_core_news_md
$ python -m spacy evaluate ja_core_news_md ja_gsd-ud-test.ne.json

================================== Results ==================================

Time      1.29 s
Words     13053
Words/s   10131
TOK       98.11
POS       97.94
UAS       88.16
LAS       86.18
NER P     72.79
NER R     72.91
NER F     72.85
Textcat   0.00

Any decline in the TOK score should be within 0.1%.
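A CI job could enforce that threshold roughly as follows. This is a minimal sketch, assuming the evaluation output has the plain `NAME   VALUE` layout shown in this thread and that "within 0.1%" means 0.1 points on the reported score. The function names and the regular expression are made up for this example, and the sample strings are abbreviated from the results posted in this issue.

```python
import re
from typing import Dict

# Matches lines such as "TOK       98.11", "NER P     72.79" or "Time      1.29 s".
_METRIC = re.compile(r"^([A-Za-z/ ]+?)\s{2,}(-?\d+(?:\.\d+)?)(?:\s+s)?\s*$")


def parse_scores(output: str) -> Dict[str, float]:
    """Collect metric name -> value from captured `spacy evaluate` output."""
    scores: Dict[str, float] = {}
    for line in output.splitlines():
        match = _METRIC.match(line.strip())
        if match:
            scores[match.group(1)] = float(match.group(2))
    return scores


def within_tolerance(
    baseline: Dict[str, float],
    current: Dict[str, float],
    metric: str = "TOK",
    tolerance: float = 0.1,
) -> bool:
    """True if `metric` dropped by at most `tolerance` points (assumed meaning)."""
    return baseline[metric] - current[metric] <= tolerance


# Abbreviated outputs taken from the results posted in this thread.
V0_4_8_OUTPUT = """
TOK       98.11
POS       97.94
"""
V0_4_7_OUTPUT = """
TOK       91.93
POS       82.06
"""

if __name__ == "__main__":
    baseline = parse_scores(V0_4_8_OUTPUT)
    candidate = parse_scores(V0_4_7_OUTPUT)
    print(within_tolerance(baseline, candidate))
```

In a real pipeline the two strings would come from the captured output of the `python -m spacy evaluate` command above, and the job would fail whenever `within_tolerance` returns `False`.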

sorami commented 4 years ago

@hiroshi-matsuda-rit

I've merged @polm's fix and released v0.4.8.

Sorry for the regression. Yes, we should include the spaCy evaluation step in CI (#132), or at least test with some paragraphs.
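The lighter "test with some paragraphs" option could look something like the sketch below: record the expected segmentation of a few texts from a known-good release, then compare on every build. The helper name and the whitespace stand-in tokenizer are hypothetical, and the golden data here is a placeholder rather than real SudachiPy output.

```python
from typing import Callable, Dict, Iterable, List


def find_mismatches(
    tokenize: Callable[[str], Iterable[str]],
    golden: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Return {text: actual_tokens} for every text whose tokens differ from golden."""
    mismatches: Dict[str, List[str]] = {}
    for text, expected in golden.items():
        actual = list(tokenize(text))
        if actual != expected:
            mismatches[text] = actual
    return mismatches


# Placeholder golden data using a whitespace tokenizer as a stand-in.
# Real entries should be recorded from a known-good SudachiPy release.
GOLDEN = {"a b c": ["a", "b", "c"]}

if __name__ == "__main__":
    # With SudachiPy, `tokenize` would instead wrap the tokenizer object,
    # for example by returning the surface of each morpheme (not run here).
    print(find_mismatches(str.split, GOLDEN))
```

An empty result means the segmentation still matches the recorded one. Any non-empty result should fail the build and show exactly which texts changed.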

v0.4.7

$ pip install -U sudachipy==0.4.7
$ python -m spacy evaluate ja_core_news_md ja_gsd-ud-test.ne.json

================================== Results ==================================

Time      1.10 s
Words     12817
Words/s   11630
TOK       91.93
POS       82.06
UAS       75.81
LAS       73.98
NER P     69.52
NER R     70.77
NER F     70.14
Textcat   0.00

v0.4.8

$ pip install -U sudachipy==0.4.8
$ python -m spacy evaluate ja_core_news_md ja_gsd-ud-test.ne.json

================================== Results ==================================

Time      1.20 s
Words     13053
Words/s   10871
TOK       98.11
POS       97.94
UAS       88.16
LAS       86.18
NER P     72.79
NER R     72.91
NER F     72.85
Textcat   0.00

hiroshi-matsuda-rit commented 4 years ago

I just tested v0.4.8 and got the same result. Thank you for the quick response!