WorksApplications / SudachiPy

Python version of Sudachi, a Japanese tokenizer.
Apache License 2.0

fix: slow ubuild with word info split #158

Closed yokomotod closed 3 years ago

yokomotod commented 3 years ago

fix #157

It now takes about 1 second to build a single record, and about the same for 100 records. This is the same speed as building with word ID split info.

$ sudachipy ubuild user.csv -o user.dic
reading the source file...1 words
writing the POS table...2 bytes
writing the connection matrix...4 bytes
building the trie...done
writing the trie...1028 bytes
writing the word-ID table...9 bytes
writing the word parameters...10 bytes
writing the word_infos...70 bytes
writing word_info offsets...4 bytes

real    0m0.935s
user    0m0.801s
sys     0m0.144s

user.csv:

舞台藝術,5146,5146,8000,舞台藝術,名詞,普通名詞,一般,*,*,*,ブタイゲイジュツ,舞台芸術,*,C,"舞台,名詞,普通名詞,一般,*,*,*,ブタイ/藝術,名詞,普通名詞,一般,*,*,*,ゲイジュツ","舞台,名詞,普通名詞,一般,*,*,*,ブタイ/藝術,名詞,普通名詞,一般,*,*,*,ゲイジュツ","舞台,名詞,普通名詞,一般,*,*,*,ブタイ/藝術,名詞,普通名詞,一般,*,*,*,ゲイジュツ",*
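
For readers unfamiliar with the "word info split" form used in this row, here is a minimal sketch of how such a field breaks down into units. It is not SudachiPy's builder code. `parse_split_units` is a hypothetical helper, and the column positions (15 to 17 holding the split fields) and the per-unit layout (surface, six POS fields, reading) are assumptions read off the sample row above.

```python
import csv
import io

# One split field from the sample row: two units separated by "/".
UNIT_SPLIT = "舞台,名詞,普通名詞,一般,*,*,*,ブタイ/藝術,名詞,普通名詞,一般,*,*,*,ゲイジュツ"

# The sample row from user.csv, rebuilt piece by piece to keep lines short.
ROW = (
    "舞台藝術,5146,5146,8000,舞台藝術,名詞,普通名詞,一般,*,*,*,"
    "ブタイゲイジュツ,舞台芸術,*,C,"
    f'"{UNIT_SPLIT}","{UNIT_SPLIT}","{UNIT_SPLIT}",*'
)


def parse_split_units(field):
    """Hypothetical helper: turn a word-info split field into unit tuples."""
    if field == "*":
        return []  # "*" means no split information is given
    units = []
    for unit in field.split("/"):
        parts = unit.split(",")
        # Assumed unit layout: surface, six POS fields, reading.
        units.append((parts[0], tuple(parts[1:7]), parts[7]))
    return units


# csv.reader handles the quoted split fields that contain commas.
columns = next(csv.reader(io.StringIO(ROW)))
a_units = parse_split_units(columns[15])  # assumed position of the first split field

for surface, pos, reading in a_units:
    print(surface, pos, reading)
```

Each unit names a word by its surface, part of speech, and reading rather than by a numeric word ID, so the builder has to resolve it to an ID at build time.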

yokomotod commented 3 years ago

I think we already have test cases for split info parsing: https://github.com/WorksApplications/SudachiPy/blob/7e5b501111920cef057b821412af595db248a7b8/tests/dictionarylib/test_userdictionarybuilder.py#L86

yokomotod commented 3 years ago

Added a fallback for https://github.com/WorksApplications/SudachiPy/pull/158#pullrequestreview-691328956

Not so cool, though...

katsutan commented 3 years ago

@yokomotod Thank you for the feedback. We have confirmed that the changes are correct, so we will merge them. Also, I am sorry for the late reply.