WorksApplications / SudachiPy

Python version of Sudachi, a Japanese tokenizer.
Apache License 2.0

Speed up execution by re-using unk info #117

Closed by polm 4 years ago

polm commented 4 years ago

Based on my benchmark, time was 1m23s before this change and 1m8s after, which is about an 18% reduction in runtime.
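For anyone checking the percentage, the arithmetic can be reproduced with a few lines. This only restates the two timings quoted above; the variable names are illustrative.

```python
# Timings quoted above, converted to seconds.
before_s = 1 * 60 + 23  # 1m23s
after_s = 1 * 60 + 8    # 1m8s

# Relative reduction in runtime.
reduction = (before_s - after_s) / before_s
print(f"{reduction:.1%}")  # prints "18.1%"
```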

The current code creates a new string and WordInfo object for every single LatticeNode. The WordInfo is only used if the word is an unk (unknown word), and it's never modified. This change creates a single shared instance, which every LatticeNode returns when it needs one.

This is part of #74.

sorami commented 4 years ago

Released as v0.4.4 :tada:

Release v0.4.4 · WorksApplications/SudachiPy

Also available on PyPI: https://pypi.org/project/SudachiPy/0.4.4/