WorksApplications / SudachiPy

Python version of Sudachi, a Japanese tokenizer.
Apache License 2.0

faster parsing #74

Closed izziiyt closed 3 years ago

izziiyt commented 5 years ago

Guessed bottlenecks are

izziiyt commented 5 years ago

polm commented 4 years ago

Hey, what's the status of this project? If I could help with Cythonizing SudachiPy I'd be glad to do so.

izziiyt commented 4 years ago

@polm Your contribution is very welcome 👍 I won't have enough time until Feb because of work and personal commitments, so I think your contribution won't conflict with ours.

@sorami We should write contribution guidelines or coding guidelines to share our coding principles for PRs that change a fair number of lines, like the one @polm may create. I want to discuss this with you directly, so I'll reach you via Slack!

polm commented 4 years ago

Thanks, good to know! Not sure I'll get started on this before the new year but I'll definitely have time in January.

polm commented 4 years ago

Hello, I ended up not having time for this in January - sorry it took so long, but I'm looking at it now.

polm commented 4 years ago

Worked on this some more today; it's in the cython branch in my fork. Based on my benchmark, processing time went from roughly 50s to 12s. (This is faster than I reported in my latest PR because that measurement was taken under cProfile, which slowed things down.)
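
To illustrate the cProfile point, here is a minimal sketch of how the same workload can be timed with and without the profiler attached. It does not use SudachiPy: `toy_tokenize`, `wall_time`, and `profiled_time` are hypothetical names made up for this example, and the workload is a stand-in with many small calls, which is the kind of code where profiler overhead tends to be most visible.

```python
import cProfile
import time


def toy_tokenize(text):
    # Hypothetical stand-in workload (not SudachiPy): split on whitespace
    # one character at a time, so there are many small method calls.
    tokens, current = [], []
    for ch in text:
        if ch.isspace():
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def wall_time(fn, *args):
    # Plain wall-clock time of a single call, no profiler attached.
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


def profiled_time(fn, *args):
    # Wall-clock time of the same call while cProfile is tracing it.
    profiler = cProfile.Profile()
    start = time.perf_counter()
    profiler.runcall(fn, *args)
    return time.perf_counter() - start


if __name__ == "__main__":
    text = "faster parsing for a tokenizer " * 5000
    print(f"plain:    {wall_time(toy_tokenize, text):.3f}s")
    print(f"cProfile: {profiled_time(toy_tokenize, text):.3f}s")
```

The profiled number is usually the larger of the two, so a profiler is better suited to finding hot spots than to reporting end-to-end speedups.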

sorami commented 4 years ago

@polm, thank you, that sounds awesome!!

I probably won't have time today or tomorrow, but I'll check it and merge in the next few days.

eiennohito commented 3 years ago

0.6.0 is ~30x faster