WorksApplications / SudachiPy

Python version of Sudachi, a Japanese tokenizer.
Apache License 2.0

`BufferError` when creating a tokenizer multiple times #126

Closed by himkt 4 years ago

himkt commented 4 years ago

Summary

Hello, thank you for maintaining the amazing text processing toolkit SudachiPy!

I found that an error occurs when a tokenizer is created twice. I'm not sure what the root cause is, but this didn't happen 28 days ago and it does today. I suspect it is related to some changes introduced in v0.4.6, as it doesn't reproduce in v0.4.5.

Reproduction

from sudachipy import dictionary, tokenizer
_tokenizer = dictionary.Dictionary().create()
_tokenizer = dictionary.Dictionary().create()  # BufferError raised here
Dockerfile used for reproduction:

FROM ubuntu:18.04

ENV LANG "ja_JP.UTF-8"
ENV PYTHONIOENCODING "utf-8"

RUN apt update -y && apt install -y language-pack-ja

# python
RUN apt update -y && apt install -y python3 python3-dev python3-pip

# konoha
WORKDIR /work

RUN pip3 install -U pip
RUN pip3 install 'SudachiPy==0.4.6'  # note: 0.4.5 works without any problem
RUN pip3 install 'sudachidict-core==20200330'
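
For anyone stuck on v0.4.6, a minimal workaround sketch is to construct the dictionary once per process and reuse it, so the second `Dictionary()` call that triggers the error never happens. The `get_tokenizer` helper and module-level cache below are illustrative, not part of SudachiPy:

# Workaround sketch (illustrative, not from SudachiPy): build the dictionary
# once per process and hand out tokenizers from that single shared instance.
from sudachipy import dictionary

_DICT = None  # hypothetical module-level cache

def get_tokenizer():
    """Return a tokenizer backed by a single shared Dictionary instance."""
    global _DICT
    if _DICT is None:
        _DICT = dictionary.Dictionary()
    return _DICT.create()

tokenizer_a = get_tokenizer()
tokenizer_b = get_tokenizer()  # no second Dictionary is constructed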
polm commented 4 years ago

Thanks for the bug report! This is definitely related to the Cythonization, sorry about that. I'll take a look at fixing it.
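
For context, a `BufferError` in this situation is typically raised by `mmap.close()` when buffer exports (such as the typed memoryviews a Cython build can hold) are still alive. A small self-contained sketch of that general mechanism, not of SudachiPy's actual internals:

# Illustrative sketch (not SudachiPy code): mmap.close() raises BufferError
# while a memoryview still exports the underlying buffer.
import mmap
import tempfile

with tempfile.TemporaryFile() as f:
    f.write(b"\x00" * 16)
    f.flush()
    m = mmap.mmap(f.fileno(), 16)
    view = memoryview(m)  # an outstanding buffer export
    try:
        m.close()  # BufferError: cannot close exported pointers exist
    except BufferError as err:
        print(err)
    view.release()  # drop the export first ...
    m.close()       # ... then closing succeeds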

sorami commented 4 years ago

@himkt I have released v0.4.7, which includes the fix by @polm.

Please give it a try with this version.

himkt commented 4 years ago

Thank you for the really quick response, @sorami @polm. I confirmed that it works fine.

root@19933a16df1e:/work# python3
Python 3.6.9 (default, Apr 18 2020, 01:56:04)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sudachipy
>>> sudachipy.__version__
'0.4.7'
>>> from sudachipy import dictionary, tokenizer
>>> _tokenizer = dictionary.Dictionary().create()
>>> _tokenizer = dictionary.Dictionary().create()
sorami commented 4 years ago

Great to hear that it works! All credit goes to @polm.