WorksApplications / SudachiPy

Python version of Sudachi, a Japanese tokenizer.
Apache License 2.0

Tokenization fails because of UnicodeDecodeError for specific Python versions #19

Closed. mana-ysh closed this issue 5 years ago.

mana-ysh commented 6 years ago

Detail: https://github.com/WorksApplications/SudachiPy/issues/17#issuecomment-435671553

kazuma-t commented 5 years ago

I cannot reproduce this with Python 3.6.4 on Ubuntu 18.04 (WSL):

import json
from sudachipy import tokenizer
from sudachipy import dictionary
from sudachipy import config

# Load the default settings shipped with SudachiPy
with open(config.SETTINGFILE, "r", encoding="utf-8") as f:
    settings = json.load(f)

# Input that reportedly triggered the UnicodeDecodeError (contains emoji)
lines = "地:🇯🇵日本・東京都\n▪️身長/体重:175cm/60kg\n▪️靴のサイズ:26,5\n\nTwitte"

# Build the dictionary and tokenizer, then tokenize with SplitMode.A
dic = dictionary.Dictionary(settings)
tokenizer_obj = dic.create()
tokenizer_obj.tokenize(tokenizer.Tokenizer.SplitMode.A, lines.strip())

I use the latest system_core.dic.
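
Since the report says the error only appears for specific Python versions, it may help to record the exact interpreter and platform when re-testing. This is just a generic environment check using the standard library, not part of SudachiPy:

import sys
import platform

# Print the interpreter version and platform so environments can be compared
print(sys.version)
print(platform.platform())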

izziiyt commented 5 years ago

I have probably fixed this. I say "probably" because I could not reproduce the same error with the same text. If the same error occurs again, please reopen this issue or open a new one.
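
A minimal way to re-check would be to run the reported input through the same API as in the snippet above and print the surface forms. This is only a sketch that assumes the same settings loading shown earlier, not the actual verification used for the fix:

import json
from sudachipy import tokenizer
from sudachipy import dictionary
from sudachipy import config

with open(config.SETTINGFILE, "r", encoding="utf-8") as f:
    settings = json.load(f)

text = "地:🇯🇵日本・東京都\n▪️身長/体重:175cm/60kg\n▪️靴のサイズ:26,5\n\nTwitte"

# Tokenize the previously failing input; a UnicodeDecodeError here would mean the bug is still present
tokenizer_obj = dictionary.Dictionary(settings).create()
for m in tokenizer_obj.tokenize(tokenizer.Tokenizer.SplitMode.A, text.strip()):
    print(m.surface())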