WorksApplications / SudachiPy

Python version of Sudachi, a Japanese tokenizer.
Apache License 2.0

Tokenization fails because of UnicodeDecodeError for specific Python versions #19

Closed. mana-ysh closed this issue 5 years ago.

mana-ysh commented 6 years ago

Detail: https://github.com/WorksApplications/SudachiPy/issues/17#issuecomment-435671553

kazuma-t commented 5 years ago

I cannot reproduce this with Python 3.6.4 on Ubuntu 18.04 (WSL):

import json
from sudachipy import tokenizer
from sudachipy import dictionary
from sudachipy import config

# Load the default settings shipped with SudachiPy
with open(config.SETTINGFILE, "r", encoding="utf-8") as f:
    settings = json.load(f)

# Input that reportedly triggered the UnicodeDecodeError (contains emoji)
lines = "地:🇯🇵日本・東京都\n▪️身長/体重:175cm/60kg\n▪️靴のサイズ:26,5\n\nTwitte"

# Build the dictionary and tokenizer, then tokenize with SplitMode.A
dic = dictionary.Dictionary(settings)
tokenizer_obj = dic.create()
tokenizer_obj.tokenize(tokenizer.Tokenizer.SplitMode.A, lines.strip())

I use the latest system_core.dic.
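
Since the report says the error only appears for specific Python versions, it may help to record the exact interpreter and platform when re-testing. This is just a generic environment check using the standard library, not part of SudachiPy:

import sys
import platform

# Print the interpreter version and platform so environments can be compared
print(sys.version)
print(platform.platform())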

izziiyt commented 5 years ago

I have probably fixed this. I say "probably" because I could not reproduce the same error with the same text. If the same error occurs again, please reopen this issue or open a new one.
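
A minimal way to re-check would be to run the reported input through the same API as in the snippet above and print the surface forms. This is only a sketch that assumes the same settings loading shown earlier, not the actual verification used for the fix:

import json
from sudachipy import tokenizer
from sudachipy import dictionary
from sudachipy import config

with open(config.SETTINGFILE, "r", encoding="utf-8") as f:
    settings = json.load(f)

text = "地:🇯🇵日本・東京都\n▪️身長/体重:175cm/60kg\n▪️靴のサイズ:26,5\n\nTwitte"

# Tokenize the previously failing input; a UnicodeDecodeError here would mean the bug is still present
tokenizer_obj = dictionary.Dictionary(settings).create()
for m in tokenizer_obj.tokenize(tokenizer.Tokenizer.SplitMode.A, text.strip()):
    print(m.surface())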