WorksApplications / SudachiPy

Python version of Sudachi, a Japanese tokenizer.
Apache License 2.0

No reading form for certain words #103

Closed by sorami 5 years ago

sorami commented 5 years ago
>>> from sudachipy import tokenizer, dictionary
>>> tokenizer_obj = dictionary.Dictionary().create()
>>> [m.reading_form() for m in tokenizer_obj.tokenize("コンピュータ")]
['']
>>> [m.reading_form() for m in tokenizer_obj.tokenize("計算機")]
['ケイサンキ']

As shown above, reading_form() returns an empty string for "コンピュータ". When a word has no reading form in the lexicon, reading_form() should fall back to the surface form instead.

For example, the original Java implementation handles this in dictionary/WordInfoList.java:

    WordInfo getWordInfo(int wordId) {

        ...

        String readingForm = bufferToString(buf);
        if (readingForm.isEmpty()) {
            readingForm = surface;
        }

        ...

    }
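
On the Python side, the same fallback could look roughly like the sketch below. The helper names `reading_or_surface` and `readings` are hypothetical and not part of SudachiPy. The actual fix presumably belongs wherever the word info is read from the dictionary, mirroring the Java code. The second helper only assumes objects that expose `reading_form()` and `surface()`, as the morphemes in the example above do.

```python
def reading_or_surface(reading_form: str, surface: str) -> str:
    """Return the reading form, or the surface if the reading form is empty.

    Mirrors the Java check `if (readingForm.isEmpty()) readingForm = surface;`.
    """
    return reading_form if reading_form else surface


def readings(morphemes):
    """Apply the fallback to any objects exposing reading_form() and surface().

    Hypothetical caller-side workaround; it does not change the tokenizer.
    """
    return [reading_or_surface(m.reading_form(), m.surface()) for m in morphemes]
```

Until the library applies the fallback itself, a caller could use it as `readings(tokenizer_obj.tokenize("コンピュータ"))`.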

Thanks to sig_m on the Slack channel for reporting this!