WorksApplications / SudachiPy

Python version of Sudachi, a Japanese tokenizer.
Apache License 2.0

How can we detect unknown words? #112

Closed fullflu closed 4 years ago

fullflu commented 4 years ago

Overview

I want to extract unknown words after tokenization. When using MeCab, unknown words can be detected by checking the feature value (if feature[4] == '*', the word is unknown).
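The MeCab-side heuristic described above can be sketched as a small helper. This is a sketch only: `is_unknown_mecab` is a hypothetical name, and the field index follows the heuristic exactly as stated, not any official MeCab API.

```python
def is_unknown_mecab(feature):
    """Heuristic from the text: treat the word as unknown when
    field 4 of the comma-separated feature string is '*'."""
    return feature.split(",")[4] == "*"

# Known word: field 4 carries a concrete value (the conjugation type here).
print(is_unknown_mecab("動詞,自立,*,*,五段・ラ行,基本形,ある,アル,アル"))  # False
# Unknown word: field 4 is '*'.
print(is_unknown_mecab("名詞,固有名詞,一般,*,*,*,*"))  # True
```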

I have a hypothesis that we could detect unknown words by using SudachiPy as follows:

Let m be a morpheme. Then,

  1. m.dictionary_id() is -1 if m is unknown, else 0.
  2. m.word_id() is 0 if m is unknown, else a natural number (greater than 0).

Is this hypothesis correct? Or are there other ways to detect unknown words?

Example

I checked this hypothesis with a simple example. ラッスンゴレライ is presumably an unknown word.

Code: (the version of sudachipy is '0.3.13')

from sudachipy import tokenizer
from sudachipy import dictionary

tokenizer_obj = dictionary.Dictionary().create()

for m in tokenizer_obj.tokenize("ラッスンゴレライ説明してね。"):
    print(f"surface: {m.surface()}")
    print(f"reading: {m.reading_form()}")
    print(f"normalized: {m.normalized_form()}")
    print(f"dictionary id: {m.dictionary_id()}")
    print(f"word id: {m.word_id()}")
    print(f"dictionary_form_word_id: {m.word_info.dictionary_form_word_id}")
    print("-----------")

Result

surface: ラッスンゴレライ
reading: 
normalized: ラッスンゴレライ
dictionary id: -1
word id: 0
dictionary_form_word_id: -1
-----------
surface: 説明
reading: セツメイ
normalized: 説明
dictionary id: 0
word id: 684900
dictionary_form_word_id: -1
-----------
surface: し
reading: シ
normalized: 為る
dictionary id: 0
word id: 67754
dictionary_form_word_id: 78796
-----------
surface: て
reading: テ
normalized: て
dictionary id: 0
word id: 100900
dictionary_form_word_id: -1
-----------
surface: ね
reading: ネ
normalized: ね
dictionary id: 0
word id: 117617
dictionary_form_word_id: -1
-----------
surface: 。
reading: 
normalized: 。
dictionary id: 0
word id: 6619
dictionary_form_word_id: -1
-----------
sorami commented 4 years ago

Hi, thank you for the inquiry! Let me explain (sorry, we should have documented this):

is_oov() method to check unknown word

Actually, there is a morpheme.is_oov() method to check if a word is OOV (out-of-vocabulary) or not!

In [1]: from sudachipy import tokenizer, dictionary

In [2]: tokenizer_obj = dictionary.Dictionary().create()

In [3]: for m in tokenizer_obj.tokenize("ラッスンゴレライ説明してね。"):
   ...:     print(f"surface: {m.surface()}")
   ...:     print(f"is_oov: {m.is_oov()}")
   ...:     print("-----------")
   ...:
surface: ラッスンゴレライ
is_oov: True
-----------
surface: 説明
is_oov: False
-----------
surface: し
is_oov: False
-----------
surface: て
is_oov: False
-----------
surface: ね
is_oov: False
-----------
surface: 。
is_oov: False
-----------

Command line -a option

When you use SudachiPy as a command line tool, there is a -a option that outputs extra information such as dictionary IDs, and appends (OOV) at the end of the line if the word is OOV.

$ echo ラッスンゴレライ説明してね。 | sudachipy -a
ラッスンゴレライ    名詞,普通名詞,一般,*,*,*    ラッスンゴレライ    ラッスンゴレライ        -1  (OOV)
説明  名詞,普通名詞,サ変可能,*,*,*  説明  説明  セツメイ    0
し   動詞,非自立可能,*,*,サ行変格,連用形-一般    為る  する  シ   0
て   助詞,接続助詞,*,*,*,* て   て   テ   0
ね   助詞,終助詞,*,*,*,*  ね   ね   ネ   0
。   補助記号,句点,*,*,*,* 。   。   。   0
EOS
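As a side note, that trailing `(OOV)` marker makes the `-a` output easy to filter programmatically. A minimal sketch, assuming the fields are tab-separated as in the CLI output (`oov_surfaces` is an illustrative name, not part of SudachiPy):

```python
def oov_surfaces(output):
    """Collect the surface forms of lines that `sudachipy -a` marked as OOV."""
    return [line.split("\t")[0]
            for line in output.splitlines()
            if line.endswith("(OOV)")]

sample = (
    "ラッスンゴレライ\t名詞,普通名詞,一般,*,*,*\tラッスンゴレライ\tラッスンゴレライ\t\t-1\t(OOV)\n"
    "説明\t名詞,普通名詞,サ変可能,*,*,*\t説明\t説明\tセツメイ\t0\n"
    "EOS"
)
print(oov_surfaces(sample))  # ['ラッスンゴレライ']
```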

Your hypotheses

m.dictionary_id() is -1 if m is unknown, else 0.

In the original Sudachi (Java) specification, the dictionary ID of an OOV is a negative value. The current implementation returns -1, but strictly speaking it can be any negative number.

m.word_id() is 0 if m is unknown, else a natural number (greater than 0).

Similarly, the original specification leaves the word ID for an OOV undefined. The current implementation returns 0, but that is not guaranteed in future versions.
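Since only the sign of the dictionary ID is guaranteed, a robust check should compare against zero rather than against -1. A sketch of that spec-compliant interpretation (the helper name is hypothetical):

```python
def is_oov_by_dictionary_id(dictionary_id):
    """Spec-compliant check: any negative dictionary ID marks an OOV,
    not just -1 (the value the current implementation happens to return)."""
    return dictionary_id < 0

print(is_oov_by_dictionary_id(-1))  # True  (current implementation's value)
print(is_oov_by_dictionary_id(-3))  # True  (still OOV per the specification)
print(is_oov_by_dictionary_id(0))   # False (word from the system dictionary)
```

That said, when the `is_oov()` method is available, it is the supported way to check.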

Misc

You mentioned that your SudachiPy version is 0.3.13. I strongly recommend upgrading to 0.4.x: it processes text much faster, and a number of bugs have been fixed.

You can update SudachiPy with pip as follows:

$ pip install -U sudachipy
$ pip show sudachipy
Name: SudachiPy
Version: 0.4.2
...

Release information: Releases · WorksApplications/SudachiPy

Hope it helps. Enjoy tokenization!

fullflu commented 4 years ago

Thank you for your quick response! I understand the implementation related to my hypothesis.

Actually, there is a morpheme.is_oov() method to check if a word is OOV (out-of-vocabulary) or not!

In my version of SudachiPy, the morpheme.is_oov() method does not return a boolean value:

is_oov: <bound method LatticeNode.is_oov of <sudachipy.latticenode.LatticeNode object at 0x157d72dd8>>

However, I confirmed that this bug is fixed in version 0.4.2, so I'm going to use the is_oov() method. Thank you!

fullflu commented 4 years ago

By the way, is there any method to detect unknown words without tokenization? Just like this: is_oov('ラッスンゴレライ') returns True and is_oov('説明') returns False

sorami commented 4 years ago

However, I confirmed that this bug was fixed in the version 0.4.2.

Oh yeah, I just remembered that there was a bug and it was fixed in v0.4.1; Release Bug fix · WorksApplications/SudachiPy (#106).

sorami commented 4 years ago

By the way, is there any method to detect unknown words without tokenization?

Umm, there is no such method that works WITHOUT tokenization.


You can do something like this if you are okay with doing the tokenization.

def contains_oov(text):
    morpheme_list = tokenizer_obj.tokenize(text)
    return any(m.is_oov() for m in morpheme_list)

>>> contains_oov("ラッスンゴレライ")
True

>>> contains_oov("説明")
False

You can also look up the dictionary yourself to see whether a word exists in it. However, the result probably will not be useful on its own, without tokenization.

For example, you can do something like this;

>>>  [tokenizer_obj._lexicon.get_word_info(word_id).__dict__ 
         for (word_id, length) 
         in tokenizer_obj._lexicon.lookup("説明".encode("utf-8"), 0)]

[{'surface': '説',
  'head_word_length': 3,
  'pos_id': 3,
  'normalized_form': '説',
  'dictionary_form_word_id': -1,
  'dictionary_form': '説',
  'reading_form': 'セツ',
  'a_unit_split': [],
  'b_unit_split': [],
  'word_structure': []},
 {'surface': '説明',
  'head_word_length': 6,
  'pos_id': 13,
  'normalized_form': '説明',
  'dictionary_form_word_id': -1,
  'dictionary_form': '説明',
  'reading_form': 'セツメイ',
  'a_unit_split': [],
  'b_unit_split': [],
  'word_structure': []}]

The dictionary does "common prefix search" for the text 説明.

If we do the same for ラッスンゴレライ, it still returns something, because the prefix ラ is in the dictionary.

>>> [tokenizer_obj._lexicon.get_word_info(word_id).__dict__ 
        for (word_id, length) 
        in tokenizer_obj._lexicon.lookup("ラッスンゴレライ".encode("utf-8"), 0)]

[{'surface': 'ラ',
  'head_word_length': 3,
  'pos_id': 66,
  'normalized_form': '等',
  'dictionary_form_word_id': -1,
  'dictionary_form': 'ラ',
  'reading_form': 'ラ',
  'a_unit_split': [],
  'b_unit_split': [],
  'word_structure': []}]

(Well, you can check if any of the entries returned by the dictionary has length exactly the same as your input, to see if it's a "prefix" or a "whole text" ...)
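That length check can be written as a small helper over the `(word_id, length)` pairs that `lookup` yields. A sketch only: `covers_whole_input` is a hypothetical name, and it assumes `length` is the end offset in UTF-8 bytes, as the `head_word_length` values above suggest.

```python
def covers_whole_input(matches, text):
    """True when any common-prefix match spans the entire UTF-8 input."""
    n = len(text.encode("utf-8"))
    return any(length == n for _word_id, length in matches)

# 説明 is 6 bytes in UTF-8, so a match of length 6 covers the whole input.
print(covers_whole_input([(684900, 6)], "説明"))  # True
# ラッスンゴレライ is 24 bytes; only the 3-byte prefix ラ matched.
print(covers_whole_input([(0, 3)], "ラッスンゴレライ"))  # False
```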

If you want to know whether a given text is an "unknown word" in the intuitive sense, you will want to tokenize the text and analyze the result.

sorami commented 4 years ago

Okay, there's an exact_match_search method in the Double Array Trie implementation we use (rixwew/darts-clone-python) for the lexicon (dictionary).

If you really want to check if a particular word is in the dictionary or not, you can do something like this;

>>> tokenizer_obj._lexicon.lexicons[0].trie.exact_match_search("説明".encode("utf-8"))
(12650852, 6)

>>> tokenizer_obj._lexicon.lexicons[0].trie.exact_match_search("ラッスンゴレライ".encode("utf-8"))
(-1, 0)

The exact_match_search method will return (-1, 0) (word ID = -1, length = 0) if it cannot find the exact match in the dictionary.
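Interpreting that return value can be wrapped in a tiny helper (a sketch; `found_in_trie` is an illustrative name, and the `(-1, 0)` sentinel is as described above):

```python
def found_in_trie(result):
    """darts-clone's exact_match_search returns (word_id, length);
    (-1, 0) means no exact match exists in the trie."""
    word_id, _length = result
    return word_id != -1

print(found_in_trie((12650852, 6)))  # True  -> 説明 is in the dictionary
print(found_in_trie((-1, 0)))        # False -> ラッスンゴレライ is not
```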

fullflu commented 4 years ago

The exact_match_search method will return (-1, 0) (word ID = -1, length = 0) if it cannot find the exact match in the dictionary.

Awesome!! That method will also be helpful to me. I would be happy if you could add a description of how to detect unknown words to the documentation in the future.

Thank you!