bab2min / kiwipiepy

Python API for Kiwi

issue when tokenizing `기차` #177

Closed ssherko closed 3 weeks ago

ssherko commented 3 weeks ago

👋 seeing a strange behavior when tokenizing 기차:

Python 3.8.16 (default, May 23 2023, 15:12:05)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.12.2 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from kiwipiepy import Kiwi

In [2]: Kiwi().tokenize('기차')
Out[2]:
[Token(form='기차', tag='VA', start=0, len=2),
 Token(form='어', tag='EF', start=1, len=1)] # ??

I don't speak Korean, so please bear with me: is Token(form='어', ...) expected in the list?

I'm running this on macOS Sonoma 14.7 with Python 3.8.16, and the versions of the relevant packages are:

Name: kiwipiepy
Version: 0.16.2
----
Name: kiwipiepy_model
Version: 0.16.0

appreciate the help!

bab2min commented 3 weeks ago

Hi @ssherko, in short, if you intended 기차 as a noun ("a train" in English), this is an analysis error. Because the adjective stem 기차- plus the ending -어/아 can be contracted to 기차, the form 기차 is ambiguous. The VA tag in your result indicates an adjective, and the EF tag indicates a final ending (one that finishes a sentence). So the Kiwi model analyzed the input 기차 incorrectly as a sentence consisting of a single adjective plus a final ending ("It's surprisingly good": dictionary link), not as a single noun ("a train": dictionary link). If you input "기차를 타다" (to take a train), "기차" is correctly analyzed as a noun (the NNG tag indicates a general noun), as follows:

[Token(form='기차', tag='NNG', start=0, len=2),
 Token(form='를', tag='JKO', start=2, len=1),
 Token(form='타', tag='VV', start=4, len=1),
 Token(form='다', tag='EC', start=5, len=1)]
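As a side note, Kiwi.analyze() can return several ranked candidates, which makes this kind of ambiguity visible. A minimal sketch (the exact candidates and their scores depend on the model version, so no output is shown here):

>>> from kiwipiepy import Kiwi
>>> kiwi = Kiwi()
>>> # top_n=3 requests the three highest-scoring analyses;
>>> # each candidate is a (token list, score) pair.
>>> for tokens, score in kiwi.analyze('기차', top_n=3):
...     print(score, [(t.form, t.tag) for t in tokens])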

This inconsistent result comes from ambiguity, which is especially severe for inputs consisting of only one word. For one-word input, the Kiwi model tends to analyze the input as a complete sentence rather than as a single noun. So if your inputs consist of only one noun, I recommend skipping tokenization.
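As a rough sketch of that workaround (the helper name tokenize_if_multiword is just illustrative, not part of kiwipiepy):

>>> from kiwipiepy import Kiwi
>>> def tokenize_if_multiword(kiwi, text):
...     # For one-word input, Kiwi may prefer a full-sentence reading,
...     # so return the text unanalyzed in that case.
...     if len(text.split()) < 2:
...         return [text]
...     return [t.form for t in kiwi.tokenize(text)]
...
>>> tokenize_if_multiword(Kiwi(), '기차')
['기차']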

ssherko commented 3 weeks ago

@bab2min, thanks a lot for the detailed (and quick!) explanation. I understand that Korean is a whitespace-delimited language, so I'm assuming a simple .split() is enough to get an idea of the number of words in a sentence, correct? If that's the case, then I'll do as you say and avoid tokenization unless the input is a multi-word sentence.

bab2min commented 3 weeks ago

@ssherko Yes, we Koreans usually just count words in a sentence based on whitespace. We use morphological analyzers (e.g. Kiwi) to obtain the lemmas of words; you can think of Korean morphological analysis as a process similar to stemming or lemmatization in English NLP. So if you just want to count words, use str.split(): it's simple and much faster. However, this only works for well-spaced text. Korean text on the web often has spacing omitted or placed incorrectly, so str.split() may give inaccurate results in those cases. If you have to deal with messy web text, I recommend using Kiwi.space() to correct the spacing first and then str.split() to count words:

>>> kiwi = Kiwi()
>>> kiwi.space("띄어쓰기가아주엉망인텍스트") # (Textwithverymessyspacing)
'띄어쓰기가 아주 엉망인 텍스트' # (Text with very messy spacing)
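Putting the two steps together (the count of 4 follows directly from the corrected string above):

>>> corrected = kiwi.space("띄어쓰기가아주엉망인텍스트")
>>> len(corrected.split())  # whitespace-separated word count after correction
4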

ssherko commented 3 weeks ago

amazing. thank you so much. closing the issue then.