question about tokenization result for พรุ่งนี้

KoichiYasuoka / spaCy-Thai

Dependency parser on Thai language

MIT License


question about tokenization result for พรุ่งนี้ #3

Closed luc-vocab closed 2 years ago

luc-vocab commented 2 years ago

Hi, disclaimer: I'm a complete novice in Thai, and a complete novice with spaCy.

import spacy_thai
nlp=spacy_thai.load()
doc=nlp("ผมจะไปประเทศไทยพรุ่งนี้ครับ")
for t in doc:
    print(f'[{t}]')
    print("\t".join([str(t.i+1),t.orth_,t.lemma_,t.pos_,t.tag_,"_",str(0 if t.head==t else t.head.i+1),t.dep_,"_","_" if t.whitespace_ else "SpaceAfter=No"]))

I see that one of the results is: พรรุ่งนนี้

In comparison, this page: https://rkcosmos.github.io/deepcut/ gives me: พรุ่ง | นี้

This result makes more sense to me, as I can put it into Google Translate and use a text-to-speech engine on it.

Can you help me understand why spacy-thai returns พรรุ่งนนี้? Does it include several variants of those letters? Again, apologies, as I'm just starting to learn Thai.

KoichiYasuoka commented 2 years ago

spaCy uses PyThaiNLP for Thai tokenization:

>>> from pythainlp import word_tokenize,pos_tag
>>> nlp=lambda txt:pos_tag(word_tokenize(txt))
>>> doc=nlp("ผมจะไปประเทศไทยพรุ่งนี้ครับ")
>>> print(doc)
[('ผม', 'PPRS'), ('จะ', 'XVBM'), ('ไป', 'VACT'), ('ประเทศ', 'NCMN'), ('ไทย', 'NPRP'), ('พรุ่งนี้', 'DDAC'), ('ครับ', 'NCMN')]

PyThaiNLP treats 'พรุ่งนี้' as a single word, and so does spaCy-Thai.

KoichiYasuoka commented 2 years ago

I know that thai_segmenter tokenizes it differently:

>>> from thai_segmenter.tasks import tokenize_and_postag,get_segmenter
>>> nlp=lambda txt:tokenize_and_postag(txt,get_segmenter())
>>> doc=nlp("ผมจะไปประเทศไทยพรุ่งนี้ครับ")
>>> print(doc.pos)
[('ผม', 'PPRS'), ('จะ', 'XVBM'), ('ไป', 'VACT'), ('ประเทศไทย', 'NPRP'), ('พรุ่ง', 'NCMN'), ('นี้', 'DDAC'), ('ครับ', 'NPRP')]

The thai_segmenter result seems to be what you want. However, spaCy does not use thai_segmenter as its tokenizer.

luc-vocab commented 2 years ago

Thank you @KoichiYasuoka for your response. When I run your code, I get the following output: [('ผม', 'PPRS'), ('จะ', 'XVBM'), ('ไป', 'VACT'), ('ประเทศ', 'NCMN'), ('ไทย', 'NPRP'), ('พรรุ่งนนี้', 'DDAC'), ('ครับ', 'NCMN')]

The issue I'm trying to solve is why I get พรรุ่งนนี้ whereas you get พรุ่งนี้. I'm a complete novice in Thai and I don't know the alphabet. Maybe พรรุ่งนนี้ is some kind of expanded form?

Thank you for any suggestions you may have.

luc-vocab commented 2 years ago

I believe I have a bit more clarity on what's happening. The doubled letters only appear when the characters are printed on my terminal. When I redirect the output to a file, I do get พรุ่งนี้ as expected. I will update you if I have any more findings. Thank you for your work.
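
One way to check that the doubling is only a display artifact, without relying on how the terminal draws Thai combining marks, is to list the code points of the token. The following is a minimal sketch using only the standard library; the helper names `describe_codepoints` and `count_base_letters` are made up for illustration and are not part of spaCy-Thai or PyThaiNLP.

```python
import unicodedata


def describe_codepoints(text):
    """Return one (code point, general category, Unicode name) tuple per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "UNKNOWN"))
        for ch in text
    ]


def count_base_letters(text):
    """Count characters that are not nonspacing marks (general category Mn)."""
    return sum(1 for ch in text if unicodedata.category(ch) != "Mn")


if __name__ == "__main__":
    # The token as returned by the tokenizer; in practice this would be t.orth_.
    token = "พรุ่งนี้"
    for codepoint, category, name in describe_codepoints(token):
        print(codepoint, category, name)
    # Total characters, then characters that are not vowel or tone marks.
    print(len(token), count_base_letters(token))
```

For พรุ่งนี้ this reports 8 code points, 4 of which are base consonants and 4 of which are nonspacing vowel or tone marks. If the tokenizer were really returning doubled letters, the base-letter count would be higher than 4.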

KoichiYasuoka commented 2 years ago

Well, @lucwastiaux, I recommend using Google Colaboratory (template here) with the Chrome browser to avoid the terminal problem. You may re-open this issue if you find any other problems.