Closed: luc-vocab closed this issue 2 years ago
spaCy uses pyThaiNLP for Thai tokenization.
>>> from pythainlp import word_tokenize,pos_tag
>>> nlp=lambda txt:pos_tag(word_tokenize(txt))
>>> doc=nlp("ผมจะไปประเทศไทยพรุ่งนี้ครับ")
>>> print(doc)
[('ผม', 'PPRS'), ('จะ', 'XVBM'), ('ไป', 'VACT'), ('ประเทศ', 'NCMN'), ('ไทย', 'NPRP'), ('พรุ่งนี้', 'DDAC'), ('ครับ', 'NCMN')]
pyThaiNLP treats 'พรุ่งนี้' as a single word, and so does spaCy-Thai.
thai_segmenter, on the other hand, tokenizes the sentence differently:
>>> from thai_segmenter.tasks import tokenize_and_postag,get_segmenter
>>> nlp=lambda txt:tokenize_and_postag(txt,get_segmenter())
>>> doc=nlp("ผมจะไปประเทศไทยพรุ่งนี้ครับ")
>>> print(doc.pos)
[('ผม', 'PPRS'), ('จะ', 'XVBM'), ('ไป', 'VACT'), ('ประเทศไทย', 'NPRP'), ('พรุ่ง', 'NCMN'), ('นี้', 'DDAC'), ('ครับ', 'NPRP')]
The thai_segmenter result seems to be what you want; however, spaCy does not use thai_segmenter as its tokenizer.
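To see exactly where the two tokenizations above disagree, one option is to compare the character offsets at which each one cuts the sentence. The following is only a minimal sketch: the helper names `boundary_offsets` and `compare_segmentations` are hypothetical (they are not part of pyThaiNLP, thai_segmenter, or spaCy), and the token lists are copied by hand from the two outputs shown above.

```python
from itertools import accumulate


def boundary_offsets(tokens):
    """Return the set of code-point offsets at which a tokenization cuts the text."""
    # Running total of token lengths; the last value is the end of the text,
    # which every tokenization shares, so it is dropped.
    ends = list(accumulate(len(tok) for tok in tokens))
    return set(ends[:-1])


def compare_segmentations(tokens_a, tokens_b):
    """Return (cuts only in A, cuts only in B) as sorted lists of offsets."""
    if "".join(tokens_a) != "".join(tokens_b):
        raise ValueError("the two tokenizations do not cover the same text")
    cuts_a = boundary_offsets(tokens_a)
    cuts_b = boundary_offsets(tokens_b)
    return sorted(cuts_a - cuts_b), sorted(cuts_b - cuts_a)


# Token lists copied from the pythainlp and thai_segmenter outputs above.
pythainlp_tokens = ["ผม", "จะ", "ไป", "ประเทศ", "ไทย", "พรุ่งนี้", "ครับ"]
thai_segmenter_tokens = ["ผม", "จะ", "ไป", "ประเทศไทย", "พรุ่ง", "นี้", "ครับ"]

only_pythainlp, only_thai_segmenter = compare_segmentations(
    pythainlp_tokens, thai_segmenter_tokens
)
print("cuts made only by pythainlp:", only_pythainlp)
print("cuts made only by thai_segmenter:", only_thai_segmenter)
```

Offsets are counted in Unicode code points, so Thai vowel signs and tone marks each count as one position. The sketch says nothing about which segmentation is better; it only locates the differences.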
Thank you @KoichiYasuoka for your response. When I run your code, I get the following output:
[('ผม', 'PPRS'), ('จะ', 'XVBM'), ('ไป', 'VACT'), ('ประเทศ', 'NCMN'), ('ไทย', 'NPRP'), ('พรรุ่งนนี้', 'DDAC'), ('ครับ', 'NCMN')]
The issue I'm trying to solve is why I get พรรุ่งนนี้ whereas you get พรุ่งนี้. I'm a complete novice in Thai and I don't know the alphabet. Maybe พรรุ่งนนี้ is some kind of expanded form?
Thank you for any suggestions you may have.
I believe I have a bit more clarity on what's happening. It seems to occur only when I look at the characters as printed on my terminal. When I redirect the output to a file, I get พรุ่งนี้ as expected. I will update you if I have any more findings. Thank you for your work.
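One way to check whether a token really contains extra letters, or whether the terminal is only drawing it oddly, is to list the Unicode code points of the string instead of relying on how it is rendered. This is a minimal sketch using only the standard library; `describe_codepoints` is a hypothetical helper name, and the token is typed in by hand rather than taken from a live tokenizer run.

```python
import unicodedata


def describe_codepoints(text):
    """Return one (code point, general category, Unicode name) tuple per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# The token as it appears when the output is redirected to a file.
token = "พรุ่งนี้"

for code_point, category, name in describe_codepoints(token):
    # Category "Mn" marks a non-spacing mark (a vowel sign or tone mark drawn
    # above or below a consonant); "Lo" is an ordinary letter.
    print(code_point, category, name)

# ascii() shows the escaped form, which does not depend on terminal font support.
print(len(token), ascii(token))
```

If the listing shows each consonant only once, the doubled letters seen on screen would come from how the terminal renders the combining marks, not from the tokenizer output itself.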
Well, @lucwastiaux, I recommend using Google Colaboratory (template here) with the Chrome browser to avoid the terminal problem. You may re-open this issue if you find any other problems.
Hi, disclaimer: I'm a complete novice in Thai, and a complete novice with spaCy.
I see that one of the results is: พรรุ่งนนี้
In comparison, this page: https://rkcosmos.github.io/deepcut/ gives me: พรุ่ง | นี้
This result makes more sense to me, as I can put it into Google Translate and use a text-to-speech engine on it.
Can you help me understand why spacy-thai returns พรรุ่งนนี้? Does it include several variants of those letters? Again, apologies, as I'm just starting to learn Thai.