facebookresearch / fastText

Library for fast text representation and classification.
https://fasttext.cc/
MIT License
25.83k stars, 4.71k forks

Ukrainian tokenization bug: words with internal apostrophe #1265

Open glebm opened 2 years ago

glebm commented 2 years ago

The apostrophe (') is a normal letter in Ukrainian (https://en.wikipedia.org/wiki/Rules_for_using_the_apostrophe_in_the_Ukrainian_language).

Example word: прив'язана

The tokenizer used by fastText appears to split this single word into three tokens: ["прив", "'", "язана"]

whysage commented 2 years ago

Hi @glebm, I can't reproduce this.

test.py

import fasttext
model = fasttext.train_unsupervised('data.txt', model='skipgram', minCount=1)
print(model.words)

data.txt

Перші археоантропи на території сучасної України з'явилися в епоху раннього палеоліту, понад 900—800 тис. років тому. Слово прив'язана - як приклад. 1199 р. Роман Великий об'єднав Галичину і Волинь у єдину Галицько-Волинську державу.

out

Read 0M words
Number of words: 34
Number of labels: 0
Progress: 100.0% words/sec/thread: 69646 lr: 0.000000 avg.loss: 4.149282 ETA: 0h 0m 0s
['</s>', 'Великий', "прив'язана", '-', 'як', 'приклад.', '1199', 'р.', 'Роман', 'Слово', "об'єднав", 'Галичину', 'і', 'Волинь', 'у', 'єдину', 'Галицько-Волинську', 'державу.', 'Перші', 'тому.', 'років', 'тис.', '900—800', 'понад', 'палеоліту,', 'раннього', 'епоху', 'в', "з'явилися", 'України', 'сучасної', 'території', 'на', 'археоантропи']
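For context, the fastText documentation says tokens are split only on a handful of ASCII whitespace bytes (space, tab, vertical tab, carriage return, form feed, null), with newline ending a line. That would be consistent with the output above, where the apostrophe stays inside the word. A rough pure-Python imitation of that rule follows; the name `whitespace_tokenize` is made up for illustration and is not part of fastText.

```python
import re

# Bytes that fastText's docs list as token separators, plus newline,
# which ends a line. This is an imitation, not fastText's actual code.
_SEPARATORS = re.compile(r"[ \t\v\r\f\0\n]+")


def whitespace_tokenize(text):
    """Split text on fastText-style separators, dropping empty tokens."""
    return [tok for tok in _SEPARATORS.split(text) if tok]


print(whitespace_tokenize("Слово прив'язана - як приклад."))
```

Under this rule punctuation is never separated from a word, which is also why tokens such as 'приклад.' and 'палеоліту,' keep their trailing punctuation in the output above.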

glebm commented 2 years ago

Looking at the pretrained Ukrainian embeddings, there are no words with an internal ' in them. Perhaps the published pretrained embeddings were trained with an older or different tokenizer?

wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.uk.300.vec.gz
gunzip cc.uk.300.vec.gz
grep "'" cc.uk.300.vec
' -0.0292 0.0219 0.3533 ...
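The same check can be sketched in Python, which makes it easier to look only for internal apostrophes. This assumes the usual text `.vec` layout (a header line with vocabulary size and dimension, then one word per line followed by its vector). `words_with_apostrophe` is a hypothetical helper name, and the set of apostrophe-like characters is an assumption.

```python
import io

# Characters commonly used as the Ukrainian apostrophe (assumed set):
# ASCII apostrophe, right single quotation mark, modifier letter apostrophe.
APOSTROPHES = ("'", "\u2019", "\u02bc")


def words_with_apostrophe(lines):
    """Yield vocabulary words that contain an apostrophe between other characters."""
    lines = iter(lines)
    next(lines, None)  # skip the "<vocab_size> <dim>" header line
    for line in lines:
        word = line.split(" ", 1)[0]
        # Ignore a leading or trailing apostrophe: only internal ones count.
        if any(ch in word[1:-1] for ch in APOSTROPHES):
            yield word


# Tiny in-memory stand-in for cc.uk.300.vec (values are made up).
sample = io.StringIO("3 2\n' 0.1 0.2\nзявилися 0.3 0.4\nоб'єднав 0.5 0.6\n")
print(list(words_with_apostrophe(sample)))
```

For the real file, pass `open("cc.uk.300.vec", encoding="utf-8")` instead of the in-memory sample.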

whysage commented 2 years ago

As far as I can see, apostrophes are simply omitted:

grep "зобовязання" cc.uk.300.vec

зобовязання ... зобовязаннями ...

grep "з'явилися" cc.uk.300.vec

grep "зявилися" cc.uk.300.vec

зявилися 0.0721 -0.0460 0.0400 -0.0369 ....

I can't think of any Ukrainian words whose meaning differs depending on whether the apostrophe is present.

So maybe you can just remove apostrophes from your text in a preprocessing step.
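
A minimal sketch of that preprocessing step, assuming apostrophes may appear as any of three common code points (the ASCII apostrophe, U+2019, and U+02BC). `strip_apostrophes` is an illustrative name, not a fastText function.

```python
# Passing the characters as the third argument makes str.translate delete them.
# The choice of characters is an assumption about the input text.
_APOSTROPHE_TABLE = str.maketrans("", "", "'\u2019\u02bc")


def strip_apostrophes(text):
    """Remove apostrophes so words match the apostrophe-free pretrained vocabulary."""
    return text.translate(_APOSTROPHE_TABLE)


print(strip_apostrophes("з'явилися прив\u2019язана"))  # зявилися привязана
```

Note that this also removes apostrophes used as single quotation marks, which may or may not matter for your data.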

P.S. #StandWithUkraine

glebm commented 2 years ago

Ah, that makes sense. Do you know where the apostrophe-omitting code is? Also, perhaps this caveat should be documented? Thanks!

StandWithUkraine!

whysage commented 2 years ago

Do you know where the apostrophe-omitting code is?

It may be outside fastText itself.

https://github.com/facebookresearch/fastText/blob/main/docs/crawl-vectors.md

Tokenization

We used the Stanford word segmenter for Chinese, Mecab for Japanese and UETsegmenter for Vietnamese. For languages using the Latin, Cyrillic, Hebrew or Greek scripts, we used the tokenizer from the Europarl preprocessing tools. For the remaining languages, we used the ICU tokenizer.

Also, perhaps this caveat should be documented?

I created a pull request: https://github.com/facebookresearch/fastText/pull/1268. Maybe it will be merged some day.