facebookresearch / fastText

Library for fast text representation and classification.
https://fasttext.cc/
MIT License

What preprocessing steps were applied to the Wikipedias to train the word vectors for languages without clear word boundaries? #224

Closed ageron closed 7 years ago

ageron commented 7 years ago

I would like to use fastText for languages that don't have clear word boundaries, such as Chinese, Japanese, Thai or Vietnamese. I have found various software packages to segment text from these languages into separate words, but I would like to use the same preprocessing steps that were used to generate the pre-trained word vectors. Unless I overlooked it, this does not seem to be documented.

Could you please provide these details (and perhaps add it to the documentation)?

Thanks a lot!

ageron commented 7 years ago

Some extra information, if needed.

These are the Python packages I'm considering using:

I don't know Vietnamese, but it seems that this language often uses spaces even between syllables. For example, trying an example from pyvi's documentation:

>>> from pyvi.pyvi import ViTokenizer, ViPosTagger
>>> for token in ViTokenizer.tokenize(u"Trường đại học Bách Khoa Hà Nội").split():
...     print(token)
... 
Trường
đại_học
Bách_Khoa
Hà_Nội

It would seem that "đại học" should be treated as a single word. I looked for this "word" in the pre-trained vector text file for Vietnamese, but all I could find were "đại" and "học", separately. Perhaps that's not a big deal, but it feels like it would help to preprocess the Vietnamese Wikipedia like this, don't you think?

Edit: I used Google Translate, and it told me that this phrase means "Hanoi University of Technology". But if I translate each word independently, I get: "bare long learn white lock huh pot". ;-) By grouping the words as pyvi proposes, I get: "bare University encyclopedia Hanoi", which is not perfect, but muuuuch better.
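For what it's worth, here is a minimal sketch of that preprocessing step, assuming the text of the Vietnamese Wikipedia has already been extracted to one document per line in viwiki.txt (the file names are just illustrative):

from pyvi.pyvi import ViTokenizer  # same import path as in the snippet above; newer pyvi versions use "from pyvi import ViTokenizer"

with open("viwiki.txt", encoding="utf-8") as fin, \
     open("viwiki.seg.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        # ViTokenizer joins multi-syllable words with "_", e.g. "đại học" -> "đại_học"
        fout.write(ViTokenizer.tokenize(line.strip()) + "\n")

Vectors could then be trained on the segmented file, e.g. with ./fasttext skipgram -input viwiki.seg.txt -output viwiki-seg.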

EdouardGrave commented 7 years ago

Hi @ageron

We applied word segmentation for the following languages:

  • Chinese: we used the Stanford Word Segmenter (with the CTB standard);
  • Japanese: we used Kuromoji;
  • Bhutanese, Khmer, Lao, Thai and Tibetan: we used the Word Instance Break Iterator from Java ICU.
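For illustration only, here is a rough Python equivalent of that last item, using the PyICU bindings rather than the Java ICU library that was actually used:

from icu import BreakIterator, Locale

def icu_word_tokenize(text, locale_code):
    # Word-instance break iterator, analogous to the Java ICU one mentioned above
    bi = BreakIterator.createWordInstance(Locale(locale_code))
    bi.setText(text)
    tokens, start = [], bi.first()
    for end in bi:              # PyICU yields the successive boundary offsets
        piece = text[start:end]
        start = end
        if piece.strip():       # drop whitespace-only pieces
            tokens.append(piece)
    return tokens

print(" ".join(icu_word_tokenize("ตัวอย่างข้อความภาษาไทย", "th")))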

We did not do anything for languages where syllables are separated by white spaces, such as Vietnamese. It is possible to obtain vectors for over-segmented words by using:

$ cat queries.txt | ./fasttext print-word-vectors model.bin
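Roughly the same lookup can be done through the fasttext Python bindings (shown only as an illustration, assuming a recent enough version of the package); like print-word-vectors, it also returns vectors for out-of-vocabulary words by summing their character n-grams:

import fasttext

model = fasttext.load_model("model.bin")
vec = model.get_word_vector("đại")   # also works for words missing from the vocabulary,
print(vec)                           # since the vector is built from character n-grams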

It would be interesting to investigate whether using word segmentation (as you suggested) leads to better representations or not.

ageron commented 7 years ago

Thanks a lot for the fast and clear response. If I find some time I might indeed try comparing sentence classification in Vietnamese with and without word segmentation.
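A rough sketch of such a comparison with the fasttext Python bindings, assuming label-prefixed (__label__...) train/validation files have been prepared twice, once as-is ("raw") and once segmented with ViTokenizer as above ("seg"); the file names and hyperparameters are illustrative:

import fasttext

for variant in ("raw", "seg"):
    model = fasttext.train_supervised(input="train.%s.txt" % variant, epoch=25, wordNgrams=2)
    n, precision, recall = model.test("valid.%s.txt" % variant)
    print(variant, "P@1=%.3f  R@1=%.3f" % (precision, recall))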

I think I'll submit a PR to add this information to the documentation, so I'll leave this issue open so I can reference it in the PR. Feel free to close it if you prefer.

cpuhrsch commented 7 years ago

Thanks for the pull request @ageron. We'll treat it separately from this issue.

ageron commented 7 years ago

My pleasure. :)

loretoparisi commented 7 years ago

@ageron @cpuhrsch I was trying to use Unicode transliteration with the junidecode Java library - https://github.com/gcardone/junidecode

So, given Hindi text like:

हे हाहाहा ा हाहाहा हो ा हाहाहा हाहाहा हे .जोर क्लॉ जोर क्लॉ ारो जोर जोर क्लॉ जोर से हे छुटटे जानले ...नै नै नै कोनो थामानी कोथायी

you will get the transliterated text

he haahaahaa aa haahaahaa ho aa haahaahaa haahaahaa he .jore clo jore clo aaro jore jore clo jor se he chuttte jaanle ...nei nei nei kono thaamaanei kothaayy

or, for Japanese:

これ以上の地獄はないだろうと信じたかった
されど人類最悪の日はいつも唐突に
koreYi Shang noDi Yu hanaidaroutoXin zitakatuta saredoRen Lei Zui E noRi haitumoTang Tu ni

and the same for Chinese, Korean, etc.

So, in this case, we should keep the word boundaries for those languages that do not come with a default tokenizer. Will this approach work in fastText?

An alternative approach I was considering is a subword representation such as BPE (byte-pair encoding) - https://github.com/rsennrich/subword-nmt/blob/master/apply_bpe.py - which has recently been used for NMT models, and by FB's FairSeq of course.
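For reference, a minimal sketch of that BPE route with the subword-nmt package linked above (the file names and number of merge operations are illustrative; note that without pre-segmentation each unspaced line is treated as one "word" and split purely into learned subword units):

import codecs
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# learn the merge operations once, offline, on a raw corpus
with codecs.open("corpus.txt", encoding="utf-8") as fin, \
     codecs.open("bpe.codes", "w", encoding="utf-8") as fout:
    learn_bpe(fin, fout, num_symbols=10000)

# apply them at run time
with codecs.open("bpe.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)
print(bpe.process_line("これ以上の地獄はないだろうと信じたかった"))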

loretoparisi commented 7 years ago

@EdouardGrave do you think that kakasi could work fine? Also, regarding the ICU break iterator, was it this one: http://icu-project.org/apiref/icu4j59rc/com/ibm/icu/text/BreakIterator.html?

Regarding my JavaScript wrapper, I have found:

  • For Chinese, https://github.com/hermanschaaf/jieba-js
  • For Japanese, https://github.com/takuyaa/kuromoji.js
  • For all other word breaks, https://github.com/twitter/twitter-cldr-js (which implements the ICU BreakIterator as it is defined in the Java ICU library).

massongit commented 6 years ago

@EdouardGrave

We applied word segmentation for the following languages:

  • Chinese: we used the Stanford Word Segmenter (with the CTB standard);
  • Japanese: we used Kuromoji;
  • Bhutanese, Khmer, Lao, Thai and Tibetan: we used the Word Instance Break Iterator from Java ICU.

What dictionaries did you use with these tools?

polm commented 6 years ago

I think the Japanese vectors used Unidic. I checked this by loading the vectors with Gensim and checking the vocab member like so:

from gensim.models.wrappers import FastText
model = FastText.load_fasttext_format('wiki.ja.bin')
'国会図書館' in model.wv.vocab  # False
len(model.wv.vocab)  # 580000

The number of unique tokens (not counting variations in POS/reading) in the version of Unidic I have handy is 569738 which basically lines up with the above (ipadic has <400k entries even before making them unique). 国会図書館 is also a single entry in ipadic but not unidic.

Would be nice to have some official confirmation though. The more recent paper Learning Word Vectors for 157 Languages (vectors here) names the tokenizers used but also fails to mention what dictionaries were used...

davidalbertonogueira commented 5 years ago

For people who want to use the fastText pre-trained embeddings, I think it would make sense to have a tokenizer class that wraps, in a single place, all of the tokenizers that were used to process the Wikipedias.

Namely, such a tokenizer would receive the language and the text and perform sentence splitting and tokenization, using the same procedure that was used to train the embeddings onto which the user wants to map their text.
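Something along these lines (a minimal sketch only; the per-language tokenizers below are illustrative choices available from Python, not necessarily the exact tools used to build the official embeddings):

import jieba                        # Chinese word segmentation (illustrative choice)
from pyvi.pyvi import ViTokenizer   # Vietnamese, as discussed earlier in this thread

class FastTextTokenizer:
    """Dispatch (language, text) to a tokenizer matching the embeddings' preprocessing."""

    def tokenize(self, lang, text):
        if lang == "zh":
            return " ".join(jieba.cut(text))
        if lang == "vi":
            return ViTokenizer.tokenize(text)
        # Thai, Lao, Khmer, Tibetan, Dzongkha would go through an ICU
        # word-break iterator (see the icu_word_tokenize sketch above).
        return text                 # default: whitespace tokenization is kept as-is

tokenizer = FastTextTokenizer()
print(tokenizer.tokenize("vi", "Trường đại học Bách Khoa Hà Nội"))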

loretoparisi commented 5 years ago

@davidalbertonogueira I think the best option is to add optional support for BPE, specifically fastBPE - see https://github.com/facebookresearch/fastText/issues/725 - for several reasons.

davidalbertonogueira commented 5 years ago

@loretoparisi I will check that library, but from the looks of it, it appears to be a collection of decoupled shell scripts, and is therefore targeted at offline pre-processing of datasets.

I'm not interested in training my own fastText embeddings. My goal was to have something that would receive a pair (language, text) and tokenize it according to the same process used for the fastText embeddings, in a run-time/online fashion.

This assumes that I already have fastText embeddings loaded into memory, and that I would handle everything else once I have the correct tokenization.