Some extra information, if needed.
These are the Python packages I'm considering using:
I don't know Vietnamese, but it seems that this language often uses spaces even between syllables. For example, here is a snippet from pyvi's documentation:
>>> from pyvi.pyvi import ViTokenizer, ViPosTagger
>>> for token in ViTokenizer.tokenize(u"Trường đại học Bách Khoa Hà Nội").split():
... print(token)
...
Trường
đại_học
Bách_Khoa
Hà_Nội
It would seem that "đại học" should be treated as just one word. I looked for this "word" in the pre-trained vector text file for Vietnamese, but all I could find was "đại" and "học", separately. Perhaps that's not a big deal, but it feels like it would help to preprocess the Vietnamese Wikipedia like this, don't you think?
Edit: I used Google Translate, and it told me that this phrase means "Hanoi University of Technology". But if I translate each word independently, I get: "bare long learn white lock huh pot". ;-) By grouping the words as pyvi proposes, I get: "bare University encyclopedia Hanoi", which is not perfect, but muuuuch better.
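Here is a minimal sketch of the preprocessing I have in mind, assuming a plain-text dump of the Vietnamese Wikipedia (the file names are placeholders):

from pyvi.pyvi import ViTokenizer

with open("viwiki.txt", encoding="utf-8") as src, \
     open("viwiki.segmented.txt", "w", encoding="utf-8") as dst:
    for line in src:
        # ViTokenizer.tokenize joins the syllables of multi-syllable words
        # with underscores, e.g. "đại học" -> "đại_học".
        dst.write(ViTokenizer.tokenize(line.strip()) + "\n")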
Hi @ageron
We applied word segmentation for the following languages:
- Chinese: we used the Stanford Word Segmenter (with the CTB standard);
- Japanese: we used Kuromoji;
- Bhutanese, Khmer, Lao, Thai and Tibetan: we used the Word Instance Break Iterator from Java ICU.
We did not do anything for languages where syllables are separated by white spaces, such as Vietnamese. It is possible to obtain vectors for over-segmented words by using:
$ cat queries.txt | ./fasttext print-word-vectors model.bin
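For reference, a rough Python equivalent of the command above, assuming the official fastText Python bindings and a local copy of the Vietnamese model (the file name is a placeholder):

import fasttext

model = fasttext.load_model("wiki.vi.bin")
# Thanks to the character n-gram (subword) information, a vector can be
# produced even for tokens that never appeared in the training data,
# such as the segmented form "đại_học".
print(model.get_word_vector("đại_học")[:5])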
It would be interesting to investigate whether using word segmentation (as you suggested) leads to better representations or not.
Thanks a lot for the fast and clear response. If I find some time I might indeed try comparing sentence classification in Vietnamese with and without word segmentation.
I think I'll submit a PR to add this information to the documentation, so I'll leave this issue open so I can reference it in the PR. Feel free to close it if you prefer.
Thanks for the pull request @ageron. We'll treat it separately from this issue.
My pleasure. :)
@ageron @cpuhrsch I was trying Unicode transliteration using the junidecode Java library - https://github.com/gcardone/junidecode
So, given Hindi text like:
हे हाहाहा ा हाहाहा हो ा हाहाहा हाहाहा हे .जोर क्लॉ जोर क्लॉ ारो जोर जोर क्लॉ जोर से हे छुटटे जानले ...नै नै नै कोनो थामानी कोथायी
you get the transliterated text:
he haahaahaa aa haahaahaa ho aa haahaahaa haahaahaa he .jore clo jore clo aaro jore jore clo jor se he chuttte jaanle ...nei nei nei kono thaamaanei kothaayy
or Japanese text like:
これ以上の地獄はないだろうと信じたかった
されど人類最悪の日はいつも唐突に
koreYi Shang noDi Yu hanaidaroutoXin zitakatuta saredoRen Lei Zui E noRi haitumoTang Tu ni
etc., and the same for Chinese, Korean, etc.
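For what it's worth, the same kind of transliteration can be reproduced from Python with the unidecode package (a Python analogue of junidecode; swapping it in here is my assumption):

from unidecode import unidecode

print(unidecode("これ以上の地獄はないだろうと信じたかった"))
# prints an ASCII approximation similar to the junidecode output above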
So, in this case, we could keep the word boundaries for those languages that do not have a default tokenizer. Would this approach work in fastText?
An alternative approach I was considering is a representation based on byte-pair encoding (BPE) - https://github.com/rsennrich/subword-nmt/blob/master/apply_bpe.py - which has recently been used for NMT models, and by FB FairSeq of course.
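As a minimal sketch of the BPE idea with the subword-nmt package (the codes file would have to be learned first with subword-nmt learn-bpe; the file name and the sample sentence are placeholders):

import codecs
from subword_nmt.apply_bpe import BPE

with codecs.open("codes.bpe", encoding="utf-8") as codes:
    bpe = BPE(codes)

# Rare character sequences are split into frequent subword units marked
# with "@@", without relying on any language-specific tokenizer.
print(bpe.process_line("これ以上の地獄はないだろうと信じたかった"))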
@EdouardGrave do you think that kakasi would work well? Also, about the ICU break iterator, was it this one: http://icu-project.org/apiref/icu4j59rc/com/ibm/icu/text/BreakIterator.html?
Regarding my JavaScript wrapper, I have found:
- For Chinese: https://github.com/hermanschaaf/jieba-js
- For Japanese: https://github.com/takuyaa/kuromoji.js
- For all other word breaks: https://github.com/twitter/twitter-cldr-js (which implements the ICU BreakIterator as defined in the Java ICU library).
@EdouardGrave
We applied word segmentation for the following languages:
- Chinese: we used the Stanford Word Segmenter (with the CTB standard);
- Japanese: we used Kuromoji;
- Bhutanese, Khmer, Lao, Thai and Tibetan: we used the Word Instance Break Iterator from Java ICU.
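As a side note, for the ICU-based languages in that list, a rough Python approximation using PyICU might look like the following (this is only an assumed equivalent, since the original pipeline used the Java ICU library, and the Thai sample text is a placeholder):

from icu import BreakIterator, Locale

text = "สวัสดีชาวโลก"
bi = BreakIterator.createWordInstance(Locale("th"))
bi.setText(text)

tokens, start = [], bi.first()
for end in bi:  # iterating yields successive word-boundary offsets
    token = text[start:end].strip()
    if token:
        tokens.append(token)
    start = end
print(tokens)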
What dictionaries did you apply when using these applications?
I think the Japanese vectors used Unidic. I checked this by loading the vectors with Gensim and checking the vocab member like so:
from gensim.models.wrappers import FastText
model = FastText.load_fasttext_format('wiki.ja.bin')
'国会図書館' in model.wv.vocab  # False
len(model.wv.vocab)  # 580000
The number of unique tokens (not counting variations in POS/reading) in the version of Unidic I have handy is 569,738, which basically lines up with the above (ipadic has <400k entries even before making them unique). 国会図書館 is also a single entry in ipadic but not unidic.
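If it helps, a quick way to see that dictionary difference from Python, assuming the fugashi MeCab wrapper with a UniDic-style dictionary such as unidic-lite installed:

from fugashi import Tagger

tagger = Tagger()  # picks up unidic-lite by default when it is installed
print([word.surface for word in tagger("国会図書館")])
# With a UniDic dictionary this should come back as several short units,
# whereas IPAdic keeps 国会図書館 as a single entry.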
Would be nice to have some official confirmation though. The more recent paper Learning Word Vectors for 157 Languages (vectors here) names the tokenizers used but also fails to mention what dictionaries were used...
For people who want to use the fastText pre-trained embeddings, I think it would make sense to have a tokenizer class that wraps all of the tokenizers that were used to process the Wikipedia dumps in a single place.
Namely, such a tokenizer would receive the language and the text, and perform the sentence splitting and tokenization using the same procedure that was used to train the embeddings to which the user wants to map their text.
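A hypothetical sketch of the kind of wrapper class I have in mind (the language-to-tokenizer mapping is my own assumption, not an official fastText component):

class FastTextTokenizer:
    def __init__(self):
        self.handlers = {}

    def register(self, lang, func):
        # func takes a text string and returns a list of tokens.
        self.handlers[lang] = func

    def tokenize(self, lang, text):
        # Fall back to plain whitespace splitting for languages that did
        # not get special treatment when the embeddings were trained.
        handler = self.handlers.get(lang, lambda t: t.split())
        return handler(text)

# Example registrations mirroring the pipeline discussed in this thread:
# tokenizer = FastTextTokenizer()
# tokenizer.register("vi", lambda t: ViTokenizer.tokenize(t).split())
# tokenizer.register("zh", lambda t: list(jieba.cut(t)))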
@davidalbertonogueira I think the best option is to add optional support for BPE, specifically fastBPE - see https://github.com/facebookresearch/fastText/issues/725 - for several reasons.
@loretoparisi I will check that library, but from the looks of it, it appears to be a collection of decoupled shell scripts, and therefore targets offline pre-processing of datasets.
I'm not interested in training my own fastText embeddings. My goal was to have something that would receive a pair (language, text) and tokenize it according to the same process used for the fastText embeddings, in a run-time/online fashion.
This assumes that I already have fastText embeddings loaded into memory, and that I can handle everything else once I have the correct tokenization.
I would like to use fastText for languages that don't have clear word boundaries, such as Chinese, Japanese, Thai or Vietnamese. I have found various tools to partition text from these languages into separate words, but I would like to use the same preprocessing steps that were used to generate the pre-trained word vectors. Unless I overlooked it, this does not seem to be documented.
Could you please provide these details (and perhaps add it to the documentation)?
Thanks a lot!