Ali-H-Vahid / mm4w2v

Matching meaning using word2vec

Building translation Evidence #1

Open Ali-H-Vahid opened 9 years ago

Ali-H-Vahid commented 9 years ago
  1. Download the Europarl corpus for the EN-FR language pair.
  2. Run the word2vec (w2v) training algorithm on both sides:
    • -size 200 -window 10 for FR
    • -size 800 -window 10 for EN
  3. Remove stop words from the EN side using the Terrier stop word list.
  4. Extract the 5,000 most frequent words from the EN side.
  5. Translate these 5,000 most frequent words into FR using a commercial translator API.
  6. Extract the vectors of these 5,000 words in both languages.
  7. Learn the translation matrix following the description at http://clic.cimec.unitn.it/~georgiana.dinu/down/ (a sketch of the underlying idea follows this list).
  8. Derive word-to-word translation models in both directions (FR-EN and EN-FR) on the Europarl corpus using the GIZA++ toolkit, with 5 HMM iterations, followed by 10 IBM Model 1 iterations, and ending with 5 IBM Model 4 iterations. Before training the system, it is necessary to remove accents from the French side and to eliminate sentence pairs with a token ratio either smaller than 0.2 or larger than 8.
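Step 7 learns a linear map W that sends an EN vector x close to the vector z of its FR translation, fitted on the 5,000 translated word pairs. A minimal sketch of that idea using ordinary least squares; the array and file names below are illustrative assumptions, and the actual tooling from the linked page may differ:

  import numpy as np

  # Assumed inputs: row i of X_en is the EN vector of dictionary entry i,
  # row i of Z_fr is the vector of its FR translation.
  # With the sizes above, X_en has shape (5000, 800) and Z_fr (5000, 200).
  X_en = np.load("en_vectors_5k.npy")   # illustrative file names
  Z_fr = np.load("fr_vectors_5k.npy")

  # Learn W minimising ||X_en @ W - Z_fr||^2 by ordinary least squares.
  W, _, _, _ = np.linalg.lstsq(X_en, Z_fr, rcond=None)

  def translate_vector(x_en):
      """Map an EN vector into the FR embedding space."""
      return x_en @ W

  # The FR translation of an EN word can then be taken as the FR vocabulary
  # item whose vector is most cosine-similar to translate_vector(x_en).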
scortes-cngl commented 9 years ago

Step 1

The Europarl corpus was downloaded from http://www.statmt.org/europarl/v7/fr-en.tgz.

The English and French parts of the bilingual corpus were tokenised using the Moses Sample Tokeniser Version 1.1 by Pidong Wang (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl) with the language parameters set to en and fr respectively and the flag -no-escape activated.

The tokenised parts were truecased using the Moses Truecaser (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/recaser/).
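A minimal sketch of how these two preprocessing passes could be driven from Python; the Moses script location and the corpus file names are assumptions for illustration, not necessarily the ones used:

  import subprocess

  MOSES = "/path/to/mosesdecoder/scripts"   # assumed location of the Moses scripts

  for lang in ("en", "fr"):
      raw = f"europarl-v7.fr-en.{lang}"     # illustrative file names
      tok = f"europarl.tok.{lang}"
      true_out = f"europarl.true.{lang}"

      # Tokenise with the Moses tokeniser: language-specific rules, no XML escaping.
      with open(raw) as fin, open(tok, "w") as fout:
          subprocess.run(["perl", f"{MOSES}/tokenizer/tokenizer.perl",
                          "-l", lang, "-no-escape"],
                         stdin=fin, stdout=fout, check=True)

      # Train a truecasing model on the tokenised text, then apply it.
      subprocess.run(["perl", f"{MOSES}/recaser/train-truecaser.perl",
                      "--model", f"truecase-model.{lang}", "--corpus", tok],
                     check=True)
      with open(tok) as fin, open(true_out, "w") as fout:
          subprocess.run(["perl", f"{MOSES}/recaser/truecase.perl",
                          "--model", f"truecase-model.{lang}"],
                         stdin=fin, stdout=fout, check=True)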

The results of this step can be found on demo-cngl:/home/scortes/projects/mm4w2v/data/europarl/*.true.

1.5 hours were invested.

scortes-cngl commented 9 years ago

Step 2

The C implementation of word2vec available at http://word2vec.googlecode.com/svn/trunk/ was used to compute the models. Two models were trained, one per language side, with the following parameters.

For the EN model (-size 800):

  -cbow 1
  -size 800
  -window 10
  -negative 25
  -hs 0
  -sample 1e-4
  -threads 16
  -binary 1
  -iter 15

For the FR model (-size 200):

  -cbow 1
  -size 200
  -window 10
  -negative 25
  -hs 0
  -sample 1e-4
  -threads 16
  -binary 1
  -iter 15

The results of this step can be found on demo-cngl:/home/scortes/projects/mm4w2v/data/europarl/*.w2v_model.bin.
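For the later steps that read individual word vectors out of these models, a minimal sketch of loading the -binary 1 output, assuming the gensim library is available; the file name is illustrative:

  from gensim.models import KeyedVectors

  # Load a word2vec model saved in binary format (-binary 1); illustrative file name.
  vectors = KeyedVectors.load_word2vec_format("en.w2v_model.bin", binary=True)

  # Look up the vector of a single word, if it is in the vocabulary.
  if "parliament" in vectors:
      print(vectors["parliament"].shape)   # (800,) for the EN model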

scortes-cngl commented 9 years ago

Step 3

The list of stop words was taken from the Terrier IR Platform version 4.0 (terrier-4.0/share/stopword-list.txt). The script used to remove the stop words is in the repository, at https://github.com/Ali-H-Vahid/mm4w2v/blob/master/scripts/remove_stopwords.py. The option -p, which also removes non-alphanumeric tokens, was activated, as was the option -l, which lowercases both words and stop words before comparing them.
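A minimal sketch of the kind of filtering the script performs with both options enabled; this is an illustration of the logic (assuming one whitespace-tokenised sentence per line), not the repository script itself:

  import sys

  def remove_stopwords(corpus_path, stopword_path, out_path):
      # The Terrier list has one stop word per line; compare in lowercase (-l).
      with open(stopword_path) as f:
          stopwords = {line.strip().lower() for line in f if line.strip()}

      with open(corpus_path) as fin, open(out_path, "w") as fout:
          for line in fin:
              kept = [tok for tok in line.split()
                      if tok.lower() not in stopwords   # -l: lowercase before comparing
                      and tok.isalnum()]                # -p: drop non-alphanumeric tokens
              fout.write(" ".join(kept) + "\n")

  if __name__ == "__main__":
      remove_stopwords(sys.argv[1], sys.argv[2], sys.argv[3])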

The results of this step can be found on demo-cngl:/home/scortes/projects/mm4w2v/data/europarl/*.without_stopwords.

scortes-cngl commented 9 years ago

Step 4

The results of this step can be found on demo-cngl:/home/scortes/projects/mm4w2v/data/europarl/*.first_5k.
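For reference, a minimal sketch of how the 5,000 most frequent words can be extracted from the stop-word-filtered EN side; the input and output file names are illustrative:

  from collections import Counter

  counts = Counter()
  with open("en.without_stopwords") as f:     # illustrative input name
      for line in f:
          counts.update(line.split())

  with open("en.first_5k", "w") as out:       # illustrative output name
      for word, _ in counts.most_common(5000):
          out.write(word + "\n")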

scortes-cngl commented 9 years ago

Step 5

The results of this step can be found on demo-cngl:/home/scortes/projects/mm4w2v/data/europarl/*.bing_translated_to_fr.