Ali-H-Vahid / mm4w2v

Matching meaning using word2vec

Building translation Evidence #1

Open Ali-H-Vahid opened 9 years ago

Ali-H-Vahid commented 9 years ago
  1. Download the Europarl corpus for the EN-FR language pair.
  2. Run the word2vec (w2v) training algorithm on both sides:
    • -size 200 -window 10 for FR
    • -size 800 -window 10 for EN
  3. Remove stop words from the EN side using the Terrier stop word list.
  4. Extract the 5,000 most frequent words from the EN side.
  5. Translate these 5,000 most frequent words into FR using a commercial translator API.
  6. Extract the vectors of these 5,000 words in both languages.
  7. Learn the translation matrix following the description at http://clic.cimec.unitn.it/~georgiana.dinu/down/ (a sketch of the underlying idea follows this list).
  8. Derive word-to-word translation models in both directions (FR-EN and EN-FR) on the Europarl corpus using the GIZA++ toolkit, with 5 HMM iterations, followed by 10 IBM Model 1 iterations, and ending with 5 IBM Model 4 iterations. Before training the system, it is necessary to remove accents from the French side and to eliminate sentence pairs with a token ratio either smaller than 0.2 or larger than 8.
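Step 7 learns a linear map W that sends an EN vector x close to the vector z of its FR translation, fitted on the 5,000 translated word pairs. A minimal sketch of that idea using ordinary least squares; the array and file names below are illustrative assumptions, and the actual tooling from the linked page may differ:

  import numpy as np

  # Assumed inputs: row i of X_en is the EN vector of dictionary entry i,
  # row i of Z_fr is the vector of its FR translation.
  # With the sizes above, X_en has shape (5000, 800) and Z_fr (5000, 200).
  X_en = np.load("en_vectors_5k.npy")   # illustrative file names
  Z_fr = np.load("fr_vectors_5k.npy")

  # Learn W minimising ||X_en @ W - Z_fr||^2 by ordinary least squares.
  W, _, _, _ = np.linalg.lstsq(X_en, Z_fr, rcond=None)

  def translate_vector(x_en):
      """Map an EN vector into the FR embedding space."""
      return x_en @ W

  # The FR translation of an EN word can then be taken as the FR vocabulary
  # item whose vector is most cosine-similar to translate_vector(x_en).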
scortes-cngl commented 9 years ago

Step 1

The Europarl corpus was downloaded from http://www.statmt.org/europarl/v7/fr-en.tgz.

The English and French parts of the bilingual corpus were tokenised using the Moses Sample Tokeniser Version 1.1 by Pidong Wang (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl) with the language parameters set to en and fr respectively and the flag -no-escape activated.

The tokenised parts were truecased using the Moses Truecaser (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/recaser/).
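A minimal sketch of how these two preprocessing passes could be driven from Python; the Moses script location and the corpus file names are assumptions for illustration, not necessarily the ones used:

  import subprocess

  MOSES = "/path/to/mosesdecoder/scripts"   # assumed location of the Moses scripts

  for lang in ("en", "fr"):
      raw = f"europarl-v7.fr-en.{lang}"     # illustrative file names
      tok = f"europarl.tok.{lang}"
      true_out = f"europarl.true.{lang}"

      # Tokenise with the Moses tokeniser: language-specific rules, no XML escaping.
      with open(raw) as fin, open(tok, "w") as fout:
          subprocess.run(["perl", f"{MOSES}/tokenizer/tokenizer.perl",
                          "-l", lang, "-no-escape"],
                         stdin=fin, stdout=fout, check=True)

      # Train a truecasing model on the tokenised text, then apply it.
      subprocess.run(["perl", f"{MOSES}/recaser/train-truecaser.perl",
                      "--model", f"truecase-model.{lang}", "--corpus", tok],
                     check=True)
      with open(tok) as fin, open(true_out, "w") as fout:
          subprocess.run(["perl", f"{MOSES}/recaser/truecase.perl",
                          "--model", f"truecase-model.{lang}"],
                         stdin=fin, stdout=fout, check=True)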

The results of this step can be found on demo-cngl:/home/scortes/projects/mm4w2v/data/europarl/*.true.

1.5 hours were invested.

scortes-cngl commented 9 years ago

Step 2

The C implementation of word2vec available at http://word2vec.googlecode.com/svn/trunk/ was used to compute the models. Two models were trained, one per language side, with the following parameters.

For the EN model (-size 800):

  -cbow 1
  -size 800
  -window 10
  -negative 25
  -hs 0
  -sample 1e-4
  -threads 16
  -binary 1
  -iter 15

For the FR model (-size 200):

  -cbow 1
  -size 200
  -window 10
  -negative 25
  -hs 0
  -sample 1e-4
  -threads 16
  -binary 1
  -iter 15

The results of this step can be found on demo-cngl:/home/scortes/projects/mm4w2v/data/europarl/*.w2v_model.bin.
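For the later steps that read individual word vectors out of these models, a minimal sketch of loading the -binary 1 output, assuming the gensim library is available; the file name is illustrative:

  from gensim.models import KeyedVectors

  # Load a word2vec model saved in binary format (-binary 1); illustrative file name.
  vectors = KeyedVectors.load_word2vec_format("en.w2v_model.bin", binary=True)

  # Look up the vector of a single word, if it is in the vocabulary.
  if "parliament" in vectors:
      print(vectors["parliament"].shape)   # (800,) for the EN model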

scortes-cngl commented 9 years ago

Step 3

The list of stop words was taken from the Terrier IR Platform version 4.0 (terrier-4.0/share/stopword-list.txt). The script used to remove the stop words is in the repository, at https://github.com/Ali-H-Vahid/mm4w2v/blob/master/scripts/remove_stopwords.py. The option -p, which also removes non-alphanumeric tokens, was activated, as was the option -l, which lowercases both words and stop words before comparing them.
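A minimal sketch of the kind of filtering the script performs with both options enabled; this is an illustration of the logic (assuming one whitespace-tokenised sentence per line), not the repository script itself:

  import sys

  def remove_stopwords(corpus_path, stopword_path, out_path):
      # The Terrier list has one stop word per line; compare in lowercase (-l).
      with open(stopword_path) as f:
          stopwords = {line.strip().lower() for line in f if line.strip()}

      with open(corpus_path) as fin, open(out_path, "w") as fout:
          for line in fin:
              kept = [tok for tok in line.split()
                      if tok.lower() not in stopwords   # -l: lowercase before comparing
                      and tok.isalnum()]                # -p: drop non-alphanumeric tokens
              fout.write(" ".join(kept) + "\n")

  if __name__ == "__main__":
      remove_stopwords(sys.argv[1], sys.argv[2], sys.argv[3])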

The results of this step can be found on demo-cngl:/home/scortes/projects/mm4w2v/data/europarl/*.without_stopwords.

scortes-cngl commented 9 years ago

Step 4

The results of this step can be found on demo-cngl:/home/scortes/projects/mm4w2v/data/europarl/*.first_5k.
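For reference, a minimal sketch of how the 5,000 most frequent words can be extracted from the stop-word-filtered EN side; the input and output file names are illustrative:

  from collections import Counter

  counts = Counter()
  with open("en.without_stopwords") as f:     # illustrative input name
      for line in f:
          counts.update(line.split())

  with open("en.first_5k", "w") as out:       # illustrative output name
      for word, _ in counts.most_common(5000):
          out.write(word + "\n")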

scortes-cngl commented 9 years ago

Step 5

The results of this step can be found on demo-cngl:/home/scortes/projects/mm4w2v/data/europarl/*.bing_translated_to_fr.