facebookresearch / UnsupervisedMT

Phrase-Based & Neural Unsupervised Machine Translation

Reproducing PBSMT German-English #64

mayhewsw opened this issue 5 years ago (status: Open)

mayhewsw commented 5 years ago

Previously I had an issue with capitalization, but that was resolved (thanks!).

That said, I'm still not able to get 15.63 BLEU on de-en newstest2016. My best score is 10.9.

Test data:

SRC_TEST=$PARA_PATH/dev/newstest2016-deen-src.de
TGT_TEST=$PARA_PATH/dev/newstest2016-deen-ref.en
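
(For reference, this is roughly how I decode the test set and score it; a sketch with placeholder Moses paths and hypothesis file name, not my exact commands.)

# Decode the German test set with the trained model, then score against the English reference.
# $MOSES_PATH and hyp.en are placeholders for the local Moses install and hypothesis file.
$MOSES_PATH/bin/moses -f moses_train_de-en/model/moses.ini -threads 48 < $SRC_TEST > hyp.en
$MOSES_PATH/scripts/generic/multi-bleu.perl $TGT_TEST < hyp.en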

These are my parameters (defaults from run.sh):

Defined parameters (per moses.ini or switch):
        config: /home/mayhew/IdeaProjects/UnsupervisedMT/PBSMT/moses_train_de-en/model/moses.ini 
        distortion-limit: 6 
        feature: UnknownWordPenalty WordPenalty PhrasePenalty PhraseDictionaryMemory name=TranslationModel0 num-features=2 path=/home/mayhew/IdeaProjects/UnsupervisedMT/PBSMT/moses_train_de-en/model/phrase-table.gz input-factor=0 output-factor=0 Distortion KENLM name=LM0 factor=0 path=/home/mayhew/IdeaProjects/UnsupervisedMT/PBSMT/data/en.lm.blm order=5 
        input-factors: 0 
        mapping: 0 T 0 
        threads: 48 
        weight: UnknownWordPenalty0= 1 WordPenalty0= -1 PhrasePenalty0= 0.2 TranslationModel0= 0.2 0.2 Distortion0= 0.3 LM0= 0.5 
line=UnknownWordPenalty
FeatureFunction: UnknownWordPenalty0 start: 0 end: 0
line=WordPenalty
FeatureFunction: WordPenalty0 start: 1 end: 1
line=PhrasePenalty
FeatureFunction: PhrasePenalty0 start: 2 end: 2
line=PhraseDictionaryMemory name=TranslationModel0 num-features=2 path=/home/mayhew/IdeaProjects/UnsupervisedMT/PBSMT/moses_train_de-en/model/phrase-table.gz input-factor=0 output-factor=0
FeatureFunction: TranslationModel0 start: 3 end: 4
line=Distortion
FeatureFunction: Distortion0 start: 5 end: 5
line=KENLM name=LM0 factor=0 path=/home/mayhew/IdeaProjects/UnsupervisedMT/PBSMT/data/en.lm.blm order=5

Is there anything I'm missing? Happy to provide more info if it helps.

glample commented 5 years ago

Hi,

Hmm, can you reproduce the results on en-fr? It's possible that the new embeddings are not as good as the previous ones; I'll see if I can find the old ones again.

mayhewsw commented 5 years ago

Trying fr-en as we speak. This should get 17.50, just from running the run.sh script, right?

glample commented 5 years ago

No, because the script uses much less data than in the paper (you can increase the amount of monolingual data used to train the language model, which will significantly improve performance), and the new fastText embeddings are different and I have not tested them. I was just hoping the new embeddings would be better.
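
Increasing the LM data is just a matter of feeding more monolingual English text to KenLM; roughly like this (a sketch with placeholder paths, not the exact run.sh commands):

# Train a larger 5-gram KenLM on more monolingual English text, then binarize it.
# $KENLM_PATH and mono.en.all.tok are placeholders for the local KenLM build and tokenized corpus.
$KENLM_PATH/build/bin/lmplz -o 5 < mono.en.all.tok > en.lm.arpa
$KENLM_PATH/build/bin/build_binary en.lm.arpa en.lm.blm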

glample commented 5 years ago

Also, if your goal is MT in general, I would suggest looking at https://github.com/facebookresearch/XLM. The new repo is an NMT model only, but with cross-lingual language model pretraining. It no longer uses PBSMT and works much better (> 33 BLEU in the end on en-fr).

mayhewsw commented 5 years ago

Thanks for the pointer. I want to use it for a different cross-lingual task, so having the hard alignments from Moses is important.

OK, larger LM. I'll try that too.

mayhewsw commented 5 years ago

I got 11.04 for fr-en. I will try with a larger language model.

mayhewsw commented 5 years ago

I tried a larger LM for de-en. The original LM was trained on 10M lines of text; I trained one on 100M lines, and the performance went up only slightly, to 11.11.

mayhewsw commented 5 years ago

Could it be the seed dictionary for the alignments? I had been using "identical_char", but I've seen much better results (13 BLEU) when I use the "default" option, which uses a small dictionary. Maybe with the whole dictionary it gets to 15.
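
Concretely, the change is just the --dico_train argument passed to MUSE's supervised.py; a sketch with placeholder embedding paths, not my exact command:

# Align the de/en embeddings with MUSE using the default seed dictionary
# instead of identical character strings. Embedding paths are placeholders.
python supervised.py --src_lang de --tgt_lang en \
    --src_emb data/wiki.de.vec --tgt_emb data/wiki.en.vec \
    --n_refinement 5 --dico_train default    # previously: --dico_train identical_char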

glample commented 5 years ago

Interesting that you get such a difference with the seed dictionary over identical_char. For En-De, identical_char should return a dictionary almost as good as (if not better than) the seed one. Are you also using the iterative refinement procedure? What accuracy do you get on De-En / En-De word translation with the seed vs. identical_char? It should be very similar.

mayhewsw commented 5 years ago

All my experiments are de-en.

For the identical_char dictionary, I did all 5 refinements, and the output of the best iteration (5) is:

Starting iteration 5...
Building the train dictionary ...
New train dictionary of 3844 pairs.
Found 2814 pairs of words in the dictionary (1489 unique). 13 other pairs contained at least one unknown word (0 in lang1, 13 in lang2)
1489 source words - nn - Precision at k = 1: 62.525185
1489 source words - nn - Precision at k = 5: 77.098724
1489 source words - nn - Precision at k = 10: 81.329752
Found 2814 pairs of words in the dictionary (1489 unique). 13 other pairs contained at least one unknown word (0 in lang1, 13 in lang2)
1489 source words - csls_knn_10 - Precision at k = 1: 67.159167
1489 source words - csls_knn_10 - Precision at k = 5: 80.926797
1489 source words - csls_knn_10 - Precision at k = 10: 84.687710

For the seed dictionary ("default"), I only did one iteration, and the output of the best iteration (0) is:

Found 2814 pairs of words in the dictionary (1489 unique). 13 other pairs contained at least one unknown word (0 in lang1, 13 in lang2)
1489 source words - nn - Precision at k = 1: 66.756212
1489 source words - nn - Precision at k = 5: 81.061115
1489 source words - nn - Precision at k = 10: 85.963734
Found 2814 pairs of words in the dictionary (1489 unique). 13 other pairs contained at least one unknown word (0 in lang1, 13 in lang2)
1489 source words - csls_knn_10 - Precision at k = 1: 71.591672
1489 source words - csls_knn_10 - Precision at k = 5: 83.948959
1489 source words - csls_knn_10 - Precision at k = 10: 87.642713

FWIW, the supervised numbers are pretty close to those reported in Table 1 of Conneau et al. (2017).

mayhewsw commented 5 years ago

Regarding the vectors: the paper talks about n-grams in the vectors (with examples in Table 1). How did you get these? The Common Crawl vectors that the scripts in this repo download don't have n-gram phrases in them.

glample commented 5 years ago

We used Mikolov's word2phrase code to generate n-grams, then applied fastText to the generated corpus of n-grams: https://github.com/tmikolov/word2vec
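
Roughly like this (a sketch; the thresholds, paths, and dimension are illustrative, not the exact values we used):

# Merge frequent word pairs into single phrase tokens (run twice to get longer n-grams),
# then train fastText skipgram embeddings on the phrase-merged corpus.
./word2phrase -train corpus.en.tok -output corpus.en.phr1 -threshold 200
./word2phrase -train corpus.en.phr1 -output corpus.en.phr2 -threshold 100
./fasttext skipgram -input corpus.en.phr2 -output vectors.en.phrases -dim 300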

mayhewsw commented 5 years ago

Got it. Thanks. Any chance these vectors are available for download?

glample commented 5 years ago

Unfortunately I don't have anything left of this data, sorry about that :/

kellymarchisio commented 5 years ago

@mayhewsw - did you get your PBSMT to work? How did you do it?

kellymarchisio commented 5 years ago

I achieved 13.5 BLEU on En-Fr PBSMT, but only after changing to --dico_train default, so this uses a seed dictionary and isn't fully unsupervised. @glample, you mentioned the fastText embeddings had been updated and you'd hoped they were better; do you have the old ones? I'll test this hypothesis.