Hi,
Hmm, can you reproduce the results on en-fr? It's possible that the new embeddings are not as good as the previous ones; I'll see if I can find the old ones again.
Trying fr-en as we speak. This should get 17.50 BLEU just from running the run.sh script, right?
No, because the script uses much less data than in the paper (you can increase the amount of monolingual data used to train the language model, which will significantly increase performance), and the new fastText embeddings are different and I have not tested them. I was just hoping the new embeddings would be better.
Also, if your goal is MT in general, I would suggest looking at https://github.com/facebookresearch/XLM. The new repo is an NMT model only, but with cross-lingual language model pretraining. It does not use PBSMT anymore and works much better (> 33 BLEU in the end on en-fr).
Thanks for the pointer. I want to use it for a different cross-lingual task, so having the hard alignments from Moses is important.
OK, larger LM. I'll try that too.
I got 11.04 BLEU for fr-en. I will try with a larger language model.
I tried a larger LM for de-en. The original LM was trained on 10M lines of text. I trained one on 100M lines, and the performance went up only slightly, to 11.11 BLEU.
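For reference, the larger LM was built the usual KenLM way; a minimal sketch, with placeholder file names:

```bash
# Minimal sketch of training a larger KenLM model for Moses
# (file names are placeholders; lmplz / build_binary ship with KenLM).
head -n 100000000 mono.tok > mono.100M.tok   # 100M lines instead of 10M

# Estimate a 5-gram LM (ARPA format) on the larger corpus
lmplz -o 5 < mono.100M.tok > lm.arpa

# Binarize it so Moses can load it quickly
build_binary lm.arpa lm.blm
```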
Could it be the seed dictionary for the alignments? I had been using "identical_char", but I've seen much better results (13 BLEU) when I use the "default" option, which uses a small dictionary. Maybe with the whole dictionary it gets to 15.
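For concreteness, the two alignment setups correspond roughly to MUSE calls like these (embedding paths are placeholders; flags as in facebookresearch/MUSE's supervised.py):

```bash
# identical_char: seed dictionary built from identically spelled words, 5 refinement iterations
python supervised.py --src_lang de --tgt_lang en \
    --src_emb emb.de.vec --tgt_emb emb.en.vec \
    --dico_train identical_char --n_refinement 5

# default: seed dictionary taken from the provided small de-en training dictionary
# (I only ran one refinement iteration here)
python supervised.py --src_lang de --tgt_lang en \
    --src_emb emb.de.vec --tgt_emb emb.en.vec \
    --dico_train default --n_refinement 1
```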
Interesting that you get such a difference with the seed dictionary over identical_char. For En-De, identical_char should return almost as good a dictionary as the seed (if not a better one). Are you also using the iterative refinement procedure? What accuracy do you get on De-En / En-De word translation with seed vs. identical_char? It should be very similar.
All my experiments are de-en.
For the identical_char dictionary, I did all 5 refinements, and the output of the best iteration (5) is:
Starting iteration 5...
Building the train dictionary ...
New train dictionary of 3844 pairs.
Found 2814 pairs of words in the dictionary (1489 unique). 13 other pairs contained at least one unknown word (0 in lang1, 13 in lang2)
1489 source words - nn - Precision at k = 1: 62.525185
1489 source words - nn - Precision at k = 5: 77.098724
1489 source words - nn - Precision at k = 10: 81.329752
Found 2814 pairs of words in the dictionary (1489 unique). 13 other pairs contained at least one unknown word (0 in lang1, 13 in lang2)
1489 source words - csls_knn_10 - Precision at k = 1: 67.159167
1489 source words - csls_knn_10 - Precision at k = 5: 80.926797
1489 source words - csls_knn_10 - Precision at k = 10: 84.687710
For the seed dictionary ("default"), I only did one iteration, and the output of the best iteration (0) is:
Found 2814 pairs of words in the dictionary (1489 unique). 13 other pairs contained at least one unknown word (0 in lang1, 13 in lang2)
1489 source words - nn - Precision at k = 1: 66.756212
1489 source words - nn - Precision at k = 5: 81.061115
1489 source words - nn - Precision at k = 10: 85.963734
Found 2814 pairs of words in the dictionary (1489 unique). 13 other pairs contained at least one unknown word (0 in lang1, 13 in lang2)
1489 source words - csls_knn_10 - Precision at k = 1: 71.591672
1489 source words - csls_knn_10 - Precision at k = 5: 83.948959
1489 source words - csls_knn_10 - Precision at k = 10: 87.642713
FWIW, the supervised numbers are pretty close to those in Table 1 of Conneau et al. (2017).
Regarding the vectors: the paper talks about n-grams in the vectors (with examples in Table 1). How did you get these? The Common Crawl vectors that the scripts in this repo download don't have n-gram phrases in them.
We used Mikolov's word2phrase code to generate n-grams, then applied fastText to the generated corpus of n-grams: https://github.com/tmikolov/word2vec
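Roughly the same pipeline as in the word2vec demo scripts; a sketch with placeholder file names (the thresholds below are the demo defaults, not necessarily the ones used for the paper):

```bash
# Join frequent bigrams, then longer n-grams, into single tokens with word2phrase
./word2phrase -train corpus.tok -output corpus.phrase1 -threshold 200 -debug 2
./word2phrase -train corpus.phrase1 -output corpus.phrase2 -threshold 100 -debug 2

# Train fastText on the phrase-joined corpus
./fasttext skipgram -input corpus.phrase2 -output vectors.phrase
```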
Got it. Thanks. Any chance these vectors are available for download?
Unfortunately I don't have anything left of this data, sorry about that :/
@mayhewsw - did you get your PBSMT to work? How'd you do it?
I achieved 13.5 BLEU on en-fr PBSMT, but only after changing to --dico_train default, so this uses a seed dictionary and isn't fully unsupervised. @glample, you mentioned the fastText embeddings had been updated and you'd hoped they were better - do you have the old ones? I'll test this hypothesis.
Previously I had an issue with capitalization, but that was resolved (thanks!).
That said, I'm still not able to get 15.63 BLEU on de-en newstest2016. My best score is 10.9.
Test data: newstest2016 (de-en), as above.
My parameters are the defaults from run.sh.
Is there anything I'm missing? Happy to provide more info if it helps.
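In case the scoring matters: the BLEU numbers above are plain Moses multi-bleu.perl scores on tokenized output against tokenized newstest2016, roughly like this (paths are placeholders):

```bash
# Score tokenized hypotheses against the tokenized newstest2016 reference
perl multi-bleu.perl newstest2016.deen.en.tok < hyp.deen.en.tok
```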