mayhewsw closed this issue 5 years ago
Hi,
We never use lowercasing; we use truecasing. See http://www.statmt.org/moses/?n=Moses.SupportTools#ntoc11 for details. When you truecase a sentence, you only modify the first letter of the first word of the sentence.
For instance, if the sentence is "This is the cat of Peter ." the truecaser will output "this is the cat of Peter .", i.e. "Peter" remains unchanged since it is usually found with an uppercase in the corpus, but "this" will be lowercased since it usually is lowercased. This way, "Peter has a cat ." will remain unaffected by the truecaser.
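To make the idea concrete, here is a minimal sketch of a truecaser in Python. It is not the Moses implementation: the function names `train_truecaser` and `truecase`, the toy corpus, and the choice to skip sentence-initial tokens while counting are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_truecaser(sentences):
    """Learn the most frequent surface form of each word.

    Assumption of this sketch: sentence-initial tokens are skipped while
    counting, since their capitalization says nothing about the word itself.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for token in tokens[1:]:
            counts[token.lower()][token] += 1
    # Map each lowercased word to its most frequent casing.
    return {low: forms.most_common(1)[0][0] for low, forms in counts.items()}


def truecase(sentence, model):
    """Rewrite only the first token, using the casing learned by the model."""
    tokens = sentence.split()
    if tokens:
        # Unknown words are left as they are.
        tokens[0] = model.get(tokens[0].lower(), tokens[0])
    return " ".join(tokens)


# Toy tokenized corpus (hypothetical data).
corpus = [
    "I saw this cat .",
    "Is this the cat of Peter ?",
    "We know that Peter has a cat .",
]
model = train_truecaser(corpus)

print(truecase("This is the cat of Peter .", model))  # this is the cat of Peter .
print(truecase("Peter has a cat .", model))           # Peter has a cat .
```

In practice you would use the Moses truecasing scripts from the link above rather than something like this; the sketch only shows why "This" changes and "Peter" does not.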
So you indeed need to have uppercase in your phrase-table, which you will have if you trained your embeddings on truecased (or simply tokenized) data.
Unfortunately, the directory where we uploaded our embeddings was recently removed, so I had to link to other embeddings: https://github.com/facebookresearch/UnsupervisedMT/commit/55169b48f6bc19b9ee5c13dbc67b11f191e6c540
But it turns out these are lowercased. I updated the links to point to new embeddings that are not lowercased, which should fix your problem: https://github.com/facebookresearch/UnsupervisedMT/commit/b7e81ea457998560d336d12a613c3d9cf61efd6b
Thanks for noticing this!
Thanks a lot! I'll try it out.
I'm trying to run the PBSMT model. As far as I can tell, run.sh doesn't deal with casing properly. The induced phrase-table is all lower-case, but the test text is never lower-cased, which means that many (upper-cased) words are untranslated.
For example, I'm trying to reproduce the German-English results. "Flüchtlinge" is untranslated, but "flüchtlinge" is present in the phrase-table. This results in a BLEU of 8.70 on newstest16. Lowercasing both the train and test data only gets 13.33 BLEU. I tried retraining the LM on lowercased English, and this improved to 14.52.
Is there an additional step you did to get around this? Thanks!
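For anyone who wants to check whether they are hitting the same casing mismatch, a rough sketch like the following lists test tokens that are missing from the phrase-table vocabulary as written but present once lowercased. The function name `case_mismatch_oov` and the tiny vocabulary are hypothetical; reading the source side of a real phrase-table is left out.

```python
def case_mismatch_oov(tokens, vocab):
    """Return tokens absent from vocab as written but present when lowercased."""
    return [t for t in tokens if t not in vocab and t.lower() in vocab]


# Hypothetical source-side vocabulary of an all-lowercase phrase-table.
vocab = {"flüchtlinge", "die", "kommen"}
test_tokens = "Die Flüchtlinge kommen".split()

print(case_mismatch_oov(test_tokens, vocab))  # ['Die', 'Flüchtlinge']
```

A long list here suggests the phrase-table and the test text were cased differently.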