How to deal with casing?

mayhewsw commented 5 years ago

I'm trying to run the PBSMT model. As far as I can tell, run.sh doesn't deal with casing properly. The induced phrase-table is all lower-case, but the test text is never lower-cased, which means that many (upper-cased) words are untranslated.

For example, I'm trying to reproduce the German-English results. "Flüchtlinge" is left untranslated, even though "flüchtlinge" is present in the phrase-table. This results in a BLEU of 8.70 on newstest16. Lowercasing both the training and test data only reaches 13.33 BLEU. I also tried retraining the LM on lowercased English, which improved this to 14.52.
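To make the mismatch concrete, here is a minimal sketch of a case-sensitive lookup against a lowercased table. The table entries and the `translate_word` helper are hypothetical and only illustrate the failure mode; this is not how Moses stores or queries its phrase-table.

```python
# Toy phrase table keyed by lowercased source words (hypothetical entries).
phrase_table = {"flüchtlinge": "refugees", "kommen": "arrive"}

def translate_word(word, table):
    """Return the table entry for `word`, or the word itself if it is missing."""
    return table.get(word, word)

test_tokens = ["Flüchtlinge", "kommen"]

# Case-sensitive lookup: the capitalized noun is not found and passes through.
print([translate_word(t, phrase_table) for t in test_tokens])

# Lowercasing the input first makes the lookup succeed.
print([translate_word(t.lower(), phrase_table) for t in test_tokens])
```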

Is there an additional step you did to get around this? Thanks!

glample commented 5 years ago

Hi,

We never lowercase; we use truecasing instead (see http://www.statmt.org/moses/?n=Moses.SupportTools#ntoc11). When you truecase a sentence, you only modify the first letter of the first word of the sentence.

For instance, if the sentence is "This is the cat of Peter .", the truecaser will output "this is the cat of Peter .". "Peter" remains unchanged because it is usually capitalized in the corpus, while "This" is lowercased because "this" usually appears in lowercase. In the same way, "Peter has a cat ." is left unaffected by the truecaser.
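The behaviour described above can be sketched in a few lines of Python. This is a simplified illustration and not the Moses truecaser itself: the `train_truecaser` and `truecase` names, the tiny corpus, and the choice to ignore sentence-initial tokens when counting are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def train_truecaser(sentences):
    """Learn the most frequent surface form of each word.

    Sentence-initial tokens are skipped when counting, since their
    capitalization says little about the word's usual casing.
    """
    counts = defaultdict(Counter)
    for sent in sentences:
        for tok in sent.split()[1:]:
            counts[tok.lower()][tok] += 1
    return {low: forms.most_common(1)[0][0] for low, forms in counts.items()}

def truecase(sentence, model):
    """Replace only the first token with its most frequent casing, if known."""
    toks = sentence.split()
    if toks:
        toks[0] = model.get(toks[0].lower(), toks[0])
    return " ".join(toks)

# Hypothetical tokenized training corpus.
corpus = [
    "He said this is fine .",
    "I saw Peter yesterday .",
    "We think this works .",
    "Then Peter left .",
]
model = train_truecaser(corpus)

print(truecase("This is the cat of Peter .", model))  # this is the cat of Peter .
print(truecase("Peter has a cat .", model))           # Peter has a cat .
```

In practice you would use the Moses scripts linked above rather than this sketch; it is only meant to show why "This" changes while "Peter" does not.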

So you indeed need to have uppercase in your phrase-table, which you will have if you trained your embeddings on truecased (or simply tokenized) data.

Unfortunately, the directory where we uploaded our embeddings was recently removed, so I had to link to other embeddings: https://github.com/facebookresearch/UnsupervisedMT/commit/55169b48f6bc19b9ee5c13dbc67b11f191e6c540

But it turns out these are lowercased. I have updated the links to point to new embeddings that are not lowercased, which should fix your problem: https://github.com/facebookresearch/UnsupervisedMT/commit/b7e81ea457998560d336d12a613c3d9cf61efd6b
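If you want to check a set of embeddings yourself, a rough sketch like the following can tell you whether the vocabulary contains any uppercase entries. It assumes a fastText-style text format (an optional "count dim" header line, then one "word v1 v2 ..." line per entry); the `uppercase_fraction` helper and the sample data are hypothetical.

```python
def uppercase_fraction(lines, max_words=100000):
    """Fraction of vocabulary entries that contain an uppercase character."""
    total = upper = 0
    for i, line in enumerate(lines):
        parts = line.rstrip().split(" ")
        # Skip an optional "<vocab_size> <dim>" header on the first line.
        if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        if not parts[0]:
            continue
        total += 1
        upper += any(ch.isupper() for ch in parts[0])
        if total >= max_words:
            break
    return upper / total if total else 0.0

# Tiny in-memory stand-ins for embedding files (hypothetical data).
cased = ["3 2\n", "Peter 0.1 0.2\n", "this 0.3 0.4\n", "cat 0.5 0.6\n"]
lowercased = ["2 2\n", "peter 0.1 0.2\n", "this 0.3 0.4\n"]

print(uppercase_fraction(cased))
print(uppercase_fraction(lowercased))
```

For a real file, pass an open file handle instead of a list, for example `open(path, encoding="utf-8")`. A fraction of exactly 0.0 strongly suggests the embeddings were trained on lowercased text.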

Thanks for noticing this!

mayhewsw commented 5 years ago

Thanks a lot! I'll try it out.

facebookresearch / UnsupervisedMT

How to deal with casing? #62