Properly deal with eos and UNK.

EdinburghNLP / nematus

Open-Source Neural Machine Translation in Tensorflow

BSD 3-Clause "New" or "Revised" License

797 stars 269 forks source link

Properly deal with eos and UNK. #55

Closed Avmb closed 7 years ago

Avmb commented 7 years ago

Reverts commit a6eeda7ed6037285627ca6e468a7ef8ab467034f (which reverted 09eb3594fb395d43dd3f6c2a4dec85af683cbb30) and adds angular parentheses around eos and unk. Nematus now actually uses unk, and if it is not set to 1 it crashes.

rsennrich commented 7 years ago

at which point is the crash? I'd rather keep Nematus working with the vocabularies that people are using, rather than requiring them to rebuild their vocabulary files.

Avmb commented 7 years ago

if UNK is a word in the corpus then when the dicionary is built it get assigned an id different than 1. Then, when you run Nematus and there is an actually unknown word, it gets assigned id 1, which is incorrect, and it seems to cause crashes downstream (Rachel had crashes in embedding lookup during training, but I suppose it may also happen during translation).

I think that the change should be backward compatible with previous dictionaries.

Avmb commented 7 years ago

I tried the tests scripts with the old dictionaries while introducing new words in the datasets and they all seem to work (test_translate.py yields a FAIL because it produces an incorrect translation when it sees an unknown word in the source sentence, but it does not crash).

rsennrich commented 7 years ago

I would consider mapping unknown words to 1 (which is hard-coded in data_iterator.py) correct behavior, and it's actually quite good that the word "UNK" is currently treated differently and can have a different ID - we don't want magic words that the system cannot translate properly. I think there may be a problem if n_words_src or n_words is None, and the network vocabulary size is calculated from the JSON vocabulary size (which is smaller than expected if UNK is overwritten).

My suggested fix is to redefine this behavior to automatically set the network vocabulary size to the highest index in the vocabulary file + 1.

Avmb commented 7 years ago

Done. Shall we keep the commit for the angular parentheses around <eos> and <UNK>? This way, if eos or UNK appear as words in the corpus, they will be treated as regular words without any interference with the special symbols.

rsennrich commented 7 years ago

hm, the proposed commit gets the max key, not the max index. I've pushed d9255b, which should fix this. To avoid interference with eos and UNK, we could just leave them out of the dictionary created in build_dictionary.py.