Avmb closed 7 years ago
At which point does the crash occur? I'd rather keep Nematus working with the vocabularies that people are already using than require them to rebuild their vocabulary files.
If UNK is a word in the corpus, then when the dictionary is built it gets assigned an ID other than 1. Then, when you run Nematus and there is an actually unknown word, it gets assigned ID 1, which is incorrect, and it seems to cause crashes downstream (Rachel had crashes in embedding lookup during training, but I suppose it may also happen during translation).
I think that the change should be backward compatible with previous dictionaries.
I tried the test scripts with the old dictionaries while introducing new words in the datasets, and they all seem to work (test_translate.py yields a FAIL because it produces an incorrect translation when it sees an unknown word in the source sentence, but it does not crash).
I would consider mapping unknown words to 1 (which is hard-coded in data_iterator.py) correct behavior, and it's actually quite good that the word "UNK" is currently treated differently and can have a different ID - we don't want magic words that the system cannot translate properly. I think there may be a problem if n_words_src or n_words is None, and the network vocabulary size is calculated from the JSON vocabulary size (which is smaller than expected if UNK is overwritten).
My suggested fix is to redefine this behavior to automatically set the network vocabulary size to the highest index in the vocabulary file + 1.
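To make the size mismatch concrete, here is a minimal sketch with a toy vocabulary. It is not the actual build_dictionary.py or data_iterator.py code: the names `vocab`, `corpus_words`, `UNK_ID` and `to_ids` are made up for illustration, and the ID scheme (eos = 0, UNK = 1, corpus words from 2) is an assumption based on the discussion above.

```python
# Toy reconstruction of the problem (all names here are hypothetical).
# Assumed ID scheme: eos -> 0, UNK -> 1, corpus words -> 2, 3, ...
vocab = {"eos": 0, "UNK": 1}
corpus_words = ["the", "UNK", "cat"]  # "UNK" also occurs as a corpus word

for i, word in enumerate(corpus_words):
    vocab[word] = i + 2  # overwrites the special "UNK" entry with ID 3

UNK_ID = 1  # unknown words are mapped to 1 regardless of the dictionary


def to_ids(tokens, vocab):
    """Map tokens to IDs, falling back to UNK_ID for out-of-vocabulary words."""
    return [vocab.get(token, UNK_ID) for token in tokens]


ids = to_ids(["the", "dog", "UNK"], vocab)

# No key maps to 1 any more, so the dictionary has one entry fewer
# than the number of distinct IDs that can show up at runtime.
size_from_len = len(vocab)
size_from_max_index = max(vocab.values()) + 1
```

In this toy case `size_from_len` is 4 while the highest ID in use is also 4, so an embedding matrix with 4 rows would be indexed out of range; `size_from_max_index` is 5 and covers every ID.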
Done.
Shall we keep the commit for the angle brackets around <eos> and <UNK>? This way, if eos or UNK appear as words in the corpus, they will be treated as regular words without any interference with the special symbols.
hm, the proposed commit gets the max key, not the max index. I've pushed d9255b, which should fix this. To avoid interference with eos and UNK, we could just leave them out of the dictionary created in build_dictionary.py.
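For anyone reading along, the difference between the max key and the max index is easy to miss in Python, because `max()` on a dict iterates over its keys. A small sketch, assuming the vocabulary is loaded as a plain word-to-ID dict (the helper name `network_vocab_size` and the toy entries are hypothetical, not taken from the commit):

```python
# Toy word -> ID dictionary, as it might look after loading the JSON file.
vocab = {"eos": 0, "UNK": 1, "the": 2, "zebra": 3, "cat": 4}

# max() over a dict compares the keys (the words), not the IDs.
max_key = max(vocab)

# What we want is the highest ID stored in the values.
max_index = max(vocab.values())


def network_vocab_size(vocab):
    """Return the highest index in the vocabulary plus one."""
    return max(vocab.values()) + 1


n_words = network_vocab_size(vocab)
```

Here `max_key` is the string "zebra" (the lexicographically largest word), whereas `max_index` is 4 and `n_words` is 5.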
Reverts commit a6eeda7ed6037285627ca6e468a7ef8ab467034f (which reverted 09eb3594fb395d43dd3f6c2a4dec85af683cbb30) and adds angle brackets around eos and unk. Nematus now actually uses unk, and if it is not set to 1 it crashes.