In data_generator, don't yield if the sentence cannot be encoded in the vocabulary

elliottd commented 9 years ago

We have a problem when yielding training examples that contain only one word. If that word is not in the vocabulary then there is essentially nothing to learn and so the example should not be yielded.

Traceback (most recent call last):
  File "train.py", line 138, in <module>
    model.train_model()
  File "train.py", line 64, in train_model
    self.data_generator.yield_training_batch():
  File "data_generator.py", line 146, in yield_training_batch
    description.split())
  File "data_generator.py", line 326, in format_sequence
    seq_array)
AssertionError: time 0 sequence kaffeebohnen len w_indices 0 seq_array [[ 0. 1. 0. ..., 0. 0. 0.]
 [ 0. 0. 0. ..., 0. 0. 0.]
 [ 0. 0. 0. ..., 0. 0. 0.]
 ...,
 [ 0. 0. 0. ..., 0. 0. 0.]
 [ 0. 0. 0. ..., 0. 0. 0.]
 [ 0. 0. 0. ..., 0. 0. 0.]]

elliottd commented 9 years ago

It's no longer clear to me that we should do this because it would have an unpredictable interaction with estimating a source language hidden vector for an image. However, we probably shouldn't add the word to the vocabulary if it is a singleton (especially if it's in the validation data).
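To illustrate the singleton idea, here is a rough sketch of building a vocabulary that leaves out words seen only once. The function name `build_vocab`, the `min_count` threshold, and the toy sentences are all made up for this example; this is not the code in data_generator.py.

```python
from collections import Counter


def build_vocab(sentences, min_count=2):
    """Map each word seen at least `min_count` times to an integer index.

    Hypothetical helper: singletons (count 1) are dropped when min_count=2.
    """
    counts = Counter(word for sentence in sentences for word in sentence.split())
    kept = sorted(word for word, count in counts.items() if count >= min_count)
    return {word: index for index, word in enumerate(kept)}


# Toy data: "kaffeebohnen" appears once, so it never enters the vocabulary.
train = ["a dog runs", "a dog sleeps", "kaffeebohnen"]
vocab = build_vocab(train)
```

With a threshold like this, a one-word sentence such as "kaffeebohnen" would still end up with no in-vocabulary words, so it does not remove the need to handle empty encodings.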

elliottd commented 9 years ago

Fixed. We don't yield a sentence if its encoding would be [~~, ]~~
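A minimal sketch of this kind of check, assuming a plain dict vocabulary. The names `encode` and `yield_encodable` are hypothetical and the data is a toy example; the real logic lives in data_generator.py and may differ.

```python
def encode(words, vocab):
    """Return the vocabulary indices of the words that are in `vocab`."""
    return [vocab[word] for word in words if word in vocab]


def yield_encodable(descriptions, vocab):
    """Yield (description, indices) pairs, skipping unencodable sentences."""
    for description in descriptions:
        w_indices = encode(description.split(), vocab)
        if not w_indices:
            # No word is in the vocabulary, so there is nothing to learn.
            continue
        yield description, w_indices


# Toy vocabulary and data, invented for illustration only.
vocab = {"coffee": 0, "beans": 1}
examples = list(yield_encodable(["coffee beans", "kaffeebohnen", "beans"], vocab))
```

Here the one-word sentence "kaffeebohnen" is skipped instead of reaching an assertion on an empty index list.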

elliottd / GroundedTranslation

In data_generator, don't yield if the sentence cannot be encoded in the vocabulary #5