alireza202 opened this issue 6 years ago
P_vocab is a distribution over the vocabulary words. So everything outside of the vocab has no mass.
But OOV is added to the vocabulary, right?
The vocab is fixed size throughout. I am looking into this issue in depth. The way OOV words are predicted during decoding is really only meaningful during training, where the target sentence guides the prediction. During testing, because the OOV words have no vector representation and don't participate in the attention-driven context, the model has to use other available information. I suspect the model is leveraging the order of the OOV words and the context information from their non-OOV neighboring words.
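For context, here is a minimal sketch of how in-article OOV words can be given temporary ids just past the fixed vocabulary, which is what lets the target sentence "guide" copying during training. The names (`article_to_extended_ids`, `word2id`, `unk_id`) are my own for illustration, not the repo's exact API:

```python
# Hypothetical sketch, not the repo's exact code: give in-article OOVs
# temporary ids just past the fixed vocabulary.
def article_to_extended_ids(article_words, word2id, unk_id):
    ids = []            # encoder input ids: every OOV maps to unk_id
    extended_ids = []   # ids in the extended vocab: OOVs get temporary ids
    oovs = []           # in-article OOV words, in order of first appearance
    vocab_size = len(word2id)
    for w in article_words:
        if w in word2id:
            ids.append(word2id[w])
            extended_ids.append(word2id[w])
        else:
            ids.append(unk_id)
            if w not in oovs:
                oovs.append(w)
            # temporary id sits past the fixed vocab: vocab_size + 0, + 1, ...
            extended_ids.append(vocab_size + oovs.index(w))
    return ids, extended_ids, oovs
```

If the target summary is mapped with the same `oovs` list, an OOV that appears in both article and target gets a consistent temporary id, so the loss can reward copying it even though it is outside the fixed vocab.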
In model.py, lines 163-164:
```python
extra_zeros = tf.zeros((self._hps.batch_size, self._max_art_oovs))
vocab_dists_extended = [tf.concat(axis=1, values=[dist, extra_zeros]) for dist in vocab_dists]  # list length max_dec_steps of shape (batch_size, extended_vsize)
```
It pads the original vocab_dists with a zeros tensor, which means P_vocab(OOV) = 0.
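To make that concrete, here is a rough sketch of how the zero-padded vocabulary distribution and the attention distribution can be combined for one decoder step. This is not the repo's exact final-distribution code and the names are assumptions; the point is that the scattered copy term is the only place an in-article OOV can get probability mass:

```python
import tensorflow as tf

# Rough sketch (assumed names, not the repo's exact code) of combining the
# padded vocab distribution with the copy/attention distribution.
def calc_final_dist(vocab_dist, attn_dist, p_gen, enc_batch_extend_ids,
                    batch_size, max_art_oovs):
    # vocab_dist: (batch_size, vsize), attn_dist: (batch_size, enc_len),
    # p_gen: (batch_size, 1), enc_batch_extend_ids: (batch_size, enc_len) int32
    vsize = int(vocab_dist.shape[1])
    extended_vsize = vsize + max_art_oovs
    extra_zeros = tf.zeros((batch_size, max_art_oovs))
    vocab_dist_extended = p_gen * tf.concat([vocab_dist, extra_zeros], axis=1)

    # Scatter attention weights onto their source-word ids in the extended
    # vocab: in-article OOVs only receive mass through this copy term.
    enc_len = tf.shape(enc_batch_extend_ids)[1]
    batch_nums = tf.tile(tf.expand_dims(tf.range(batch_size), 1), [1, enc_len])
    indices = tf.stack([batch_nums, enc_batch_extend_ids], axis=2)
    copy_dist = tf.scatter_nd(indices, (1.0 - p_gen) * attn_dist,
                              [batch_size, extended_vsize])
    return vocab_dist_extended + copy_dist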
I find that in the encoding step, an OOV input word is represented as the UNK token. According to the code, if an OOV word is copied by the model, does that mean the UNK embedding contributes a lot to that decoding step? I agree with bhomass that the model is leveraging the context of the OOV word.
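A tiny illustration of that point, with toy sizes I made up: both OOV positions below are fed the exact same UNK embedding row, so anything that distinguishes one copied OOV from another has to come from its neighbors through the encoder, not from its own input vector.

```python
import tensorflow as tf

# Toy illustration with assumed sizes: positions 1 and 3 are OOV, so both are
# fed the same UNK embedding row; they differ only through their neighbors.
vocab_size, emb_dim, unk_id = 5, 4, 0
embedding = tf.random.normal([vocab_size, emb_dim])
enc_ids = tf.constant([[1, unk_id, 3, unk_id]])          # OOVs already mapped to UNK
enc_inputs = tf.nn.embedding_lookup(embedding, enc_ids)  # shape (1, 4, emb_dim)
```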
Do you manually set P_vocab(OOV) = 0 in your code somewhere? I can't seem to find such a thing. In your paper you said:
How would P_vocab(OOV) be zero? If you don't set it to zero manually, it won't be. What if an OOV word is selected (in the extended vocab) during decoding? Do you replace it during postprocessing?
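For what it's worth, this is the kind of postprocessing I would expect (a hypothetical sketch with my own names, not necessarily what the repo does): any decoded id beyond the fixed vocab size is mapped back to the corresponding in-article OOV word, so the copied word appears in the output instead of UNK.

```python
# Hypothetical sketch, not the repo's exact function: map decoded ids from the
# extended vocab back to words, using the per-article OOV list.
def ids_to_words(output_ids, id2word, article_oovs):
    vocab_size = len(id2word)
    words = []
    for i in output_ids:
        if i < vocab_size:
            words.append(id2word[i])
        else:
            oov_idx = i - vocab_size
            if oov_idx < len(article_oovs):
                words.append(article_oovs[oov_idx])
            else:
                words.append("[UNK]")  # shouldn't happen if ids are consistent
    return words
```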