evanmiltenburg opened this issue 7 years ago
Also, I set mask_zero to True in this example, but the Dense layer right now doesn't seem to care about the padding value. The effect of this, I guess, is that the model currently doesn't mask out the embedding for the padding. Is that an issue? (For the same behavior as the current code, we could also just set mask_zero to False.)
The Keras docs say this about the Embedding layer's mask_zero argument:
mask_zero: Whether or not the input value 0 is a special "padding" value that should be masked out. This is useful for recurrent layers which may take variable length input. If this is True then all subsequent layers in the model need to support masking or an exception will be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input_dim should equal |vocabulary| + 2).
I'm referring here to the Keras 1.0.6 docs, as the current docs are for Keras 2, but the GroundedTranslation README says it requires 1.0.7. (I couldn't find 1.0.7 docs, so 1.0.6 is the closest I could get.)
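As a small, standalone illustration of that paragraph (a toy example in the Keras 1.x style used elsewhere in this thread, not the repository's code; the sizes are placeholders):

from keras.layers import Input, Embedding, LSTM

vocab = 1000                              # toy vocabulary size
seq = Input(shape=(30,), dtype='int32')   # 30 padded timesteps per sample, index 0 = padding
emb = Embedding(input_dim=vocab + 2,      # |vocabulary| + 2, per the docs quote above
                output_dim=64,
                mask_zero=True)(seq)
out = LSTM(output_dim=128)(emb)           # LSTM supports masking, so padded timesteps are skipped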
I think that the Embedding layer did not exist when we first wrote this code. If you can get the Embedding layer to work, then please submit a pull request.
I tried to implement the change, but it didn't work. I don't think I understand the model as it is right now. Here's the relevant code:
def buildKerasModel(self, use_sourcelang=False, use_image=True):
    '''
    Define the exact structure of your model here. We create an image
    description generation model by merging the VGG image features with
    a word embedding model, with an RNN over the sequences.
    '''
    logger.info('Building Keras model...')
    text_input = Input(shape=(self.max_t, self.vocab_size), name='text')
    text_mask = Masking(mask_value=0., name='text_mask')(text_input)

    # Word embeddings
    wemb = TimeDistributed(Dense(output_dim=self.embed_size,
                                 input_dim=self.vocab_size,
                                 W_regularizer=l2(self.l2reg)),
                           name="w_embed")(text_mask)
    drop_wemb = Dropout(self.dropin, name="wemb_drop")(wemb)

    # Embed -> Hidden
    emb_to_hidden = TimeDistributed(Dense(output_dim=self.hidden_size,
                                          input_dim=self.vocab_size,
                                          W_regularizer=l2(self.l2reg)),
                                    name='wemb_to_hidden')(drop_wemb)
I have no idea what emb_to_hidden does, or why it even works. Here's why:
- I assume that w_embed takes a one-hot vector and produces an embedding.
- I assume that wemb_drop then randomly drops a proportion (self.dropin) of the values during training, so as to make the model more robust.
- I assume that wemb_to_hidden connects the embedding layer to the hidden layer.

But then why is the input dimension for the last layer equal to self.vocab_size? Shouldn't that be the same as embed_size? I just tried this out, and the model still seems to work.
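A possible explanation, hedged because I haven't checked the 1.0.7 source: in the functional API a layer's weights are built from the shape of the tensor it is actually called on, and the input_dim keyword only matters when the layer is the first layer of a model. If that is right, the mismatched input_dim here is silently ignored, and the layer could just as well be written without it:

# drop_wemb has shape (batch, max_t, embed_size), so the inner Dense is built
# with embed_size inputs regardless of the input_dim keyword (assuming the
# behaviour described above).
emb_to_hidden = TimeDistributed(Dense(output_dim=self.hidden_size,
                                      W_regularizer=l2(self.l2reg)),
                                name='wemb_to_hidden')(drop_wemb)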
Also, the documentation for Embedding says the following:
Turn positive integers (indexes) into dense vectors of fixed size. eg. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
This layer can only be used as the first layer in a model.
So it can only work on a sequence, whereas the Dense layer works on a single one-hot vector. Also, the documentation says that the difference between Dense and Embedding is that the latter does not have bias terms. That also means: fewer parameters. This seems like a desirable property, so I'd really like to get this to work. (Otherwise, we could just keep using the Dense layer.)
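To put rough numbers on that (made-up sizes): with a vocabulary of 10,000 words and an embedding size of 256, the Dense version has 10,000 × 256 weights plus 256 biases = 2,560,256 parameters, while the Embedding version has 10,000 × 256 = 2,560,000.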
Embedding could also simplify the code:
- It takes care of masking, so we don't need the Masking layer.
- It takes care of dropout, so we don't need the dropout layer.
- It works on sequences out of the box, so I don't think it needs to be wrapped by TimeDistributed.

Here's what I tried:
text_input = Input(shape=(self.max_t, self.vocab_size), name='text')

# Word embeddings
wemb = Embedding(output_dim=self.embed_size,
                 input_dim=self.vocab_size,
                 W_regularizer=l2(self.l2reg),
                 mask_zero=True,
                 name="w_embed")(text_input)
drop_wemb = Dropout(self.dropin, name="wemb_drop")(wemb)
emb_to_hidden = TimeDistributed(Dense(output_dim=self.hidden_size,
                                      input_dim=self.vocab_size,
                                      W_regularizer=l2(self.l2reg)),
                                name='wemb_to_hidden')(drop_wemb)
This results in an error: Exception: Input 0 is incompatible with layer wemb_to_hidden: expected ndim=3, found ndim=4. I think this has something to do with the way TimeDistributed and batching work, but I don't understand what's going on.
I think the problem is that applying a TimeDistributed over the Embedding layer is producing a rank=4 tensor. The Embedding layer already returns a rank=3 tensor (batch size, length, features) so you don't need a TimeDistributed over that.
You might have found a bug with the input to emb_to_hidden having an input dimensionality of vocab_size. I don't understand why that works either.
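For what it's worth, a hedged shape walk-through that is consistent with the reported exception, assuming the input stays one-hot with shape (max_t, vocab_size) per sample:

# text_input:  (batch, max_t, vocab_size)               -- one-hot, 3D with the batch axis
# Embedding looks up every entry of its input as an index, so
# w_embed out:  (batch, max_t, vocab_size, embed_size)  -- 4D
# TimeDistributed(Dense(...)) expects 3D (batch, time, features),
# hence "expected ndim=3, found ndim=4".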
So I tried to do what we discussed earlier (remove TimeDistributed and use the Embedding directly), but it still gives the same error: Exception: Input 0 is incompatible with layer rnn: expected ndim=3, found ndim=4. Here's the relevant code:
text_input = Input(shape=(self.max_t, self.vocab_size), name='text')

# Word embeddings
wemb = Embedding(output_dim=self.embed_size,
                 input_dim=self.vocab_size,
                 input_length=self.max_t,
                 W_regularizer=l2(self.l2reg),
                 mask_zero=True,
                 name="w_embed")(text_input)
The LSTM part:
logger.info("Building an LSTM")
rnn = InitialisableLSTM(output_dim=self.hidden_size,
                        input_dim=self.hidden_size,
                        return_sequences=True,
                        W_regularizer=l2(self.l2reg),
                        U_regularizer=l2(self.l2reg),
                        name='rnn')([wemb, rnn_initialisation])  # was ([emb_to_hidden, rnn_initialisation])
I'm using hidden_size=300 for the time being, but ideally this would be customizable as well. That would require another layer between Embedding and InitialisableLSTM.
A Keras tutorial [1] says the input to an embedding layer should be 2D. Is our text input already 3D?
If it is already 3D, you'll need to change the data generator to yield appropriate 2D inputs.
Perhaps the tutorial from above can offer some useful guidance.
[1] https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html
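For reference, a minimal sketch of what that might look like end-to-end. This is not code from the repository: it assumes the data generator is changed to yield integer word indices of shape (max_t,) per sample with index 0 reserved for padding, keeps the Keras 1.x keyword style used above, and reuses the repo's InitialisableLSTM and rnn_initialisation names as-is. Whether TimeDistributed propagates the Embedding mask correctly in Keras 1.0.x would still need checking.

text_input = Input(shape=(self.max_t,), dtype='int32', name='text')

# Lookup plus masking of the padding index 0 in one layer
wemb = Embedding(input_dim=self.vocab_size,
                 output_dim=self.embed_size,
                 input_length=self.max_t,
                 mask_zero=True,
                 W_regularizer=l2(self.l2reg),
                 name='w_embed')(text_input)        # (batch, max_t, embed_size)

# Projection so hidden_size need not equal embed_size
emb_to_hidden = TimeDistributed(Dense(output_dim=self.hidden_size,
                                      W_regularizer=l2(self.l2reg)),
                                name='wemb_to_hidden')(wemb)

rnn = InitialisableLSTM(output_dim=self.hidden_size,
                        input_dim=self.hidden_size,
                        return_sequences=True,
                        W_regularizer=l2(self.l2reg),
                        U_regularizer=l2(self.l2reg),
                        name='rnn')([emb_to_hidden, rnn_initialisation])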
I would say it is 2D, or at least the code says it is. Or am I misunderstanding the shape-argument?
text_input = Input(shape=(self.max_t, self.vocab_size), name='text')
text_mask = Masking(mask_value=0., name='text_mask')(text_input)
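A quick note on that, hedged because it is from the Keras docs rather than the 1.0.7 source: the shape argument to Input excludes the batch axis, so the two readings of "2D" differ by one dimension:

Input(shape=(self.max_t, self.vocab_size))   # 2D per sample -> 3D batched tensor (batch, max_t, vocab_size)
Input(shape=(self.max_t,), dtype='int32')    # 1D per sample -> the 2D (batch, max_t) index matrix the tutorial expects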
I'll just use this thread for notes of things that I don't understand.
Why does format_sequence skip tokens that don't exceed the UNK threshold? That results in ungrammatical descriptions like "I saw the yesterday" which are used to train the system. I can imagine that having an UNK token does help with the quality of the descriptions, because at least the system is aware that there should be SOME word in there.
I think that replicates some other implementations that discarded the tokens completely. It probably makes more sense to add an UNK token to the vocabulary and use the UNK threshold to substitute the tokens.
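A minimal sketch of that substitution; substitute_rare_tokens, counts and unk_threshold are made-up names for illustration, not the repository's actual variables:

UNK = '<unk>'   # hypothetical reserved token, added to the vocabulary

def substitute_rare_tokens(tokens, counts, unk_threshold):
    """Replace rare tokens with UNK instead of dropping them, so the
    description keeps its original length and word order."""
    return [tok if counts.get(tok, 0) >= unk_threshold else UNK
            for tok in tokens]

# substitute_rare_tokens(['i', 'saw', 'the', 'aardwolf', 'yesterday'], counts, 5)
# -> ['i', 'saw', 'the', '<unk>', 'yesterday']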
Also, it's easier for me to track these different issues if you create different issues. I might forget that we discussed the format_sequence and UNK token in this thread about the Embedding layer.
Ok, I opened #31 for this.
I'm not sure how to initialize the output layers. Or at least: I don't know what to do with the size of the hidden layer. Here's the current code:
# Recurrent layer
if self.gru:
    logger.info("Building a GRU")
    rnn = InitialisableGRU(output_dim=self.hidden_size,
                           input_dim=self.hidden_size,
                           return_sequences=True,
                           W_regularizer=l2(self.l2reg),
                           U_regularizer=l2(self.l2reg),
                           name='rnn')([emb_to_hidden, rnn_initialisation])
else:
    logger.info("Building an LSTM")
    rnn = InitialisableLSTM(output_dim=self.hidden_size,
                            input_dim=self.hidden_size,
                            return_sequences=True,
                            W_regularizer=l2(self.l2reg),
                            U_regularizer=l2(self.l2reg),
                            name='rnn')([emb_to_hidden, rnn_initialisation])

output = TimeDistributed(Dense(output_dim=self.vocab_size,
                               input_dim=self.hidden_size,
                               W_regularizer=l2(self.l2reg),
                               activation='softmax'),
                         name='output')(rnn)
Because the output layer has self.hidden_size as its input_dim, we can't put embeddings.T there as the weights because the dimensions don't line up. Two options:
1. Set self.hidden_size to self.embed_size, so that the output layer has self.embed_size as its input dimensions.
2. Put a TimeDistributed(Dense(...)) in between the LSTM/RNN and the current output layer to map the self.hidden_size to self.embed_size. Then we can make the output layer have the correct dimensions.

I think you want option 2. See Press and Wolf (2017) for some more details on tying the embeddings.
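A hedged sketch of option 2, in the Keras 1.x style used above; hidden_to_embed is a made-up layer name, and only the reshaping step is shown, since actually sharing the embedding weights with the softmax would need extra plumbing beyond this:

# Map the RNN states from hidden_size down to embed_size ...
hidden_to_embed = TimeDistributed(Dense(output_dim=self.embed_size,
                                        W_regularizer=l2(self.l2reg)),
                                  name='hidden_to_embed')(rnn)

# ... so the output layer maps embed_size -> vocab_size and its weight matrix
# has the same shape as embeddings.T, which is what tying requires.
output = TimeDistributed(Dense(output_dim=self.vocab_size,
                               W_regularizer=l2(self.l2reg),
                               activation='softmax'),
                         name='output')(hidden_to_embed)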
models.py imports Embedding on line 4, but it never seems to be used. Instead, the code to get word embeddings is the wemb = TimeDistributed(Dense(...), name="w_embed") snippet quoted above. If I understand this correctly, the Dense layer takes a one-hot vector as input, and learns to output embeddings for each word. I think this could be rewritten with an Embedding layer, either minimally or in a slightly more verbose form (a rough sketch follows below). But seeing as the code once made use of Embedding and then it was dropped again, I'm a bit suspicious. Was there a reason for this layer to be replaced with Dense?

(Embeddings also have a dropout keyword argument. So I guess the drop_wemb could be removed as well, if we pass self.dropin as a keyword argument. That would make the code more concise.)
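A hedged sketch of such a rewrite, assuming integer-index inputs as discussed earlier in the thread and taking the dropout keyword on trust from the docs remark above; none of this is code from the repository:

text_input = Input(shape=(self.max_t,), dtype='int32', name='text')

# One layer instead of w_embed + text_mask + wemb_drop:
# lookup, masking of the padding index 0, and embedding dropout.
wemb = Embedding(input_dim=self.vocab_size,
                 output_dim=self.embed_size,
                 mask_zero=True,
                 dropout=self.dropin,
                 W_regularizer=l2(self.l2reg),
                 name='w_embed')(text_input)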