elliottd / GroundedTranslation

Multilingual image description
https://staff.fnwi.uva.nl/d.elliott/GroundedTranslation/
BSD 3-Clause "New" or "Revised" License

Embedding layer #29

Open evanmiltenburg opened 7 years ago

evanmiltenburg commented 7 years ago

models.py imports Embedding on line 4, but it never seems to be used. Instead, the code to get word embeddings is this:

wemb = TimeDistributed(Dense(output_dim=self.embed_size,
                             input_dim=self.vocab_size,
                             W_regularizer=l2(self.l2reg)),
                       name="w_embed")(text_mask)

If I understand this correctly, the Dense layer takes a one-hot vector as input, and learns to output embeddings for each word. I think this could be rewritten as:

wemb = TimeDistributed(Embedding(output_dim=self.embed_size,
                                 input_dim=self.vocab_size + 2,
                                 mask_zero=True,
                                 W_regularizer=l2(self.l2reg)),
                       name="w_embed")(text_mask)

Or, slightly more verbose:

embedding_layer = Embedding(output_dim=self.embed_size,
                            input_dim=self.vocab_size + 2,
                            mask_zero=True,
                            W_regularizer=l2(self.l2reg))
wemb = TimeDistributed(embedding_layer, name="w_embed")(text_mask)

But seeing as the code once made use of Embedding and later dropped it, I'm a bit suspicious. Was there a reason to replace this layer with Dense?

(Embedding also has a dropout keyword argument, so I guess the drop_wemb layer could be removed as well if we pass self.dropin there. That would make the code more concise.)
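For concreteness, here is a sketch of the embedding layer constructed that way (untested; Keras 1.x keyword names, following the verbose variant above):

    # Sketch only (Keras 1.x keywords): dropout folded into the Embedding
    # layer, so the separate drop_wemb Dropout layer would no longer be needed.
    embedding_layer = Embedding(output_dim=self.embed_size,
                                input_dim=self.vocab_size + 2,
                                mask_zero=True,
                                dropout=self.dropin,   # in place of drop_wemb
                                W_regularizer=l2(self.l2reg))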

evanmiltenburg commented 7 years ago

Also, I set mask_zero to True in this example, but the current Dense layer doesn't seem to care about the padding value. The effect, I guess, is that the model doesn't mask out the embedding for the padding. Is that an issue? (For the same behavior as the current code, we could also just set mask_zero to False.)

The Keras docs say this about the Embedding layer's mask_zero argument:

mask_zero: Whether or not the input value 0 is a special "padding" value that should be masked out. This is useful for recurrent layers which may take variable length input. If this is True then all subsequent layers in the model need to support masking or an exception will be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input_dim should equal |vocabulary| + 2).

I'm referring here to the Keras 1.0.6 docs, as the current docs are for Keras 2, but the GroundedTranslation README says it requires 1.0.7. (I couldn't find 1.0.7 docs, so 1.0.6 is the closest I could get.)

elliottd commented 7 years ago

I think that the Embedding layer did not exist when we first wrote this code.

If you can get the Embedding layer to work, then please submit a pull request.

evanmiltenburg commented 7 years ago

I tried to implement the change, but it didn't work. I don't think I understand the model as it is right now. Here's the relevant code:

    def buildKerasModel(self, use_sourcelang=False, use_image=True):
        '''
        Define the exact structure of your model here. We create an image
        description generation model by merging the VGG image features with
        a word embedding model, with an RNN over the sequences.
        '''
        logger.info('Building Keras model...')

        text_input = Input(shape=(self.max_t, self.vocab_size), name='text')
        text_mask = Masking(mask_value=0., name='text_mask')(text_input)

        # Word embeddings
        wemb = TimeDistributed(Dense(output_dim=self.embed_size,
                                     input_dim=self.vocab_size,
                                     W_regularizer=l2(self.l2reg)),
                               name="w_embed")(text_mask)
        drop_wemb = Dropout(self.dropin, name="wemb_drop")(wemb)

        # Embed -> Hidden
        emb_to_hidden = TimeDistributed(Dense(output_dim=self.hidden_size,
                                              input_dim=self.vocab_size,
                                              W_regularizer=l2(self.l2reg)),
                                        name='wemb_to_hidden')(drop_wemb)

I have no idea what emb_to_hidden does, or why it even works. Here's why:

  • I assume that w_embed takes a one-hot vector and produces an embedding.
  • I assume that wemb_drop then randomly drops a proportion (self.dropin) of the values during training, so as to make the model more robust.
  • I assume that wemb_to_hidden connects the embedding layer to the hidden layer.

But then why is the input dimension for the last layer equal to self.vocab_size? Shouldn't that be the same as embed_size? I just tried this out, and the model still seems to work.

Also, the documentation for Embedding says the following:

Turn positive integers (indexes) into dense vectors of fixed size. eg. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]

This layer can only be used as the first layer in a model.

So it can only work on a sequence, whereas the Dense layer works on a single one-hot vector. Also, the documentation says that the difference between Dense and Embedding is that the latter does not have bias terms. That also means: fewer parameters. This seems like a desirable property, so I'd really like to get this to work. (Otherwise, we could just keep using the Dense layer.)
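To make the "fewer parameters" point concrete, here is a toy back-of-the-envelope count (illustration only, with made-up sizes; V stands in for vocab_size and E for embed_size):

    # Illustration only: a Dense projection from one-hot vectors learns a weight
    # matrix plus a bias vector, while an Embedding is just a lookup table.
    V, E = 10000, 256          # made-up vocab_size and embed_size
    dense_params = V * E + E   # weights + one bias per output unit = 2,560,256
    embedding_params = V * E   # lookup table only, no bias         = 2,560,000
    print(dense_params - embedding_params)   # the difference is the E = 256 biases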

Embedding could also simplify the code:

  • It takes care of masking, so we don't need the Masking layer.
  • It takes care of dropout, so we don't need the dropout layer.
  • It works on sequences out of the box, so I don't think it needs to be wrapped by TimeDistributed.

Here's what I tried:

        text_input = Input(shape=(self.max_t, self.vocab_size), name='text')

        # Word embeddings
        wemb = Embedding(output_dim=self.embed_size,
                         input_dim=self.vocab_size,
                         W_regularizer=l2(self.l2reg),
                         mask_zero=True,
                         name="w_embed")(text_input)

        drop_wemb = Dropout(self.dropin, name="wemb_drop")(wemb)
        emb_to_hidden = TimeDistributed(Dense(output_dim=self.hidden_size,
                                              input_dim=self.vocab_size,
                                              W_regularizer=l2(self.l2reg)),
                                        name='wemb_to_hidden')(drop_wemb)

This results in an error: Exception: Input 0 is incompatible with layer wemb_to_hidden: expected ndim=3, found ndim=4. I think this has something to do with the way TimeDistributed and batching work, but I don't understand what's going on.

elliottd commented 7 years ago

I think the problem is that applying a TimeDistributed over the Embedding layer is producing a rank=4 tensor. The Embedding layer already returns a rank=3 tensor (batch size, length, features) so you don't need a TimeDistributed over that.

You might have found a bug with the input to emb_to_hidden having an input dimensionality of vocab_size. I don't understand why that works either.


evanmiltenburg commented 7 years ago

So I tried to do what we discussed earlier (remove TimeDistributed and use the Embedding directly), but it still gives the same error: Exception: Input 0 is incompatible with layer rnn: expected ndim=3, found ndim=4. Here's the relevant code:

    text_input = Input(shape=(self.max_t, self.vocab_size), name='text')

    # Word embeddings
    wemb = Embedding(output_dim=self.embed_size,
                     input_dim=self.vocab_size,
                     input_length=self.max_t,
                     W_regularizer=l2(self.l2reg),
                     mask_zero=True,
                     name="w_embed")(text_input)

The LSTM part:

        logger.info("Building an LSTM")
        rnn = InitialisableLSTM(output_dim=self.hidden_size,
                  input_dim=self.hidden_size,
                  return_sequences=True,
                  W_regularizer=l2(self.l2reg),
                  U_regularizer=l2(self.l2reg),
                  name='rnn')([wemb, rnn_initialisation])# was ([emb_to_hidden, rnn_initialisation])

I'm using hidden_size=300 for the time being, but ideally this would be customizable as well. That would require another layer between Embedding and InitialisableLSTM.
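For reference, a rough sketch of what that intermediate layer could look like once the ndim issue is sorted out (untested; it just reuses the TimeDistributed(Dense(...)) pattern already in the model to map embed_size up to hidden_size):

        # Sketch only: project the embeddings up to hidden_size before the
        # LSTM, so hidden_size does not have to equal embed_size.
        emb_to_hidden = TimeDistributed(Dense(output_dim=self.hidden_size,
                                              input_dim=self.embed_size,
                                              W_regularizer=l2(self.l2reg)),
                                        name='wemb_to_hidden')(wemb)

        rnn = InitialisableLSTM(output_dim=self.hidden_size,
                                input_dim=self.hidden_size,
                                return_sequences=True,
                                W_regularizer=l2(self.l2reg),
                                U_regularizer=l2(self.l2reg),
                                name='rnn')([emb_to_hidden, rnn_initialisation])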

elliottd commented 7 years ago

A Keras tutorial [1] says the input to an embedding layer should be 2D. Is our text input already 3D?

If it is already 3D, you'll need to change the data generator to yield appropriate 2D inputs.

Perhaps that tutorial can offer some useful guidance.

[1] https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html
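For what it's worth, a rough sketch of the shapes involved and of the integer-index alternative (untested; Keras 1.x keyword names):

    # Sketch only. The shape= argument excludes the batch dimension, so the
    # current one-hot input
    #     Input(shape=(self.max_t, self.vocab_size))   # (batch, max_t, vocab_size)
    # is already a 3D tensor; Embedding treats every entry as an index and
    # returns a 4D tensor, hence the ndim=4 error. With integer indices instead:
    text_input = Input(shape=(self.max_t,), dtype='int32', name='text')
    wemb = Embedding(output_dim=self.embed_size,
                     input_dim=self.vocab_size,
                     input_length=self.max_t,
                     mask_zero=True,
                     W_regularizer=l2(self.l2reg),
                     name='w_embed')(text_input)        # (batch, max_t, embed_size)
    # The data generator would then have to yield index sequences of shape
    # (max_t,) per example, instead of one-hot matrices of shape (max_t, vocab_size).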


evanmiltenburg commented 7 years ago

I would say it is 2D, or at least the code says it is. Or am I misunderstanding the shape argument?

    text_input = Input(shape=(self.max_t, self.vocab_size), name='text')
    text_mask = Masking(mask_value=0., name='text_mask')(text_input)

evanmiltenburg commented 7 years ago

I'll just use this thread for notes of things that I don't understand.

Why does format_sequence skip tokens that don't exceed the UNK threshold? That results in ungrammatical descriptions like "I saw the yesterday", which are then used to train the system. I can imagine that having an UNK token would help the quality of the descriptions, because at least the system would be aware that there should be SOME word in there.

elliottd commented 7 years ago

I think that replicates some other implementations that discarded the tokens completely. It probably makes more sense to add an UNK token to the vocabulary and use the UNK threshold to substitute the tokens.
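For illustration, a hypothetical sketch of substitution rather than deletion (the function name and arguments are made up here; this is not the repo's actual format_sequence):

    # Hypothetical sketch, not the repo's format_sequence: words below the
    # frequency threshold are mapped to an UNK token instead of being dropped.
    def replace_rare_tokens(tokens, word_counts, unk_threshold, unk_token='<UNK>'):
        return [w if word_counts.get(w, 0) > unk_threshold else unk_token
                for w in tokens]

    # With counts {'I': 50, 'saw': 40, 'the': 100} and threshold 3,
    # "I saw the aardvark yesterday".split() becomes
    # ['I', 'saw', 'the', '<UNK>', '<UNK>'] rather than ['I', 'saw', 'the'].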

Also, it's easier for me to track these different issues if you create different issues. I might forget that we discussed the format_sequence and UNK token in this thread about the Embedding layer.

evanmiltenburg commented 7 years ago

Also, it's easier for me to track these different issues if you create different issues. I might forget that we discussed the format_sequence and UNK token in this thread about the Embedding layer.

Ok, I opened #31 for this.

evanmiltenburg commented 7 years ago

I'm not sure how to initialize the output layers. Or at least: I don't know what to do with the size of the hidden layer. Here's the current code:

        # Recurrent layer
        if self.gru:
            logger.info("Building a GRU")
            rnn = InitialisableGRU(output_dim=self.hidden_size,
                      input_dim=self.hidden_size,
                      return_sequences=True,
                      W_regularizer=l2(self.l2reg),
                      U_regularizer=l2(self.l2reg),
                      name='rnn')([emb_to_hidden, rnn_initialisation])
        else:
            logger.info("Building an LSTM")
            rnn = InitialisableLSTM(output_dim=self.hidden_size,
                      input_dim=self.hidden_size,
                      return_sequences=True,
                      W_regularizer=l2(self.l2reg),
                      U_regularizer=l2(self.l2reg),
                      name='rnn')([emb_to_hidden, rnn_initialisation])

        output = TimeDistributed(Dense(output_dim=self.vocab_size,
                                       input_dim=self.hidden_size,
                                       W_regularizer=l2(self.l2reg),
                                       activation='softmax'),
                                 name='output')(rnn)

Because the output layer has self.hidden_size as its input_dim, we can't use embeddings.T as its weights: the dimensions don't line up. Two options:

  1. Make the LSTM/RNN output dimensions the same as self.embed_size, so that the output layer has self.embed_size as its input dimensions.
  2. Add another TimeDistributed(Dense(...)) in between the LSTM/RNN and the current output layer to map the self.hidden_size to self.embed_size. Then we can make the output layer have the correct dimensions.

elliottd commented 7 years ago

I think you want option 2. See Press and Wolf (2017) for some more details on tying the embeddings.
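For reference, a rough sketch of option 2 (untested; the hidden_to_embed name is made up, and the actual weight sharing with the input embedding is not shown):

        # Sketch only: project the RNN states down to embed_size so that the
        # output layer's weight matrix has shape (embed_size, vocab_size),
        # i.e. the right shape for tying it to the transposed input embeddings.
        hidden_to_embed = TimeDistributed(Dense(output_dim=self.embed_size,
                                                input_dim=self.hidden_size,
                                                W_regularizer=l2(self.l2reg)),
                                          name='hidden_to_embed')(rnn)

        output = TimeDistributed(Dense(output_dim=self.vocab_size,
                                       input_dim=self.embed_size,
                                       W_regularizer=l2(self.l2reg),
                                       activation='softmax'),
                                 name='output')(hidden_to_embed)
        # Actually sharing (not just initialising) these weights with the input
        # embedding would still need extra wiring, e.g. a custom layer.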