XifengGuo / CapsNet-Keras

A Keras implementation of CapsNet in NIPS2017 paper "Dynamic Routing Between Capsules". Now test error = 0.34%.

Problem with batch_dot #98

Open jpviguerasguillen opened 5 years ago

jpviguerasguillen commented 5 years ago

I have updated the code for TensorFlow 2.x with its integrated Keras. Supposedly, everything should work the same. I am running it in Google Colab. I have the following problem:

In CapsuleLayer, once the input x is expanded and tiled, it is multiplied by the weight matrix W. I added some shape prints and I get:

class CapsuleLayer(layers.Layer):
  # ...

def call(self, inputs, training=None):
  # inputs.shape=[None, input_num_capsule, input_dim_capsule]
  # inputs_expand.shape=[None, 1, input_num_capsule, input_dim_capsule]
  inputs_expand = K.expand_dims(inputs, 1)

  # Replicate num_capsule dimension to prepare being multiplied by W
  # inputs_tiled.shape=[None, num_capsule, input_num_capsule, 
  #                     input_dim_capsule]
  inputs_tiled = K.tile(inputs_expand, [1, self.num_capsule, 1, 1])
  print('This is x:', inputs_tiled.shape)

  print('This is W:', self.W.shape)

  #inputs_tiled = K.expand_dims(inputs_tiled, 4)
  #print('This is x expanded:')
  #print(inputs_tiled.shape)

  # Compute `inputs * W` by scanning inputs_tiled on dimension 0.
  # x.shape=[num_capsule, input_num_capsule, input_dim_capsule]
  # W.shape=[num_capsule, input_num_capsule, dim_capsule, input_dim_capsule]
  # Regard the first two dimensions as `batch` dimension,
  # then matmul: [input_dim_capsule] x [dim_capsule, input_dim_capsule]^T -> 
  #              [dim_capsule].
  # inputs_hat.shape = [None, num_capsule, input_num_capsule, dim_capsule]
  inputs_hat = K.map_fn(lambda x: K.batch_dot(x, self.W, [2, 3]), 
                        elems=inputs_tiled)

  print('This is inputs_hat (`inputs * W`):', inputs_hat.shape)

  # Begin: Routing algorithm ----------------------------------------------#

This is x: (None, 10, 1152, 8)
This is W: (10, 1152, 16, 8)
This is inputs_hat (inputs * W): (None, 10, 1152, 1152, 16)

However, the expected shape is: inputs_hat.shape = [None, 10, 1152, 16]

Subsequently, I get this error for the next batch_dot (but this is expected, as inputs_hat is already wrong):

ValueError: Cannot do batch_dot on inputs with shapes (None, 10, 10, 1152, 16) and (None, 10, 1152, 1152, 16) with axes=[2, 3]. x.shape[2] != y.shape[3] (10 != 1152).

I have tried expanding x with a last dimension, i.e. to (None, 10, 1152, 8, 1), but surprisingly this gives:

This is inputs_hat (inputs * W): (None, 10, 1152, 1, 1152, 16)

I don't understand why 1152 is replicated in this matmul! This matrix multiplication should be easy!
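
As far as I understand it, in tf.keras 2.x K.batch_dot treats only axis 0 as the batch axis and contracts only the single pair of axes given in axes, so the two 1152 axes of x and W are not matched but combined, outer-product style. A minimal sketch that reproduces the shapes printed above with random tensors (this is my reading of the behaviour, not a documented guarantee):

import tensorflow as tf
from tensorflow.keras import backend as K

x = tf.random.normal([10, 1152, 8])      # one element of inputs_tiled (no batch dim)
W = tf.random.normal([10, 1152, 16, 8])  # the layer's weight tensor
out = K.batch_dot(x, W, [2, 3])          # contracts axis 2 of x with axis 3 of W
print(out.shape)                         # (10, 1152, 1152, 16) instead of (10, 1152, 16)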

jpviguerasguillen commented 5 years ago

I managed to solve it by changing the code: I used matmul instead of batch_dot. For that, inputs_tiled (without the batch dimension) needed to have the same rank as W. Note that I use the TensorFlow library directly:

  inputs_expand = tf.expand_dims(inputs, 1)
  inputs_tiled  = tf.tile(inputs_expand, [1, self.num_capsule, 1, 1])
  inputs_tiled  = tf.expand_dims(inputs_tiled, 4)
  inputs_hat = tf.map_fn(lambda x: tf.matmul(self.W, x), elems=inputs_tiled)
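
For what it's worth, the same votes could also be computed without map_fn or tiling at all, with a single einsum (a sketch, assuming the shapes in the comments above; note it returns rank 4, i.e. without the trailing singleton axis, so the routing code would need minor adjustments):

  # inputs.shape = [None, input_num_capsule, input_dim_capsule]             -> 'bij'
  # W.shape = [num_capsule, input_num_capsule, dim_capsule, input_dim_capsule] -> 'nikj'
  # inputs_hat.shape = [None, num_capsule, input_num_capsule, dim_capsule]  -> 'bnik'
  inputs_hat = tf.einsum('bij,nikj->bnik', inputs, self.W)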
guido-niku commented 5 years ago

Did you also have to update this line: b += K.batch_dot(outputs, inputs_hat, [2, 3]) to this: b += tf.matmul(self.W, x)? I did this mindlessly because I was getting the same error at that line. Is this the right way to correct it?

jpviguerasguillen commented 5 years ago

Yes, I also changed that. The code changed substantially with respect to the original. Here is the call function of CapsuleLayer (note that I use the TensorFlow API directly):

import tensorflow as tf   # Using tensorflow 2.0.0
from tensorflow.keras import layers, initializers
from tensorflow.keras import backend as K

# ... 

def call(self, inputs, training=None):
  # Expand the input in axis=1, tile in that axis to num_capsule, and 
  # expands another axis at the end to prepare the multiplication with W.
  #  inputs.shape=[None, input_num_capsule, input_dim_capsule]
  #  inputs_expand.shape=[None, 1, input_num_capsule, input_dim_capsule]
  #  inputs_tiled.shape=[None, num_capsule, input_num_capsule, 
  #                            input_dim_capsule, 1]
  inputs_expand = tf.expand_dims(inputs, 1)
  inputs_tiled  = tf.tile(inputs_expand, [1, self.num_capsule, 1, 1])
  inputs_tiled  = tf.expand_dims(inputs_tiled, 4)

  # Compute `W * inputs` by scanning inputs_tiled on dimension 0 (map_fn).
  # - Use matmul (without transposing any element). Note the order!
  # Thus:
  #  x.shape=[num_capsule, input_num_capsule, input_dim_capsule, 1]
  #  W.shape=[num_capsule, input_num_capsule, dim_capsule,input_dim_capsule]
  # Regard the first two dimensions as `batch` dimension,
  # then matmul: [dim_capsule, input_dim_capsule] x [input_dim_capsule, 1]-> 
  #              [dim_capsule, 1].
  #  inputs_hat.shape=[None, num_capsule, input_num_capsule, dim_capsule, 1]

  inputs_hat = tf.map_fn(lambda x: tf.matmul(self.W, x), elems=inputs_tiled)     

  # Begin: Routing algorithm ----------------------------------------------#
  # The prior for coupling coefficient, initialized as zeros.
  #  b.shape = [None, self.num_capsule, self.input_num_capsule, 1, 1].
  b = tf.zeros(shape=[tf.shape(inputs_hat)[0], self.num_capsule, 
                      self.input_num_capsule, 1, 1])

  assert self.routings > 0, 'The routings should be > 0.'
  for i in range(self.routings):
    # Apply softmax to the axis with `num_capsule`
    #  c.shape=[batch_size, num_capsule, input_num_capsule, 1, 1]
    c = layers.Softmax(axis=1)(b)

    # Compute the weighted sum of all the predicted output vectors.
    #  c.shape =  [batch_size, num_capsule, input_num_capsule, 1, 1]
    #  inputs_hat.shape=[None, num_capsule, input_num_capsule,dim_capsule,1]
    # The function `multiply` will broadcast axis=3 in c to dim_capsule.
    #  outputs.shape=[None, num_capsule, input_num_capsule, dim_capsule, 1]
    # Then sum along the input_num_capsule
    #  outputs.shape=[None, num_capsule, 1, dim_capsule, 1]
    # Then apply squash along the dim_capsule
    outputs = tf.multiply(c, inputs_hat)
    outputs = tf.reduce_sum(outputs, axis=2, keepdims=True)
    outputs = squash(outputs, axis=-2)  # [None, 10, 1, 16, 1]

    if i < self.routings - 1:
      # Update the prior b.
      #  outputs.shape =  [None, num_capsule, 1, dim_capsule, 1]
      #  inputs_hat.shape=[None,num_capsule,input_num_capsule,dim_capsule,1]
      # Multiply the outputs with the weighted_inputs (inputs_hat) and add  
      # it to the prior b.  
      outputs_tiled = tf.tile(outputs, [1, 1, self.input_num_capsule, 1, 1])
      agreement = tf.matmul(inputs_hat, outputs_tiled, transpose_a=True)
      b = tf.add(b, agreement)

  # End: Routing algorithm ------------------------------------------------#
  # Squeeze the outputs to remove useless axis:
  #  From  --> outputs.shape=[None, num_capsule, 1, dim_capsule, 1]
  #  To    --> outputs.shape=[None, num_capsule,    dim_capsule]
  outputs = tf.squeeze(outputs, [2, 4])
  return outputs
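
For completeness, the call() above relies on a build() that creates W with the shape used in the comments, and on the squash non-linearity. Roughly like this (a sketch from memory, not copied verbatim from the repo):

def build(self, input_shape):
  # input_shape = [None, input_num_capsule, input_dim_capsule]
  self.input_num_capsule = input_shape[1]
  self.input_dim_capsule = input_shape[2]
  # W.shape = [num_capsule, input_num_capsule, dim_capsule, input_dim_capsule]
  self.W = self.add_weight(
      shape=[self.num_capsule, self.input_num_capsule,
             self.dim_capsule, self.input_dim_capsule],
      initializer=initializers.get('glorot_uniform'),
      name='W')
  self.built = True


def squash(vectors, axis=-1):
  # Scale the vectors so their length lies in [0, 1) without changing direction.
  s_squared_norm = tf.reduce_sum(tf.square(vectors), axis, keepdims=True)
  scale = s_squared_norm / (1 + s_squared_norm)
  return scale * vectors / tf.sqrt(s_squared_norm + K.epsilon())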
azzageee commented 4 years ago

Thanks @jpviguerasguillen, that is a great solution that has got this code working for tf 2.0!

asirigawesha commented 4 years ago

Thank you very much! This works! The solution works on tf 2.3.0.

yeopee2 commented 3 years ago

what is num_capsule?

jpviguerasguillen commented 3 years ago

what is num_capsule?

That is the number of capsules in the current layer.
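
For example, in the repo's MNIST model the DigitCaps layer is created with num_capsule equal to the number of digit classes, roughly like this (primarycaps being the output tensor of the PrimaryCaps block):

# DigitCaps: 10 output capsules of 16 dimensions, fed by the
# 1152 8D capsules coming out of PrimaryCaps.
digitcaps = CapsuleLayer(num_capsule=10, dim_capsule=16, routings=3,
                         name='digitcaps')(primarycaps)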

jpviguerasguillen commented 3 years ago

UPDATE: While the changes I indicated above work well, I later realized that this implementation of CapsNets has a "big issue": it is not implemented as the original authors designed it.

Sabour et al.'s paper ('Dynamic Routing Between Capsules') says (page 4): "In total PrimaryCapsules has [32x6x6] capsule outputs (each output is an 8D vector) and each capsule in the [6x6] grid is sharing their weights with each other." However, this seemed to contradict the caption of their Figure 1, which says: "W_ij is a weight matrix between each u_i, i ∈ (1, 32x6x6), in PrimaryCapsules and v_j, j ∈ (1, 10)."

However, I believe they intended it as the first quote says. That is, PrimaryCaps initially has size 256x6x6, which is then interpreted as 32 capsules of 8 elements in a 6x6 grid, where all the (let's call them) 'subcapsules' in the 6x6 grid are simply the evaluation of the capsule at the different spatial points. This simply means what they said at the beginning: the weight W_ij is shared among the capsules in the 6x6 grid.

The main issue here is the terminology: they use the same term, capsules, for two different concepts. The "real capsule" would be the entity to find (say, a horizontal line or a circle; in their case they have 32 entities), whereas the "instantiation/output of the capsule" is the resulting vector at the different positions, telling whether such an entity exists or not. And, for that, we need to apply the same weights W to all vectors in the 6x6 grid from the same "capsule".


What does the code above do? It does NOT share weights, so the entities at the different spatial points could be completely different. The implementation above is not wrong per se; it simply does not consider the concept of looking for the same entity at different spatial points.

There is another missing element, which I noticed in Sabour's GitHub code: they add a bias term. In their case, the bias term in DigitCaps would be of size 16x10.
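
If someone wants to try the shared-weight variant described above, one possible sketch (my own; the names W_shared and bias are hypothetical) is to reshape the 1152 input capsules into 32 capsule types x 36 grid positions and share W over the 36 positions:

# inputs: [B, 1152, 8] -> [B, 32 capsule types, 6*6 = 36 positions, 8]
u = tf.reshape(inputs, [-1, 32, 36, 8])
# W_shared: [num_capsule=10, 32, dim_capsule=16, input_dim_capsule=8],
# i.e. one matrix per (output capsule, capsule type), shared over the 6x6 grid.
u_hat = tf.einsum('btpj,ntkj->bntpk', u, W_shared)   # [B, 10, 32, 36, 16]
u_hat = tf.reshape(u_hat, [-1, 10, 1152, 16])        # per-position votes, routed as before
# In Sabour et al.'s code a bias (per output capsule, e.g. of shape [10, 16])
# is also added to the weighted sum before squashing.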

return-sleep commented 3 years ago

Something goes wrong at b = tf.zeros(shape=[tf.shape(inputs_hat)[0], self.num_capsule, self.input_num_capsule, 1, 1]):

NotImplementedError: Cannot convert a symbolic Tensor (digitcaps/strided_slice:0) to a numpy array.

myknotruby commented 3 years ago

UPDATE: While the changes I indicated above work well, I later realized that this implementation of CapsNets has a "big issue": it is not implemented as the original authors designed it.

Sabour et al.'s paper ('Dynamic Routing Between Capsules') says (page 4): "In total PrimaryCapsules has [32x6x6] capsule outputs (each output is an 8D vector) and each capsule in the [6x6] grid is sharing their weights with each other." However, this seemed to contradict the caption of their Figure 1, which says: "W_ij is a weight matrix between each u_i, i ∈ (1, 32x6x6), in PrimaryCapsules and v_j, j ∈ (1, 10)."

I think there is NO contradiction, only a loose use of the word "weights" in the sentence "each capsule in the [6x6] grid is sharing their weights with each other." The word "weights" here means the convolutional kernel weights, not the weight matrix W_ij in Equation (2) (page 2). They never use "weights" to refer to W_ij; instead, they use "weight matrix" throughout the paper. The authors also use "weights" to refer to the convolutional kernel weights in paragraph 3 on page 2.

The full paragraph on pages 3-4 is as follows: "The second layer (PrimaryCapsules) is a convolutional capsule layer with 32 channels of convolutional 8D capsules (i.e. each primary capsule contains 8 convolutional units with a 9x9 kernel and a stride of 2). Each primary capsule output sees the outputs of all 256x81 Conv1 units whose receptive fields overlap with the location of the center of the capsule. In total PrimaryCapsules has [32, 6, 6] capsule outputs (each output is an 8D vector) and each capsule in the [6, 6] grid is sharing their weights with each other. One can see PrimaryCapsules as a Convolution layer with Eq. 1 as its block non-linearity. The final Layer (DigitCaps) has one 16D capsule per digit class and each of these capsules receives input from all the capsules in the layer below."

In this paragraph, the authors are only talking about convolution.

sandruskyi commented 3 years ago

@jpviguerasguillen YOU SAVED MY LIFE

dupsys commented 2 years ago

Hey guys, I tried the above but I get an error as follows:

TypeError: ('Keyword argument not understood:', 'share_weights')