cezannec / capsule_net_pytorch

Readable implementation of a Capsule Network as described in "Dynamic Routing Between Capsules" [Hinton et. al.]

Squash function #3

Closed · puhach closed this issue 7 months ago

puhach commented 5 years ago

Very nice tutorial, though I want to point out that the squash formula in the notebook differs from the paper. Instead of the formula shown in the notebook, it should be the one from the paper,

v_j = (||s_j||^2 / (1 + ||s_j||^2)) * (s_j / ||s_j||)

so the first fraction is a factor slightly below 1, and the second one normalizes the vector coordinates by the magnitude.
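
For illustration, here is a toy example (mine, not from the notebook) of the paper's formula applied to a single vector; the resulting magnitude is ||s||^2 / (1 + ||s||^2), which approaches but never reaches 1:

import torch

# paper's squash applied to one vector s with ||s|| = 5
s = torch.tensor([3.0, 4.0])
squared_norm = (s ** 2).sum()                       # ||s||^2 = 25
v = squared_norm / (1 + squared_norm) * s / torch.sqrt(squared_norm)
print(torch.norm(v))                                # 25/26 ≈ 0.9615, slightly below 1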

As far as I can see, the implementation follows the second formula and seems to be correct, except that I am not sure about the normalization dimension for the primary capsules. According to the explanations in the notebook, each primary capsule outputs a vector of size 32 * 6 * 6. These vectors are then stacked and, taking the batch dimension into account, we get a tensor of shape

(batch_size, num_nodes_in_capsule = 32 * 6 * 6, num_capsules = 8)

Finally, these vectors are normalized, i.e. their magnitudes are squashed into the range from 0 to 1. If I understand correctly, you are talking about the magnitude of the (32 * 6 * 6)-dimensional vectors. So if we want to ensure that the length of these vectors lies in the range [0; 1], we would have to divide each of the 32 * 6 * 6 coordinates by the square root of the sum of squares of those coordinates. Right? In fact, the implementation divides each coordinate by the magnitude of a vector made up of the coordinates in the same position across all capsule vectors: note that dim is set to -1 when calculating squared_norm, i.e. it sums the same feature over different capsules.

Please, consider the following example:

import torch
import numpy as np

def squash(input_tensor):
    '''Squashes an input Tensor so it has a magnitude between 0-1.
        param input_tensor: a stack of capsule inputs, s_j
        return: a stack of normalized, capsule output vectors, v_j
        '''
    squared_norm = (input_tensor ** 2).sum(dim=-1, keepdim=True)    
    scale = squared_norm / (1 + squared_norm) # normalization coeff
    output_tensor = scale * input_tensor / torch.sqrt(squared_norm)    
    return output_tensor

np.random.seed(1)
torch.manual_seed(1)
batch_size = 15
dim = 13
n_caps = 7
# stack n_caps capsule outputs of shape (batch_size, dim, 1) along the last dimension
u = [torch.tensor(np.random.rand(batch_size, dim, 1)) for i in range(n_caps)]
#print(u)
u = torch.cat(u, dim=-1)
print("u:", u)

u_squash = squash(u)
print("u_squash:", u_squash)

mag = torch.sqrt( (u_squash **2).sum(dim=-2) )
print("mag: ", mag)

Here I create a randomly filled tensor of shape (batch_size, dim, n_caps), i.e. similar to the one produced by the primary capsules. The tensor is squashed by the same function used in the notebook. It can be seen from the output that some of the vector magnitudes exceed 1, i.e. they are not confined to the range [0; 1]:

mag:  tensor([[0.6629, 1.0954, 0.9715, 0.7817, 1.0211, 0.7117, 0.8847],
        [1.0202, 0.9313, 0.8816, 0.8383, 1.0355, 0.9926, 1.0803],
        [0.8864, 1.0694, 0.7617, 0.9194, 0.8355, 0.9432, 1.0051],
        [0.9630, 0.9198, 0.9078, 1.0516, 0.8845, 0.7888, 0.9238],
        [0.6996, 1.0998, 1.1319, 0.6556, 0.8243, 0.9571, 0.9614],
        [0.9705, 0.9879, 0.8915, 0.8308, 1.0063, 1.0607, 0.9306],
        [1.0569, 1.0294, 0.9268, 1.0508, 0.9768, 0.9505, 0.8103],
        [0.9545, 0.9655, 0.9052, 1.0720, 0.7246, 0.9666, 0.9669],
        [1.1237, 0.9768, 0.9749, 0.8128, 0.8935, 0.9216, 0.7607],
        [0.8785, 0.7155, 0.8306, 0.8913, 0.9764, 0.9692, 1.0892],
        [0.9691, 0.8658, 1.0399, 0.9774, 0.9309, 0.8950, 0.8872],
        [0.7124, 1.1386, 0.8535, 1.0913, 0.8478, 0.8779, 0.9850],
        [0.8909, 0.9851, 0.9247, 1.0239, 0.7927, 0.9618, 0.7925],
        [0.8764, 0.9524, 0.9294, 0.8517, 0.8385, 0.9380, 1.0824],
        [1.0076, 0.8668, 1.0051, 0.9030, 1.0067, 0.8850, 0.9519]],
       dtype=torch.float64)

What the implementation actually constrains to that range are the magnitudes of vectors made up of a particular coordinate taken from each capsule's output. But was that intended?
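
For comparison, here is a sketch of the same squash but normalizing over the feature dimension (dim=-2 in the layout above), reusing the u tensor from the example; with that choice the magnitude of every (dim)-sized capsule vector stays below 1:

def squash_over(input_tensor, dim=-2):
    '''Same squash, but the norm is taken over a chosen dimension (for comparison only).'''
    squared_norm = (input_tensor ** 2).sum(dim=dim, keepdim=True)
    scale = squared_norm / (1 + squared_norm)
    return scale * input_tensor / torch.sqrt(squared_norm)

u_squash2 = squash_over(u, dim=-2)                 # normalize each (dim)-sized vector
mag2 = torch.sqrt((u_squash2 ** 2).sum(dim=-2))    # per-capsule-vector magnitudes
print(mag2.max())                                  # strictly below 1 everywhere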

cezannec commented 7 months ago

Apologies that I did not respond earlier, I must have missed this. From the paper, the s_j variable is in part a sum over the prediction vectors û_{j|i} from the capsules in the layer below, so a sum across different capsules is expected.

For your question about magnitude: you'll see in the code that the squashed values (u_squash in your example) are the ones that are normalized; if you look at the printout of u_squash, those should all be less than 1 :)
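
For reference, a minimal sketch of that sum from the paper's routing step, s_j = Σ_i c_ij · û_{j|i}, with hypothetical shapes and names (not the code from this repository):

import torch

batch_size, num_lower, num_upper, out_dim = 2, 1152, 10, 16
u_hat = torch.randn(batch_size, num_lower, num_upper, out_dim)   # prediction vectors û_{j|i}
b = torch.zeros(batch_size, num_lower, num_upper)                # routing logits
c = torch.softmax(b, dim=-1)                                     # coupling coefficients c_ij
s = (c.unsqueeze(-1) * u_hat).sum(dim=1)                         # s_j: sum over the lower-layer capsules i
# s has shape (batch_size, num_upper, out_dim); squash then acts on the last (capsule-vector) dimension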
