Closed by puhach 7 months ago
Apologies that I did not respond earlier; I must have missed this. From the paper, the $s_j$ variable is in part a sum over the prediction vectors $\hat{u}_{j|i}$ from the capsules in the layer below. So a sum from different capsules is expected.
Regarding your question about magnitude: you'll see in the code that the squashed values (u_squash in your example) are the ones that are normalized; if you print u_squash, the values should all be less than 1 :)
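For reference, this is how I read the paper's definition. Here $c_{ij}$ are the coupling coefficients produced by routing, and $W_{ij}$, $u_i$ are the paper's weight matrices and lower-layer capsule outputs (symbols taken from the paper, not from the notebook code):

$$s_j = \sum_i c_{ij}\,\hat{u}_{j|i}, \qquad \hat{u}_{j|i} = W_{ij}\,u_i$$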
Very nice tutorial, though I want to point out that the squash formula in the notebook differs from the one in the paper. Instead of the formula given in the notebook, it should be

$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \, \frac{s_j}{\|s_j\|},$$

so the first fraction is a factor slightly below 1, and the second one normalizes the vector coordinates by the magnitude.
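To make the two factors concrete, here is a minimal sketch of the paper's squash for a single vector (a scale factor below 1 times the unit direction). It is plain Python rather than the notebook's PyTorch code, and the function name `squash` and the sample input are only illustrative. It assumes the input is not the zero vector.

```python
import math

def squash(s):
    """Squash one vector: (|s|^2 / (1 + |s|^2)) * (s / |s|)."""
    squared_norm = sum(x * x for x in s)
    norm = math.sqrt(squared_norm)               # assumes s is not all zeros
    scale = squared_norm / (1.0 + squared_norm)  # first factor, always below 1
    return [scale * x / norm for x in s]         # second factor: unit direction

v = squash([3.0, 4.0])                           # input has length 5
length = math.sqrt(sum(x * x for x in v))
print(v, length)                                 # length is 25/26, about 0.96
```

The direction of the input is preserved, and only its length is mapped into [0, 1).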
As far as I can see, the implementation follows the paper's formula and seems to be correct, except that I am not sure about the normalization dimension for the primary capsules. According to the explanations in the notebook, each primary capsule outputs a vector of size 32 × 6 × 6. These vectors are then stacked and, taking the batch dimension into account, we get a tensor of shape (batch_size, 32 × 6 × 6, n_caps).
Finally, these vectors are normalized, i.e. their magnitudes are squashed into the range from 0 to 1. If I understand correctly, you are talking about the magnitude of the (32 × 6 × 6)-dimensional vectors. So if we want to ensure that the length of these vectors is in the range [0; 1], we would have to divide each of the 32 × 6 × 6 coordinates by the square root of the sum of squares of these coordinates. Right? In fact, the implementation divides each coordinate by the magnitude of a vector composed of the coordinates at the same position in all capsule vectors. Note that dim is set to -1 when calculating squared_norm, i.e. it sums the squares of the same feature across different capsules.
Please, consider the following example:
Here I create a randomly filled tensor of shape (batch_size, dim, n_caps), i.e. similar to those produced by the primary capsules. The tensor is squashed by the same function used in the notebook. It can be seen from the output that the magnitudes of the vectors exceed the range [0; 1]:
The implementation actually constrains the magnitudes of vectors made up of particular coordinates from different capsule outputs to that range. But was that intended?
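The check described above can be sketched roughly as follows. This is not the original snippet: it uses plain Python lists instead of PyTorch tensors, drops the batch dimension, and picks small illustrative sizes for `dim` and `n_caps`. The helper `squash_last_axis` is a hypothetical stand-in for a squash that sums squares over the last axis (the `dim=-1` case), under the reading in this comment where each capsule's vector has `dim` coordinates.

```python
import math
import random

random.seed(0)
dim, n_caps = 64, 8  # illustrative sizes, not the notebook's

# One sample of shape (dim, n_caps): u[d][c] is coordinate d of capsule c.
u = [[random.gauss(0.0, 1.0) for _ in range(n_caps)] for _ in range(dim)]

def squash_last_axis(t):
    """Squash each row, i.e. sum the squares over the last axis (like dim=-1)."""
    out = []
    for row in t:
        squared_norm = sum(x * x for x in row)
        scale = squared_norm / (1.0 + squared_norm)
        out.append([scale * x / math.sqrt(squared_norm) for x in row])
    return out

u_squash = squash_last_axis(u)

# Norm over the last axis: one coordinate position taken across all capsules.
row_norms = [math.sqrt(sum(x * x for x in row)) for row in u_squash]

# Norm over the dim axis: the full dim-sized vector of a single capsule.
col_norms = [
    math.sqrt(sum(u_squash[d][c] ** 2 for d in range(dim)))
    for c in range(n_caps)
]

print(max(row_norms))                  # below 1 by construction
print(min(col_norms), max(col_norms))  # not constrained to [0, 1]
```

With this layout only the norms taken across capsules are bounded by 1. The per-capsule norms over the `dim` axis are left unconstrained, which is the behaviour the question is about.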