Closed: imxtx closed this issue 9 months ago
Hi, the "8 x 14 x 14 3D tokens" mentioned in the paper means there are 8 x 14 x 14 = 1568 tokens in the sequence, not that each token has the shape (8 x 14 x 14).
That is why the next line highlights that the cube dimension is (2 x 16 x 16). Hope this answers your question.
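A quick check of that arithmetic, using the input and cube sizes quoted from the paper:

```python
# Input video: 3 x 16 x 224 x 224; cube size: 2 x 16 x 16 (tubelet, patch, patch)
frames, height, width = 16, 224, 224
tubelet, patch = 2, 16

# Number of cubes along each axis
nt, nh, nw = frames // tubelet, height // patch, width // patch
print(nt, nh, nw)        # 8 14 14
print(nt * nh * nw)      # 1568 tokens, each embedded as a 768-d vector
```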
I see, and that's why I think the description "The cube embedding layer generates ..." is not correct. Saying the layer *generates* something means it *outputs* that thing, but the output of the 3D embedding layer has shape [Batch, 8*14*14, 768]:
```python
# part of the __init__ method of the PatchEmbedding3d class
# [Batch, 3, 16, 224, 224] -> [Batch, 768, 8, 14, 14]
self.projection = Conv3d(c, embedding, kernel_size=(pt, ph, pw), stride=strides)
self.has_norm = build_normalization is not None
if self.has_norm:
    self.normalization = build_normalization()
# [Batch, 768, 8, 14, 14] -> [Batch, 8*14*14, 768]
self.rearrange = Rearrange("b d nt nh nw -> b (nt nh nw) d")
```
The code of the Encoder:

```python
# ...
self.embed_dim = embed_dim
self.patch_embedding = PatchEmbedding3d(
    input_size=(3, n_frames, img_size, img_size),  # [3, 16, Height, Width]
    patch_size=(tubelet_size, patch_size, patch_size),  # [2, 16, 16]
    embedding=embed_dim,
)  # output size [Batch, 8*14*14, 768]
# ...
```
In ViT (no matter whether 2D patches or 3D cubes), we can use a Conv layer to embed the tokens into vectors (here, 768-d vectors). With the stride equal to the kernel size, this is equivalent to manually reshaping the input into patches and embedding them with a linear layer.
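That equivalence can be sketched in plain NumPy. This is a toy-scale illustration (all dimension values here are made up and much smaller than the paper's 3 x 16 x 224 x 224), not the repo's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical, scaled down from the paper's setting)
B, C, T, H, W = 1, 3, 4, 8, 8
pt, ph, pw = 2, 4, 4              # cube size (tubelet, patch, patch)
E = 6                             # embedding dim (768 in the paper)
nt, nh, nw = T // pt, H // ph, W // pw

x = rng.standard_normal((B, C, T, H, W))
weight = rng.standard_normal((E, C, pt, ph, pw))  # Conv3d-style weight layout

# View 1: "convolution" with stride == kernel size (naive loop over cubes)
conv_out = np.empty((B, E, nt, nh, nw))
for t in range(nt):
    for h in range(nh):
        for w in range(nw):
            cube = x[:, :, t*pt:(t+1)*pt, h*ph:(h+1)*ph, w*pw:(w+1)*pw]
            # contract (C, pt, ph, pw) of the cube against each output channel
            conv_out[:, :, t, h, w] = np.einsum("bcijk,ecijk->be", cube, weight)

# View 2: "reshape + linear": cut cubes, flatten each, multiply by a matrix
cubes = (x.reshape(B, C, nt, pt, nh, ph, nw, pw)
          .transpose(0, 2, 4, 6, 1, 3, 5, 7)   # b nt nh nw c pt ph pw
          .reshape(B, nt * nh * nw, C * pt * ph * pw))
linear_out = cubes @ weight.reshape(E, -1).T   # (B, n_tokens, E)

# Rearrange the conv output: b e nt nh nw -> b (nt nh nw) e
conv_tokens = conv_out.reshape(B, E, -1).transpose(0, 2, 1)

print(np.allclose(conv_tokens, linear_out))    # True
```

The same weights produce identical token embeddings either way, which is why the Conv3d in the code and the "cube embedding" described in the paper are two views of the same operation.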
Paper:
Given an input video (having a dimension 3 × 16 × 224 × 224), the cube embedding layer generates 8 × 14 × 14 3D tokens of dimension 768 to preserve spatio-temporal patterns.
In the released source code, 2 x 16 x 16 is the kernel size instead of the size of the 3D tokens:
Code for testing the 3D convolution:
Output:
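The referenced test snippet and its output did not survive above. A minimal shape check along the same lines (a sketch, not the original script) could look like this:

```python
import torch
from torch.nn import Conv3d

# Cube embedding: 2 x 16 x 16 is the kernel size, with stride == kernel size
projection = Conv3d(3, 768, kernel_size=(2, 16, 16), stride=(2, 16, 16))

video = torch.randn(1, 3, 16, 224, 224)       # [Batch, C, T, H, W]
features = projection(video)
print(features.shape)                          # torch.Size([1, 768, 8, 14, 14])

# b d nt nh nw -> b (nt nh nw) d, same as the Rearrange in PatchEmbedding3d
tokens = features.flatten(2).transpose(1, 2)
print(tokens.shape)                            # torch.Size([1, 1568, 768])
```

So the layer outputs a sequence of 1568 tokens of dimension 768, while 8 x 14 x 14 describes the token grid, not the shape of any single token.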