Closed enverfakhan closed 3 years ago
I realized this solution is prone to floating-point error. An example of such an error is the following:
npatch = 3904
N = 784
w, h = 491, 512
self.patch_embed.patch_size = 8
w0 = w // self.patch_embed.patch_size # 61
h0 = h // self.patch_embed.patch_size # 64
pos_embed = nn.functional.interpolate(
    pos_embed.reshape(1, int(math.sqrt(N)), int(math.sqrt(N)), dim).permute(0, 3, 1, 2),
    scale_factor=(w0 / math.sqrt(N), h0 / math.sqrt(N)),
    mode='bicubic',
)
pos_embed = pos_embed.permute(0, 2, 3, 1).view(1, -1, dim)  # pos_embed.shape[1] is 3840 (60 * 64) which is supposed to be 3904 (61 * 64)
This happens because w0 / math.sqrt(N) * math.sqrt(N) is not equal to w0. nn.functional.interpolate casts the scaled size to int (I assume), and w0 / math.sqrt(N) * math.sqrt(N) is something like 60.999999999..., which becomes 60 when cast to int. A solution to this problem would be to add a small number (0.1) to w0 and h0.
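The truncation can be explored without PyTorch. The sketch below assumes the output length is computed as floor(input_length * scale_factor), which is how the PyTorch documentation describes the output shape for upsampling with a scale factor. The helpers output_len and mismatches are hypothetical names used only for illustration.

```python
import math


def output_len(in_len: int, scale: float) -> int:
    """Output length under the assumed rule floor(in_len * scale)."""
    return math.floor(in_len * scale)


def mismatches(grid: int, max_target: int, offset: float = 0.0) -> list:
    """Target sizes in 1..max_target that are not recovered after scaling."""
    side = math.sqrt(grid * grid)  # mirrors math.sqrt(N) for a square grid
    bad = []
    for target in range(1, max_target + 1):
        scale = (target + offset) / side
        if output_len(grid, scale) != target:
            bad.append(target)
    return bad


# Targets that come out wrong for a 28 x 28 grid (N = 784) with no offset.
print(mismatches(28, 100))
# With the 0.1 offset suggested above, every target is recovered.
print(mismatches(28, 100, offset=0.1))  # []
```

Which targets fail without the offset depends on how each division happens to round, so the first list is best read off by running the snippet rather than predicted.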
However, this raises another question: does interpolating the positional embeddings make sense? I mean, does this operation happen during training, or are all the training images already the right shape so that interpolation never occurs? In that case, interpolating learned positional embeddings would be illegitimate, right?
Hi @enverfakhan
Thanks for raising this issue. Yes, I totally agree that something could be done to simplify/unify the code there a bit...
For the floating-point error, I've found this workaround: https://github.com/facebookresearch/dino/blob/1d06521adfc53c80dece1a74902f718873e9821d/vision_transformer.py#L235-L240
> I mean does this operation happen during training or all the images in the training are just in the right shape and interpolation never occur?
This operation actually happens during training. Indeed, with multi-crop the model is trained both with images of 224x224 and images of 96x96. So when forwarding a batch of 96x96 images we need to interpolate the encodings. As a matter of fact, in my experiments I also tried having two sets of encodings. In practice that means that I was using different encodings for the 224x224 and for the 96x96 inputs. This solution has exactly the same performance as bicubic interpolation, which makes me think that the interpolation solution makes sense.
Hi @mathildecaron31, thanks for the response. I also appreciate the insight about interpolation vs. separate pos_embed a lot. I would be curious how that would behave in the wild with completely different sizes. I actually tried deit_small(patch_size=8) for a retrieval task on in-house data; it seems to work on par with a supervised VGG ImageNet model. However, I had to set the image sizes to [224, 224] because some of the images blow up the memory during the attention computation.
About the workaround for the floating-point error, I feel like incrementing w0 and h0 by a small amount is more legitimate than zero-padding the pos_embed, but it is probably not a big deal, especially if the image size is relatively big.
Looking forward to DINO on large, random, uncurated datasets.
> I actually tried the deit_small(patch_size=8) for retrieval task on a in-house data, it seems to be working on par with a supervised vgg imagenet,
That's slightly disappointing :/. Have you tried the other models? For example, ViT-Base/16 should be more manageable memory-wise. As a matter of fact, on copy detection datasets, I've found the base models to perform clearly better than the small ones: I get better performance with Base16x16 than with Small8x8, though Small8x8 is better at k-NN ImNet.
> About the workaround for the floating point error, I feel like incrementing the w0 and h0 a small amount is more legit than zero padding the pos_embed but it is probably not a big deal especially if the image size is relatively big.
Yes, your solution is definitely better! I'll update that in the code.
I picked the Small 8x8 because it was shown to perform better on k-NN ImNet, and since I was going to try zero-shot retrieval, this choice made more sense at the time. The result was indeed slightly disappointing; however, I haven't experimented exhaustively and I don't have quantitative results either. I only checked qualitatively, which you can only do for a handful of queries, so this result is not definitive at all. But I should say the in-house data is very different from ImageNet, so I wouldn't be very surprised if I got some weird results with either model.
> But I should say the in-house data is very different than the imagenet, so I wouldn't be very surprised if I got some weird result with either model.
I wonder if finetuning DINO models on the in-house data you have might help? But you mentioned that the results are on par with VGG pretrained on ImageNet, so I am not very sure. Probably still worth trying.
> I actually tried the deit_small(patch_size=8) for retrieval task on a in-house data, it seems to be working on par with a supervised vgg imagenet
I guess I unintentionally spread some misinformation. The images were RGBA and I was treating them as RGB. I'm sorry if I caused any confusion. However, after I accounted for the image format, the result still varies from query (image) to query. In some cases DINO outperforms a vgg_16 ImNet by far, but in other cases they are almost on par, or DINO is even worse. I haven't detected a consistent pattern for which images DINO outperforms or underperforms on, but so far it seems, anecdotally, that DINO outperforms on colorful images and is on par with (or worse than) vgg_16 ImNet on black-and-white images. By the way, I'm testing OpenAI's CLIP model with ViT too, and the CLIP model seems to be the worst among the three (I was betting that the CLIP model would be the best, but couldn't have been more wrong :) )
It's hard to come up with a quantitative evaluation. The images are multi-tagged and we are trying to retrieve similar images given a query. The tags are not reliable for evaluation because some similar images don't share any tag, and the reverse is also the case: different images may share a common tag (Instagram logo vs. Instagram app image). I'm planning to finetune for multi-class classification and try to get some numeric assessment out of that.
> I wonder if finetuning DINO models on the in-house data you have might help?
I strongly believe it would help, but I wonder which model would perform better after finetuning each. However, the preliminary result that DINO works better on colorful images is worth paying attention to.
> By the way, I'm testing OpenAI's clip model with ViT too and Clip model seems to be the worst among the three (I was betting on the clip model that it would be the best, but couldn't be more wrong :) )
Not surprised. :D cf. https://github.com/openai/CLIP/issues/1 and https://openai.com/blog/multimodal-neurons/
Thank you for the reference to the issue; it was super fun to read and check the examples (and of course eye-opening :) ). So the CLIP model is no longer an option.
@mathildecaron31 I have a question about copy detection. I am trying to evaluate the pretrained DINO models on a dataset for a copy detection task and I am trying to follow the steps from the paper. Even with different input image sizes in Table 4, we see that the final embedding dimension is 1536. I am not able to understand how we can get the same embedding dimension after concatenating the CLS embedding and the GeM-pooled output patch tokens for different input image sizes. Maybe I am missing a point here. Here is what I did:
Added the following method to VisionTransformer to return the output patch tokens and the CLS output.
def forward_output_patch_tokens_cls(self, x):
    B = x.shape[0]
    x = self.patch_embed(x)
    cls_tokens = self.cls_token.expand(B, -1, -1)
    x = torch.cat((cls_tokens, x), dim=1)
    pos_embed = self.interpolate_pos_encoding(x, self.pos_embed)
    x = x + pos_embed
    x = self.pos_drop(x)
    for blk in self.blocks:
        x = blk(x)
    if self.norm is not None:
        x = self.norm(x)
    return x
Using GeM module from here
def gem(x, p=3, eps=1e-6):
    "x: BS x num tokens x embed_dim"
    return F.avg_pool1d(x.clamp(min=eps).pow(p), (x.size(-1))).pow(1./p)
class GeM(nn.Module):
    def __init__(self, p=3, eps=1e-6):
        super(GeM,self).__init__()
        self.p = nn.Parameter(torch.ones(1)*p)
        self.eps = eps
    def forward(self, x):
        return gem(x, p=self.p, eps=self.eps)
    def __repr__(self):
        return self.__class__.__name__ + '(' + 'p=' + '{:.4f}'.format(self.p.data.tolist()[0]) + ', ' + 'eps=' + str(self.eps) + ')'
Collect embeddings (CLS + GeM Pooled Output Patch Tokens)
all_image_features = []
with torch.no_grad():
    for imgb in progress_bar(image_dl):
        outputs = model.forward_output_patch_tokens_cls(imgb.cuda())
        cls_token, output_patch_tokens = outputs[:,0],outputs[:,1:]
        cls_features = cls_token
        patch_features = gem_pooling(output_patch_tokens.permute(0,2,1)).squeeze(-1)
        concat_features = torch.cat([cls_features,patch_features],dim=-1)
        all_image_features.append(concat_features.cpu())
Following this and using an image size of 224 for dino_vitb8, my final embedding dimension is 1536 (corrected from 1568), which can also be calculated as:
cls_feature_dim*2 = 768*2
Question
Also, during the copy detection task, do you learn the pooling parameter p, or is it picked based on a validation set? I didn't quite understand the whitening part; is it the same as regular unsupervised PCA?
Found this paper: https://hal.inria.fr/hal-00722622v2/document. I believe the idea comes from here.
Edit:
Figured out the 1536 dimension size. We need to pool across token positions, so this gives pooled embedding with same dimension as cls token embedding dimension.
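A minimal sketch of why the dimension does not depend on the input size, written in plain Python rather than PyTorch so it runs without dependencies. The helpers gem_pool and descriptor are hypothetical names, and the toy dimension of 4 stands in for the 768-dimensional ViT-B embedding.

```python
def gem_pool(tokens, p=3.0, eps=1e-6):
    """Generalized-mean pool a list of token vectors over the token axis."""
    dim = len(tokens[0])
    pooled = []
    for d in range(dim):
        # Clamp to eps, raise to p, average over tokens, then take the p-th root.
        mean_pow = sum(max(tok[d], eps) ** p for tok in tokens) / len(tokens)
        pooled.append(mean_pow ** (1.0 / p))
    return pooled


def descriptor(cls_token, patch_tokens):
    """Concatenate the CLS vector with the GeM-pooled patch tokens."""
    return list(cls_token) + gem_pool(patch_tokens)


dim = 4  # toy stand-in for the embedding dimension
for num_patches in (9, 784):
    cls_token = [0.5] * dim
    patch_tokens = [[1.0] * dim for _ in range(num_patches)]
    # The descriptor length is 2 * dim regardless of how many patches there are.
    print(num_patches, len(descriptor(cls_token, patch_tokens)))
```

Because the pooling collapses the token axis, a larger image only changes how many values are averaged, not the length of the result.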
Hi @KeremTurgutlu , let me open a new issue :)
@enverfakhan I have incorporated your suggested fix for the floating-point error and have also been trying to improve the forward logic in the vision_transformer.py code. Thanks a lot for your suggestion; feedback is appreciated if you have some time :). https://github.com/facebookresearch/dino/blob/6687929d7cdc2e7a5150f6e24c2b6713293944ac/vision_transformer.py#L174-L233
I'm closing this issue. Feel free to reopen if there is another problem related to the interpolation of the positional encodings.
@mathildecaron31
> This operation actually happens during training. Indeed, with multi-crop the model is trained both with images of 224x224 and images of 96x96. So when forwarding a batch of 96^2 images we need to interpolate the encodings. As a matter of fact in my experiments I also tried having two sets of encodings. In practice that means that I was using different encodings for the 224x224 and for the 96x96 inputs. This solution has exactly the same performance as when performing bicubic interpolation, which makes me think that the interpolation solution makes sense.
patch_pos_embed = nn.functional.interpolate(
    patch_pos_embed.reshape(1, int(math.sqrt(N)), int(math.sqrt(N)), dim).permute(
        0, 3, 1, 2
    ),
    scale_factor=(w0 / math.sqrt(N), h0 / math.sqrt(N)),
    mode="bicubic",
)
So when a 96x96 cropped image is fed to the model during training, the positional embeddings learned for 224x224 inputs get scaled down to fit the 96x96 input? (Assuming a patch size of 8, the 28x28 grid of position embeddings gets downsized to 12x12.)
Couldn't we just use a subset of the original embeddings for the smaller images? As in text models, where we can give the model shorter sequences with no problems. Again assuming a patch size of 8, this means taking the first 12x12 subset of the whole 28x28 grid of positional embeddings.
I guess the current interpolation regime makes the model more invariant to the scale of the images ...
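For reference, the interpolation acts on the grid of patch positions rather than on pixels. A small sketch of the grid sizes involved, assuming square crops of 224 and 96 pixels as in the multi-crop setup described above (patch_grid is a hypothetical helper):

```python
def patch_grid(image_size: int, patch_size: int) -> int:
    """Number of patches along one side of a square image."""
    return image_size // patch_size


for patch_size in (8, 16):
    full = patch_grid(224, patch_size)   # global crops
    small = patch_grid(96, patch_size)   # local crops
    print(f"patch {patch_size}: {full}x{full} = {full * full} positions -> "
          f"{small}x{small} = {small * small} positions")
# patch 8: 28x28 = 784 positions -> 12x12 = 144 positions
# patch 16: 14x14 = 196 positions -> 6x6 = 36 positions
```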
Hi! I stumbled on the same issue when using dinov2, the code crashed on the same function when using rectangular input...
In the function to encode positions, this github issue was referenced:
def interpolate_pos_encoding(self, x, w, h):
    previous_dtype = x.dtype
    npatch = x.shape[1] - 1
    N = self.pos_embed.shape[1] - 1
    if npatch == N and w == h:
        return self.pos_embed
    pos_embed = self.pos_embed.float()
    class_pos_embed = pos_embed[:, 0]
    patch_pos_embed = pos_embed[:, 1:]
    dim = x.shape[-1]
    w0 = w // self.patch_size
    h0 = h // self.patch_size
    print(f'DEBUG dinov2 vision_trasnformer.py: w0={w0}, h0={h0}')
    # we add a small number to avoid floating point error in the interpolation
    # see discussion at https://github.com/facebookresearch/dino/issues/8
    w0, h0 = w0 + self.interpolate_offset, h0 + self.interpolate_offset
    print(f'DEBUG dinov2 vision_trasnformer.py: add small number w0={w0}, h0={h0}')
    sqrt_N = math.sqrt(N)
    sx, sy = float(w0) / sqrt_N, float(h0) / sqrt_N
    patch_pos_embed = nn.functional.interpolate(
        patch_pos_embed.reshape(1, int(sqrt_N), int(sqrt_N), dim).permute(0, 3, 1, 2),
        scale_factor=(sx, sy),
        mode="bicubic",
        antialias=self.interpolate_antialias,
    )
    print(f'DEBUG dinov2 vision_trasnformer.py: patch_pos_embed.shape={patch_pos_embed.shape}')
    assert int(w0) == patch_pos_embed.shape[-2]
    assert int(h0) == patch_pos_embed.shape[-1]
    patch_pos_embed = patch_pos_embed.permute(0, 2, 3, 1).view(1, -1, dim)
    return torch.cat((class_pos_embed.unsqueeze(0), patch_pos_embed), dim=1).to(previous_dtype)
I checked the output by feeding a rectangular image and found that the small addition did not change w0 and h0; see the output:
image_crop.shape=torch.Size([1, 3, 434, 546])
DEBUG dinov2 vision_trasnformer.py: w0=31, h0=39
DEBUG dinov2 vision_trasnformer.py: add small number w0=31.0, h0=39.0
DEBUG dinov2 vision_trasnformer.py: patch_pos_embed.shape=torch.Size([1, 384, 31, 38])
Note that in the init, interpolate_offset=0.1.
Here are the errors I got:
File "home/.cache/torch/hub/facebookresearch_dinov2_main/dinov2/models/vision_transformer.py", line 204, in interpolate_pos_encoding
assert int(w0) == patch_pos_embed.shape[-2]
AssertionError
Note: used pretrained dinov2_vits14_reg model.
Is there a reason the interpolate call doesn't set the output size directly to (w0,h0) using the size parameter, rather than using the scale_factor parameter?
> Is there a reason the interpolate call doesn't set the output size directly to (w0,h0) using the size parameter, rather than using the scale_factor parameter?
+1
Why can't we just do:
patch_pos_embed = nn.functional.interpolate(
    patch_pos_embed.reshape(1, h0, w0, dim).permute(0, 3, 1, 2),
    mode="bicubic",
    antialias=self.interpolate_antialias,
    **kwargs,
)
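One way to picture the suggestion: when the target size is passed directly, nothing is multiplied and floored, so the output shape cannot come out one short. The sketch below is plain Python and uses bilinear rather than bicubic sampling to stay dependency-free. resize_grid is a hypothetical helper that handles a single channel; it is not the DINOv2 implementation.

```python
import math


def resize_grid(grid, out_h, out_w):
    """Bilinearly resample a 2-D grid (list of rows) to exactly out_h x out_w."""
    in_h, in_w = len(grid), len(grid[0])

    def source_index(dst, in_len, out_len):
        # Map the output cell centre back to input coordinates, clamped to the grid.
        src = (dst + 0.5) * in_len / out_len - 0.5
        src = min(max(src, 0.0), in_len - 1.0)
        lo = int(math.floor(src))
        hi = min(lo + 1, in_len - 1)
        return lo, hi, src - lo

    out = []
    for y in range(out_h):
        y0, y1, fy = source_index(y, in_h, out_h)
        row = []
        for x in range(out_w):
            x0, x1, fx = source_index(x, in_w, out_w)
            top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
            bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


# A 37 x 37 source grid resized to the 31 x 39 target from the report above.
source = [[float(r * 37 + c) for c in range(37)] for r in range(37)]
resized = resize_grid(source, 31, 39)
print(len(resized), len(resized[0]))  # 31 39
```

In PyTorch the corresponding change would be to pass size=(w0, h0) to nn.functional.interpolate in place of scale_factor.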
I notice that the generation of the positional embedding in the interpolate_pos_encoding method is slightly different from the one in the forward_selfattention method. The following simple modification brings the two into line, in case it is of interest.