riccardorenzulli opened 7 months ago
I'm having the same issue; @patricklabatut, any idea?
For what it's worth, I've seen this 'high norm' pattern occur with the dinov2-based image encoder used in the depth-anything model. It happens on the vit-l, vit-b and to some extent even on the vit-s model. A similar pattern appears using the 'ViT-L/14 distilled' backbone (from the dinov2 listing), but it's only visible on internal blocks.
Here are the norms of the different output blocks for vit-l (the depth-anything version) running on a picture of a turtle:
Here are some more examples:
Some notes:
- Obviously it's not a conclusive result; I've only tried this on a few images. But it does seem similar to the effect described in the 'register' paper.
As a quick follow-up, I've tried this with the original dinov2 model & weights and got the same results. The original weights always have smaller norms on their final output (compared to the depth-anything weights), but vit-b & vit-l both show high norms internally. Results from vit-g have high norms even on the final output.
Here is an animation of the vit-g block norms (first-to-last) showing qualitatively similar results to the paper:
The 'with registers' versions of the models don't completely get rid of high norms in the later layers, but they do get rid of outliers.
For anyone wanting to try this, here's some code that uses the dinov2 repo/models and prints out the min & max norms for each block. Just make sure to set an image path and model name at the top of the script (use any of the pretrained backbone names from the repo listing):
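A minimal sketch along those lines, assuming the model is loaded through `torch.hub` and using the repo's `prepare_tokens_with_masks` and `blocks` internals (exact details may differ from the original script):

```python
import torch
from PIL import Image
from torchvision import transforms

IMAGE_PATH = "path/to/image.jpg"  # <- set this
MODEL_NAME = "dinov2_vitl14"      # <- any pretrained backbone name from the repo listing

model = torch.hub.load("facebookresearch/dinov2", MODEL_NAME)
model.eval()

# Resize so both sides are multiples of the 14px patch size
img = Image.open(IMAGE_PATH).convert("RGB")
w, h = img.size
w, h = (w // 14) * 14, (h // 14) * 14
preprocess = transforms.Compose([
    transforms.Resize((h, w)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
x = preprocess(img).unsqueeze(0)

# Tokens to skip when measuring patch norms (cls token + any register tokens)
num_skip = 1 + getattr(model, "num_register_tokens", 0)

block_norms = []
with torch.inference_mode():
    tokens = model.prepare_tokens_with_masks(x)
    for idx, block in enumerate(model.blocks):
        tokens = block(tokens)
        norms = tokens[0, num_skip:].norm(dim=-1)  # L2 norm of every patch token
        block_norms.append(norms)
        print(f"Block {idx}: min={norms.min().item():.1f}  max={norms.max().item():.1f}")
```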
And here's some code that can be added to the end of the code above for generating the visualizations (it pops up a window, so you need to be running the code locally).
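A sketch of that plotting step, reusing `block_norms`, `h` and `w` from the script above and assuming matplotlib is available:

```python
import math
import matplotlib.pyplot as plt

# One heatmap of patch-token norms per block, laid out on the patch grid
grid_h, grid_w = h // 14, w // 14
ncols = 6
nrows = math.ceil(len(block_norms) / ncols)
fig, axes = plt.subplots(nrows, ncols, figsize=(2 * ncols, 2 * nrows))
for idx, ax in enumerate(axes.flat):
    ax.axis("off")
    if idx < len(block_norms):
        ax.imshow(block_norms[idx].reshape(grid_h, grid_w).cpu().numpy())
        ax.set_title(f"block {idx}", fontsize=8)
plt.tight_layout()
plt.show()  # opens a window, so run locally
```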
@heyoeyo Thanks for the thorough explanations. I'll take a look.
Thank you very much @heyoeyo for your help and insights. We discovered that the problem in our code was the default value of `True` for the `norm` argument in `x_layers = model.get_intermediate_layers(img, [len(model.blocks)-1])`. By adding `norm=False` and collecting the embeddings for all layers, we get the same results as yours.
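For reference, the corrected call looks roughly like this (a sketch, assuming `model` and a preprocessed image tensor `img` as in the script above):

```python
# norm=True (the default) applies the model's final LayerNorm to each
# returned layer, which shrinks the norms and hides the outlier tokens
layer_indices = list(range(len(model.blocks)))
x_layers = model.get_intermediate_layers(img, layer_indices, norm=False)
for idx, tokens in enumerate(x_layers):
    norms = tokens[0].norm(dim=-1)  # L2 norm of each patch token
    print(f"Layer {idx}: min={norms.min().item():.1f}  max={norms.max().item():.1f}")
```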
As you pointed out, surprisingly, the last-layer norms of the model without registers are not that high, while for the model with registers the norms become high but without outliers. I was surprised by this, especially given Figures 7 and 15 of the paper.
Agreed! The output layer of the vitl-reg model has norms in the 150-400 range for the few images I've tried, as opposed to the <50 range reported by the paper.
I also find fig. 3 vs figs. 7 & 15 confusing, as fig. 3 suggests a non-register high-norm range of ~200-600 (consistent with what I've seen), whereas figs. 7 & 15 show a 100-200 range for the high-norm tokens. Though I may be misinterpreting the plots.
Hello,
I'm having trouble identifying high-norm tokens, as described in the "Vision Transformers Need Registers" paper. I've seen that it is also mentioned at https://github.com/facebookresearch/dinov2/issues/293.
I used the L2 norm and the code from https://github.com/facebookresearch/dinov2/pull/306. To get the embedding vectors of the last layer, I use `x_layers = model.get_intermediate_layers(img, [len(model.blocks)-1])`.
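For concreteness, the norm computation is roughly the following (a sketch, assuming `img` is a preprocessed image batch; note the call uses the default arguments):

```python
x_layers = model.get_intermediate_layers(img, [len(model.blocks) - 1])
patch_tokens = x_layers[0]              # (batch, num_patches, embed_dim)
norms = patch_tokens.norm(p=2, dim=-1)  # L2 norm of each patch token
print(norms.min().item(), norms.max().item())
```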
I tried with ViT-G/14 on the full ImageNet validation set, with and without registers; however, as you can see in the images below, the norms from the model without registers never exceed 150, the high-norm cutoff given in the paper.
Did anyone succeed in reproducing the results of the main paper and identifying these high-norm tokens?