elixir-nx / bumblebee

Pre-trained Neural Network models in Axon (+ 🤗 Models integration)
Apache License 2.0

Add DINOv2 model #334

Closed joelpaulkoch closed 4 months ago

joelpaulkoch commented 4 months ago

This is the current state of my work on DINOv2.

facebook/dinov2-base uses BitImageProcessor, so I've copied VitFeaturizer -> BitFeaturizer and made the following changes:

  1. add configuration options
  2. support "shortest_edge" resizing (I noticed this changed in #329)
  3. center crop the image
  4. rescale the image with rescale_factor (and remove NxImage.to_continuous in process_batch)
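
A rough sketch of how such a preprocessing pipeline might look, purely for illustration. This is not the PR's actual code; the NxImage.resize_short/3 and NxImage.center_crop/2 calls, their option names, and the variable names are assumptions based on the nx_image library:

```elixir
# Hedged sketch of the BitFeaturizer-style preprocessing described above.
image
# resize so that the shortest edge becomes `shortest_edge`, keeping the aspect ratio
|> NxImage.resize_short(shortest_edge, method: :bilinear)
# center-crop to the target spatial size, e.g. {224, 224}
|> NxImage.center_crop({crop_height, crop_width})
# rescale pixel values, e.g. with rescale_factor = 1 / 255 (promotes to float)
|> Nx.multiply(rescale_factor)
```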

For the model itself I've copied Vit to DinoV2 and basically changed three blocks:

  1. Interpolation of positional encodings: The pre-trained positional encodings must be interpolated so they can be applied to other image resolutions (interpolate_position_encoding). For the interpolation we need the actual input size. I've hard-coded the input size to 224 and retrieved it using Axon.get_inputs. Is there a better way to do it? The current implementation of the interpolation is not exactly the same as in the transformers library, so this probably introduces a difference in the calculation. Does Axon.resize return exactly the same result as torch.nn.functional.interpolate for the same input?

  2. Encoder blocks: In DINOv2 the ffn is either mlp or swiglu, depending on the configuration. I could pass the corresponding function to the blocks. I still copied blocks, block, and block_impl from Bumblebee.Layers.Transformer because I needed two additional scaling layers in block_impl. This brings quite a bit of duplicated code into the DINOv2 implementation for a small change, so I'm wondering whether it would make sense to introduce a block_impl parameter to Bumblebee.Layers.Transformer.blocks.

  3. Mapping the encoder output depending on the architecture: For :base, in comparison to Vit, the pooled output is simply the first token along axis 1. For :backbone, you can pass a list of stage_names and specify which of them you want as output_features in the configuration; the corresponding hidden states are then included in the output. In transformers there is another option to specify out_indices instead, which I did not implement. For the :for_image_classification architecture, there is a classification head on top of the class token and the mean of the patch embeddings (a rough sketch of this pooling follows below).
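
For illustration, here is a hedged sketch of the :for_image_classification pooling described in point 3, i.e. concatenating the class token with the mean of the patch embeddings before the classification head. Variable names are made up and this is not the PR's actual code:

```elixir
# hidden_state is the final encoder output node, shaped {batch, 1 + num_patches, hidden_size}
pooled =
  Axon.nx(hidden_state, fn hidden_state ->
    # class token: first position along the sequence axis
    cls_token =
      hidden_state
      |> Nx.slice_along_axis(0, 1, axis: 1)
      |> Nx.squeeze(axes: [1])

    # mean over the remaining patch embeddings
    patch_mean =
      hidden_state
      |> Nx.slice_along_axis(1, Nx.axis_size(hidden_state, 1) - 1, axis: 1)
      |> Nx.mean(axes: [1])

    # the classification head is then applied on top of this concatenation
    Nx.concatenate([cls_token, patch_mean], axis: -1)
  end)
```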

I've tried to follow the naming conventions, but I may have missed them in some places. Parts of the documentation are still just copy/paste from transformers. At the moment the tests are configured to run the model from facebook/dinov2-base and won't pass.

jonatanklosko commented 4 months ago

@joelpaulkoch it's the interpolation that causes the difference; I need to investigate more to figure out specifically why.

jonatanklosko commented 4 months ago

FTR, the primary part of the difference in the interpolation was a bug: https://github.com/elixir-nx/axon/pull/554. There are still at least two other differences in the implementation (anti-aliasing and scaling); I will keep looking into this.

jonatanklosko commented 4 months ago

Regarding interpolation, I added anti-aliasing to Axon (https://github.com/elixir-nx/axon/pull/555), which brings the numbers closer to PyTorch. They are still far from an exact match, because the anti-aliasing behaves slightly differently and there is a small extra scaling in the interpolation call. That said, I think we are fine at this point. The model is trained for a specific size, and the interpolation is only used at inference time when a different input size is given, so my understanding is that as long as the interpolation does its job we should expect reasonable output. In hf/transformers the ViT interpolation is implemented in both PyTorch and TensorFlow, and the two are definitely not identical (our interpolation should basically match the TensorFlow one, though).


> For the interpolation we need the actual input size. I've hard-coded the input size to 224 and retrieved it using Axon.get_inputs. Is there a better way to do it?

I changed the input height/width to be unspecified ({nil, nil, nil, spec.num_channels}). When building the model we don't know the size (Axon.get_inputs(pixel_values)["pixel_values"]), but in this case we can make all of the interpolation an Axon layer, so that we can read the input tensor shape once the model is compiled with a concrete input size :)
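
A minimal sketch of that idea, under stated assumptions: interpolate_position_embeddings is a hypothetical helper, position_embeddings is assumed to be another Axon node (e.g. a learned parameter), and the actual PR code may differ.

```elixir
# Spatial dims left as nil, matching the spec described above (channels-last)
input = Axon.input("pixel_values", shape: {nil, nil, nil, spec.num_channels})

interpolated_position_embeddings =
  Axon.layer(
    fn pixel_values, position_embeddings, _opts ->
      # Inside the layer the compiled input has a concrete shape, so the
      # target size for the interpolation can be read from the tensor itself.
      {_batch, height, width, _channels} = Nx.shape(pixel_values)
      # hypothetical helper performing the actual resize of the embeddings
      interpolate_position_embeddings(position_embeddings, height, width)
    end,
    [input, position_embeddings]
  )
```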

joelpaulkoch commented 4 months ago

This is great, thanks!

jonatanklosko commented 4 months ago

> so I'm wondering whether it would make sense to introduce a block_impl parameter to Bumblebee.Layers.Transformer.blocks

I extended :block_type to allow for a function. I was also considering supporting "hooks" between each block layer, but that's more indirect and less flexible, so I went with the custom function. The function receives the various layers as anonymous functions and can apply them in any order and with any custom layers on top. This feels a bit coupled, but fwiw it's all internal anyway.
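
As a purely hypothetical illustration of this interface (the function arity, the contents of the steps map, and the scale_layer helper are made up and likely differ from the actual internal code), a DINOv2-style block function could look roughly like this, inserting the extra scaling layers around the standard steps:

```elixir
block_fun = fn hidden_state, steps, _name ->
  # attention sub-block, with a hypothetical extra scale layer before the residual add
  shortcut = hidden_state

  hidden_state =
    hidden_state
    |> steps.self_attention_norm.()
    |> steps.self_attention.()
    |> scale_layer()

  hidden_state = Axon.add(shortcut, hidden_state)

  # ffn sub-block (mlp or swiglu), again with extra scaling before the residual add
  shortcut = hidden_state

  hidden_state =
    hidden_state
    |> steps.ffn_norm.()
    |> steps.ffn.()
    |> scale_layer()

  Axon.add(shortcut, hidden_state)
end
```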

> In transformers there is another option to specify out_indices instead, which I did not implement.

I actually changed it so that we only use the indices. The stage names are not really much more meaningful than indices, since they are always "stem", "stage1", "stage2", ..., "stageN". By using the indices we don't need to worry about generating/loading the stage names, so I think it's a win.

I also removed the option to reshape feature maps, in favour of always doing it. Reshaping to the flat shape is easy to do if the caller wants that (which would be a wrapping model, rather than the end user anyway).
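
For reference, flattening a spatial feature map back to the flat layout is a one-liner for a caller that wants it (assuming channels-last {batch, height, width, channels} feature maps; feature_map is an illustrative variable name):

```elixir
# Flatten {batch, height, width, channels} -> {batch, height * width, channels}
{batch, height, width, channels} = Nx.shape(feature_map)
flat = Nx.reshape(feature_map, {batch, height * width, channels})
```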


This PR is ready to ship; I will wait for https://github.com/elixir-nx/axon/pull/555 and then merge :) Thanks @joelpaulkoch :)

joelpaulkoch commented 4 months ago

Yeah, sounds reasonable. Thank you very much for getting this into proper shape.