joelpaulkoch closed this 4 months ago
@joelpaulkoch it's the interpolation that causes the difference; I need to investigate more to figure out specifically why.
FTR, the primary part of the difference in the interpolation was a bug: https://github.com/elixir-nx/axon/pull/554. There are still at least two other differences in the implementation (anti-aliasing and scaling); I will keep looking into this.
Regarding interpolation, I added anti-aliasing to Axon (https://github.com/elixir-nx/axon/pull/555), which brings the numbers closer to PyTorch. They are still far from an exact match, because the anti-aliasing behaves slightly differently and there is a small extra scaling in the interpolation call. That said, I think we are fine at this point. The model is trained for a specific size, and the interpolation only kicks in at inference when using a different input size, so my understanding is that as long as the interpolation does its job we should expect reasonable output. In hf/transformers, ViT interpolation is implemented in both PyTorch and TensorFlow, and the two are definitely not identical (our interpolation should basically match the TensorFlow one, though).
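For intuition on why two resize implementations can disagree even before anti-aliasing enters the picture: the mapping from an output pixel index to a source coordinate already differs between common conventions. A minimal pure-Python sketch (illustrative only, not the Axon or PyTorch code):

```python
def src_coord_half_pixel(dst_index, scale):
    # "Half-pixel" convention (pixels are cell centers), used e.g. by
    # torch.nn.functional.interpolate with align_corners=False.
    return (dst_index + 0.5) * scale - 0.5

def src_coord_asymmetric(dst_index, scale):
    # "Asymmetric" convention: origin-aligned, no center offset.
    return dst_index * scale

# Downscaling a 7-pixel axis to 4 pixels: scale = 7 / 4
scale = 7 / 4
for d in range(4):
    print(d, src_coord_half_pixel(d, scale), src_coord_asymmetric(d, scale))
```

Small offsets like these, combined with different anti-aliasing filters, are enough to keep two implementations from matching bit-for-bit even though both are correct resizes.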
> For the interpolation we need the actual input size. I've hard-coded the input size to 224 and retrieved it using `Axon.get_inputs`. Is there a better way to do it?
I changed the input height/width to unspecified (`{nil, nil, nil, spec.num_channels}`). When building the model we don't know the size (`Axon.get_inputs(pixel_values)["pixel_values"]`), but in this case we can make all of the interpolation an Axon layer, so that we can read the input tensor shape once the model is compiled with a concrete input size :)
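The deferred-shape idea can be sketched language-agnostically: instead of baking a fixed size into the graph at build time, the layer inspects the concrete shape of the tensor it actually receives. A toy Python sketch with hypothetical names (not the Axon API):

```python
def make_grid_size_layer(patch_size):
    # Built while the spatial dims are still unknown; the shape is
    # only inspected when the layer runs on a concrete input.
    def layer(shape):
        _batch, height, width, _channels = shape
        return (height // patch_size, width // patch_size)
    return layer

grid_of = make_grid_size_layer(14)  # DINOv2-base uses patch size 14
print(grid_of((1, 224, 224, 3)))  # (16, 16)
print(grid_of((1, 518, 518, 3)))  # (37, 37)
```

The closure stands in for an Axon layer: the patch size is fixed when the model is built, while the height/width are read per compilation.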
This is great, thanks!
> so I'm wondering whether it would make sense to introduce a `block_impl` parameter to `Bumblebee.Layers.Transformer.blocks`
I extended `:block_type` to allow for a function. I was also considering supporting "hooks" between each block layer, but that's more indirect and less flexible, so I went with the custom function. The function receives the various layers as anonymous functions and can apply them in any order, with any custom layers on top. This feels a bit coupled, but FWIW it's all internal anyway.
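To sketch the shape of such a custom block function (a Python stand-in with scalar "tensors" and hypothetical names; the real code composes Axon layers): the builder receives the standard sub-layers as functions and can wrap each residual branch in an extra scaling layer, as DINOv2's LayerScale does.

```python
def default_block(hidden, attention, ffn):
    # standard residual transformer block
    hidden = hidden + attention(hidden)
    return hidden + ffn(hidden)

def scaled_block(hidden, attention, ffn, scale1, scale2):
    # same sub-layers, but each branch output is scaled first
    hidden = hidden + scale1 * attention(hidden)
    return hidden + scale2 * ffn(hidden)

# toy usage: "attention" adds 1, "ffn" doubles
out = scaled_block(1.0, lambda x: x + 1.0, lambda x: 2.0 * x, 0.5, 0.5)
print(out)  # 4.0
```

The point is that both variants consume the same sub-layer functions, so a single `blocks` implementation can hand them to whichever block function the model supplies.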
> In transformers they have another option to specify `out_indices` instead which I did not implement.
I actually changed it so that we only use the indices. The stage names are not really any more meaningful than indices, since they are always "stem", "stage1", "stage2", ..., "stageN". By using the indices we don't need to worry about generating/loading the stage names, so I think it's a win.
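The index-based selection is straightforward to sketch (hypothetical names, not the actual Bumblebee code): given the per-stage hidden states, the configured indices pick which feature maps end up in the output.

```python
def select_feature_maps(hidden_states, out_indices):
    # hidden_states: list of per-stage outputs (index 0 = stem)
    return [hidden_states[i] for i in out_indices]

stages = ["stem_out", "stage1_out", "stage2_out", "stage3_out"]
print(select_feature_maps(stages, [1, 3]))  # ['stage1_out', 'stage3_out']
```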
I also removed the option to reshape feature maps, in favour of always doing it. Reshaping to the flat shape is easy to do if the caller wants that (and the caller would be a wrapping model, rather than the end user, anyway).
This PR is ready to ship; I will wait for https://github.com/elixir-nx/axon/pull/555 and then merge :) Thanks @joelpaulkoch :)
Yeah, sounds reasonable. Thank you very much for getting this into proper shape.
This is the current state of my work on DINOv2.
`facebook/dinov2-base` uses `BitImageProcessor`, so I've copied `VitFeaturizer` -> `BitFeaturizer` and made the following changes: `rescale_factor` (and removed `NxImage.to_continuous` in `process_batch`).

For the model itself I've copied `Vit` to `DinoV2` and basically changed three blocks:

1. **Interpolation of positional encodings.** The pre-trained positional encodings must be interpolated to apply them to other image resolutions (`interpolate_position_encoding`). For the interpolation we need the actual input size. I've hard-coded the input size to 224 and retrieved it using `Axon.get_inputs`. Is there a better way to do it? The current implementation of the interpolation is not exactly the same as in the transformers library, so this probably introduces a difference in the calculation. Does `Axon.resize` return exactly the same output for the same input as `torch.nn.functional.interpolate`?

2. **Encoder blocks.** In DINOv2 the FFN is either MLP or SwiGLU, depending on the configuration. I could pass the corresponding function to the blocks. I still copied `blocks`, `block`, and `block_impl` from `Bumblebee.Layers.Transformer`, because I needed two additional scaling layers in `block_impl`. This brings quite some duplicated code into the DINOv2 implementation for a small change, so I'm wondering whether it would make sense to introduce a `block_impl` parameter to `Bumblebee.Layers.Transformer.blocks`.

3. **Mapping the encoder output depending on the architecture.** For `:base`, in comparison to `Vit`, the pooled output is simply the first token on axis 1. For `:backbone`, you can pass a list of `stage_names` and specify which of them you want as `output_features` in the configuration; the corresponding hidden states are then included in the output. In `transformers` there is another option to specify `out_indices` instead, which I did not implement. For the `:for_image_classification` architecture, there is a classification head on top of the class token and the mean of the patch embeddings.

I've tried to follow the naming conventions, but I may have missed them in some places. Parts of the documentation are still just copy/paste from `transformers`. At the moment the tests are configured to run the model from `facebook/dinov2-base` and won't pass.
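As a small illustration of the output mapping described above (pure Python on nested lists standing in for tensors; not the actual implementation): the `:base` pooled output is just the class token, i.e. the first element along the sequence axis, while the classification head sees the class token concatenated with the mean of the patch tokens.

```python
def pooled_output(hidden_state):
    # hidden_state: [batch][seq][hidden]; take token 0 (class token)
    return [seq[0] for seq in hidden_state]

def classifier_input(hidden_state):
    # class token concatenated with the mean of the patch tokens
    out = []
    for seq in hidden_state:
        cls, patches = seq[0], seq[1:]
        mean = [sum(col) / len(patches) for col in zip(*patches)]
        out.append(cls + mean)
    return out

h = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]  # batch=1, seq=3, hidden=2
print(pooled_output(h))     # [[1.0, 2.0]]
print(classifier_input(h))  # [[1.0, 2.0, 4.0, 5.0]]
```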