UCSC-VLAA / CLIPA

[NeurIPS 2023] This repository includes the official implementation of our paper "An Inverse Scaling Law for CLIP Training"
Apache License 2.0

How do you manage the positional embeddings for different image resolutions? #4

Closed fabiozappo closed 1 year ago

fabiozappo commented 1 year ago

Hello,

I recently read the paper and was intrigued by the details of the consecutive training phases, in which the image resolution is progressively increased. Based on my understanding, the model employs a Vision Transformer architecture with a fixed positional encoding scheme. For instance, for a ViT-H/14 image backbone:

Step 0: During the pre-training phase, input images of size 84x84 are associated with a positional encoding of shape [36, d].
Step 1: For the full-resolution fine-tuning, input images of size 224x224 are used, along with positional encodings of shape [256, d].
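Just to spell out the patch-count arithmetic I am assuming here (14x14 patches, ignoring any class token):

```python
# Patch-count arithmetic behind the shapes above (assuming 14x14 patches, no class token)
patch = 14
tokens_step0 = (84 // patch) ** 2    # 6 * 6   = 36  -> pos embed shape [36, d]
tokens_step1 = (224 // patch) ** 2   # 16 * 16 = 256 -> pos embed shape [256, d]
print(tokens_step0, tokens_step1)    # 36 256
```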

My question arises from the transition between these training steps, specifically when attempting to utilize a saved checkpoint from Step 0 for fine-tuning in Step 1, given the differing dimensions of the positional encodings.

How can a checkpoint from Step 0, which contains positional encodings of shape [36, d], be effectively employed to continue the fine-tuning process in Step 1, where the expected positional encoding shape is [256, d] due to the larger input image resolution? Are there any modifications or strategies that need to be applied to the checkpoint or the model architecture to ensure compatibility between the checkpoint and the new training stage?

I'm looking forward to your insights on this matter. Thank you for your assistance!

P.S. At which image resolution are the models evaluated?

xhl-video commented 1 year ago

Thank you for your detailed observation and insightful questions!

Positional Encoding for Image Backbone (Vision Transformer):

Step 0: You're right, for the 84x84 input images during pre-training we use positional encodings of shape [36, d].
Step 1: For the full-resolution fine-tuning with 224x224 images, the positional encodings are of shape [256, d].

Transitioning between these steps requires no modification to the checkpoint: the Vision Transformer uses fixed sinusoidal positional embeddings, which are simply regenerated for the new grid size. I did experiment with interpolating the fixed positional embeddings from the 6x6 (36-token) grid to the 16x16 (256-token) grid, but the performance difference was marginal, apart from a small loss at the very beginning of fine-tuning.
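For illustration, here is a minimal sketch of how fixed 2D sin-cos positional embeddings can be generated for an arbitrary grid size (this follows the common MAE-style recipe, not necessarily our exact code; `sincos_1d`/`sincos_2d` are hypothetical helper names, and `embed_dim=1280` with grid sizes 6/16 just mirrors the ViT-H/14 shapes discussed above):

```python
import numpy as np

def sincos_1d(embed_dim, positions):
    """1D sin-cos embedding of shape [len(positions), embed_dim]; embed_dim must be even."""
    omega = np.arange(embed_dim // 2, dtype=np.float64) / (embed_dim / 2.0)
    omega = 1.0 / 10000 ** omega                        # (embed_dim/2,)
    angles = np.einsum('m,d->md', positions, omega)     # (M, embed_dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def sincos_2d(embed_dim, grid_size):
    """2D sin-cos embedding of shape [grid_size**2, embed_dim]; embed_dim % 4 == 0."""
    coords = np.arange(grid_size, dtype=np.float64)
    ww, hh = np.meshgrid(coords, coords)                # column / row indices, each (grid, grid)
    emb_h = sincos_1d(embed_dim // 2, hh.reshape(-1))   # encode row positions
    emb_w = sincos_1d(embed_dim // 2, ww.reshape(-1))   # encode column positions
    return np.concatenate([emb_h, emb_w], axis=1)

# The same function covers both stages, so nothing in the checkpoint needs resizing:
pos_step0 = sincos_2d(1280, 6)    # 84x84 input,  patch 14 -> (36, 1280)
pos_step1 = sincos_2d(1280, 16)   # 224x224 input, patch 14 -> (256, 1280)
```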

Positional Encoding for Text Encoder:

Our text encoder uses learnable positional embeddings. During the transition to Step 1, we interpolate the positional embeddings from 8 tokens to 32 tokens by default. Randomly re-initializing these embeddings at Step 1 also yielded only marginal differences.
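A rough sketch of that interpolation step when loading a Step-0 checkpoint (`resize_text_pos_embed` is a hypothetical helper, and the 512-dim random tensor only stands in for the checkpoint weights):

```python
import torch
import torch.nn.functional as F

def resize_text_pos_embed(pos_embed, new_len):
    """Linearly interpolate a learnable text positional embedding.

    pos_embed: tensor of shape [old_len, d] taken from the Step-0 checkpoint.
    Returns a tensor of shape [new_len, d] for the Step-1 model.
    """
    # F.interpolate expects [N, C, L], so treat d as channels and sequence length as L
    pe = pos_embed.t().unsqueeze(0)                                        # [1, d, old_len]
    pe = F.interpolate(pe, size=new_len, mode='linear', align_corners=False)
    return pe.squeeze(0).t()                                               # [new_len, d]

# e.g. grow the 8-token context from pre-training to the default 32-token context
old_pe = torch.randn(8, 512)                 # placeholder for checkpoint weights
new_pe = resize_text_pos_embed(old_pe, 32)   # -> [32, 512]
```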

Evaluation:

Models are evaluated at the same resolution they were trained at; for instance, a model trained at 84x84 is tested at 84x84. The performance metrics we reported for the pre-trained (Step 0) weights are therefore based on testing at those smaller resolutions.

I hope this clears things up. Do let me know if you have any more questions!