Hello @Ngheissari, in this case we start from a pretrained, general-purpose CLIP checkpoint, and all of its weights are reused. If you check the evaluation results table here, you'll see that non-tuned CLIP already gives ~57% accuracy for zero-shot classification, which is much better than random; the improvement the fine-tuned model shows over that baseline can be considered the benefit of fine-tuning CLIP.

The pretraining and fine-tuning objectives are the same: to align the image and text embeddings produced by the image and text encoders respectively. The difference is that CLIP is pretrained on a large multi-domain dataset, while the fine-tuning is done on a small remote sensing image dataset. We can then formulate the classification task as an image-text matching task for zero-shot classification (see the details in the README evaluation section and the evaluation script). This way we don't have to add an extra classification layer on top of the encoders.

Experimenting with a partially frozen model might be interesting as well, but it would probably only show a benefit (if any) with a larger CLIP model (by analogy with the results reported here). This will go on our "further research" list. Comparing the zero-shot classification accuracy with the accuracy obtained when the CLIP image encoder is used as a feature extractor or as a backbone for fine-tuning would be another future TODO.
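To make the shared objective a bit more concrete, here is a minimal sketch of a symmetric, CLIP-style contrastive loss that aligns image and text embeddings. This is only an illustration of the general idea, not the actual training code in this repo; the function name and temperature value are placeholders.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize both sets of embeddings so the dot product is cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # (batch, batch) similarity matrix between every image and every caption.
    logits = image_embeds @ text_embeds.t() / temperature
    # The matching image-caption pair sits on the diagonal, so the target for row i is i.
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy over the image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```

And here is a rough sketch of how the image-text matching formulation becomes zero-shot classification, using the Hugging Face `transformers` CLIP API rather than this repo's evaluation script. The checkpoint name, prompt template, class names, and image path below are placeholders; the fine-tuned remote-sensing checkpoint could be swapped in to compare against the general-purpose baseline.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Placeholder checkpoint: the general-purpose CLIP baseline.
checkpoint = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)

# Class names become text prompts; no classification layer is added on top of the encoders.
class_names = ["airport", "beach", "forest", "harbor"]
prompts = [f"an aerial photograph of a {name}" for name in class_names]

image = Image.open("example.jpg")  # placeholder image path
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled similarities between the image embedding and
# each prompt embedding; the best-matching prompt gives the predicted class.
probs = outputs.logits_per_image.softmax(dim=-1)
print(class_names[probs.argmax().item()])
```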
Let us know if this answers your questions, and feel free to add any further questions or comments.
Hi,
For fine-tuning I often remove the top layer, add a small model on top of the base model, freeze the base model, train the top model, and then unfreeze the base model and retrain the whole thing (roughly the two-stage pattern sketched below).
That way, the large gradients from an untrained model won't mess up the pretrained model. Is that what is done here? How does the fine-tuning procedure work? I cannot see any additional layer or any freezing/unfreezing... It seems that the whole thing is trained from scratch?
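For reference, here is a minimal sketch of the two-stage recipe described above (freeze the base, train a new head, then unfreeze), assuming a pretrained CLIP image encoder and a hypothetical 10-class linear head. Per the answer above, this is not what the project does; there, the whole model is fine-tuned end-to-end with the contrastive objective and no extra head.

```python
import torch
from transformers import CLIPModel

# Hypothetical setup: a pretrained CLIP model plus a freshly initialized
# 10-class linear head on top of the image features.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
head = torch.nn.Linear(model.config.projection_dim, 10)

# Stage 1: freeze the pretrained weights so large gradients from the untrained
# head cannot disturb them; train only the head.
for param in model.parameters():
    param.requires_grad = False
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
# ... training loop over head(model.get_image_features(pixel_values=...)) ...

# Stage 2: unfreeze everything and fine-tune end-to-end at a lower learning rate.
for param in model.parameters():
    param.requires_grad = True
optimizer = torch.optim.AdamW(list(model.parameters()) + list(head.parameters()), lr=1e-5)
# ... continue training the whole model ...
```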