Hello @Ngheissari, in this case we start from a pretrained, general-purpose CLIP checkpoint, and all of its weights are reused. If you check the evaluation results table here, you'll see that non-tuned CLIP already gives ~57% accuracy for zero-shot classification, which is much better than random; the improvement the fine-tuned model shows over that baseline can be considered the benefit of fine-tuning CLIP.

The pretraining and fine-tuning objectives are the same: to align the image and text embeddings produced by the image and text encoders respectively. The difference is that CLIP is pretrained on a large multi-domain dataset, while the fine-tuning is done on a small remote sensing image dataset. We can then formulate the classification task as an image-text matching task for zero-shot classification (see the details in the README evaluation section and the evaluation script). This way we don't have to add an extra classification layer on top of the encoders.

Experimenting with a partially frozen model might be interesting as well, but it would probably only show a benefit (if any) with a larger CLIP model (by analogy with the results reported here). This will go on our "further research" list. Comparing the zero-shot classification accuracy with the accuracy obtained when the CLIP image encoder is used as a feature extractor or as a backbone for fine-tuning would be another future TODO.
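To make the shared objective a bit more concrete, here is a minimal sketch of a symmetric, CLIP-style contrastive loss that aligns image and text embeddings. This is only an illustration of the general idea, not the actual training code in this repo; the function name and temperature value are placeholders.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize both sets of embeddings so the dot product is cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # (batch, batch) similarity matrix between every image and every caption.
    logits = image_embeds @ text_embeds.t() / temperature
    # The matching image-caption pair sits on the diagonal, so the target for row i is i.
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy over the image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```

And here is a rough sketch of how the image-text matching formulation becomes zero-shot classification, using the Hugging Face `transformers` CLIP API rather than this repo's evaluation script. The checkpoint name, prompt template, class names, and image path below are placeholders; the fine-tuned remote-sensing checkpoint could be swapped in to compare against the general-purpose baseline.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Placeholder checkpoint: the general-purpose CLIP baseline.
checkpoint = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)

# Class names become text prompts; no classification layer is added on top of the encoders.
class_names = ["airport", "beach", "forest", "harbor"]
prompts = [f"an aerial photograph of a {name}" for name in class_names]

image = Image.open("example.jpg")  # placeholder image path
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled similarities between the image embedding and
# each prompt embedding; the best-matching prompt gives the predicted class.
probs = outputs.logits_per_image.softmax(dim=-1)
print(class_names[probs.argmax().item()])
```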
Let us know if this answers your questions, and feel free to add any further questions or comments.
Hi,
For fine-tuning I often remove the top layer, add a small model on top of the base model, freeze the base model, train the top model, and then unfreeze the base model and retrain the whole thing (roughly the two-stage pattern sketched below).
That way, the large gradients from an untrained model won't mess up the pretrained model. Is that what is done here? How does the fine-tuning procedure work? I cannot see any additional layer or any freezing/unfreezing... It seems that the whole thing is trained from scratch?
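For reference, here is a minimal sketch of the two-stage recipe described above (freeze the base, train a new head, then unfreeze), assuming a pretrained CLIP image encoder and a hypothetical 10-class linear head. Per the answer above, this is not what the project does; there, the whole model is fine-tuned end-to-end with the contrastive objective and no extra head.

```python
import torch
from transformers import CLIPModel

# Hypothetical setup: a pretrained CLIP model plus a freshly initialized
# 10-class linear head on top of the image features.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
head = torch.nn.Linear(model.config.projection_dim, 10)

# Stage 1: freeze the pretrained weights so large gradients from the untrained
# head cannot disturb them; train only the head.
for param in model.parameters():
    param.requires_grad = False
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
# ... training loop over head(model.get_image_features(pixel_values=...)) ...

# Stage 2: unfreeze everything and fine-tune end-to-end at a lower learning rate.
for param in model.parameters():
    param.requires_grad = True
optimizer = torch.optim.AdamW(list(model.parameters()) + list(head.parameters()), lr=1e-5)
# ... continue training the whole model ...
```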