We use cross-entropy loss, just like when training a regular decoder-only model. This works because, at the end of the day, it is still next-token prediction in an autoregressive manner. Once we abstract away the vision encoder + projector + decoder and treat the whole pipeline as "the model" in the training loop, this makes sense. Also read the text corresponding to Stage 1 on page 5 of the LLaVA paper: https://arxiv.org/pdf/2304.08485
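To illustrate, here is a minimal sketch of that loss computation, assuming a PyTorch-style setup. The names `vision_encoder`, `projector`, `decoder`, and `embed_tokens` are placeholders, not this repo's actual API:

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions excluded from the loss (here: image tokens)

def vlm_loss(vision_encoder, projector, decoder, embed_tokens, image, input_ids):
    """Plain next-token cross entropy, treating
    (vision encoder + projector + decoder) as one autoregressive model."""
    img_feats  = vision_encoder(image)       # (B, N_img, D_vis)
    img_embeds = projector(img_feats)        # (B, N_img, D_model)
    txt_embeds = embed_tokens(input_ids)     # (B, N_txt, D_model)

    # The decoder sees [image tokens, text tokens] as one causal sequence.
    inputs = torch.cat([img_embeds, txt_embeds], dim=1)
    logits = decoder(inputs)                 # (B, N_img + N_txt, vocab)

    # No loss on image positions; next-token targets on the text positions.
    B, n_img, _ = img_embeds.shape
    img_labels = torch.full((B, n_img), IGNORE_INDEX, dtype=torch.long,
                            device=input_ids.device)
    labels = torch.cat([img_labels, input_ids], dim=1)

    # Standard causal-LM shift: the logit at position t predicts the token at t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=IGNORE_INDEX,
    )
```

So the loss is ordinary language-modeling cross entropy over the text tokens; the image tokens only condition the prediction.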
So, this repo helped me a lot. I even read your blog, including this part:
Get a pretrained vision encoder from SigLIP or CLIP (both come in different sizes). Freeze its weights (i.e. don't update them during the backward pass in training).
Get a pretrained decoder-only language model, e.g. anything from TinyLLaMA or Phi-2 up to Llama 3 (or even much bigger, as with GPT-4 and Grok 1.5). Freeze its weights.
Implement a projection module and train a VLM much like what we have here, but only updating the weights of this projection module. This would effectively be the pretraining phase.
We only pre-train the projector in this case, right? But my question is: do we have a loss function when back-propagating through the projector or not?
Is it a contrastive loss, or what?
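To make the question concrete, here is a minimal sketch of how that projector-only pretraining stage could look, assuming PyTorch; all module names and dimensions are placeholders, not this repo's actual code:

```python
import torch
import torch.nn as nn

# Placeholder dimensions -- illustrative only.
d_vision, d_model = 1152, 2048  # e.g. a SigLIP feature dim -> decoder hidden dim

# Pretrained, frozen components (stand-ins for SigLIP/CLIP and the language model).
vision_encoder = nn.Identity()   # placeholder for the real vision tower
decoder        = nn.Identity()   # placeholder for the real decoder-only LM
for p in vision_encoder.parameters():
    p.requires_grad = False
for p in decoder.parameters():
    p.requires_grad = False

# The only trainable part: a small projection from vision space to decoder space.
# (LLaVA v1 used a single linear layer; LLaVA-1.5 uses a two-layer MLP.)
projector = nn.Sequential(
    nn.Linear(d_vision, d_model),
    nn.GELU(),
    nn.Linear(d_model, d_model),
)

# The optimizer only sees the projector's parameters, so only they get updated;
# gradients still flow *through* the frozen decoder back into the projector.
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)
```

In this setup the loss itself is still the next-token cross entropy from the snippet above, not a contrastive objective; freezing simply restricts which parameters the optimizer updates.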