We use cross-entropy loss, just like when training a regular decoder-only model. This works because, at the end of the day, it is still next-token prediction in an autoregressive manner. Once we abstract away the vision encoder + projector + decoder and treat the whole pipeline as "the model" in the training loop, this makes sense. Also read the text corresponding to Stage 1 on page 5 of the LLaVA paper: https://arxiv.org/pdf/2304.08485
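To illustrate, here is a minimal sketch of that loss computation, assuming a PyTorch-style setup. The names `vision_encoder`, `projector`, `decoder`, and `embed_tokens` are placeholders, not this repo's actual API:

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions excluded from the loss (here: image tokens)

def vlm_loss(vision_encoder, projector, decoder, embed_tokens, image, input_ids):
    """Plain next-token cross entropy, treating
    (vision encoder + projector + decoder) as one autoregressive model."""
    img_feats  = vision_encoder(image)       # (B, N_img, D_vis)
    img_embeds = projector(img_feats)        # (B, N_img, D_model)
    txt_embeds = embed_tokens(input_ids)     # (B, N_txt, D_model)

    # The decoder sees [image tokens, text tokens] as one causal sequence.
    inputs = torch.cat([img_embeds, txt_embeds], dim=1)
    logits = decoder(inputs)                 # (B, N_img + N_txt, vocab)

    # No loss on image positions; next-token targets on the text positions.
    B, n_img, _ = img_embeds.shape
    img_labels = torch.full((B, n_img), IGNORE_INDEX, dtype=torch.long,
                            device=input_ids.device)
    labels = torch.cat([img_labels, input_ids], dim=1)

    # Standard causal-LM shift: the logit at position t predicts the token at t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=IGNORE_INDEX,
    )
```

So the loss is ordinary language-modeling cross entropy over the text tokens; the image tokens only condition the prediction.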
So, this repo helped me a lot. I even read your blog, including this part:
Get a pretrained vision encoder from SigLIP or CLIP (both come in different sizes). Freeze its weights (i.e. don't update them during the backward pass in training).
Get a pretrained decoder-only language model, e.g. anything from TinyLLaMA or Phi-2 up to Llama 3 (or even much bigger, as with GPT-4 and Grok 1.5). Freeze its weights.
Implement a projection module and train a VLM much like what we have here, but only updating the weights of this projection module. This would effectively be the pretraining phase.
We only pre-train the projector in this case, right? But my question is: do we have a loss function when back-propagating through the projector or not?
Is it a contrastive loss, or what?
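To make the question concrete, here is a minimal sketch of how that projector-only pretraining stage could look, assuming PyTorch; all module names and dimensions are placeholders, not this repo's actual code:

```python
import torch
import torch.nn as nn

# Placeholder dimensions -- illustrative only.
d_vision, d_model = 1152, 2048  # e.g. a SigLIP feature dim -> decoder hidden dim

# Pretrained, frozen components (stand-ins for SigLIP/CLIP and the language model).
vision_encoder = nn.Identity()   # placeholder for the real vision tower
decoder        = nn.Identity()   # placeholder for the real decoder-only LM
for p in vision_encoder.parameters():
    p.requires_grad = False
for p in decoder.parameters():
    p.requires_grad = False

# The only trainable part: a small projection from vision space to decoder space.
# (LLaVA v1 used a single linear layer; LLaVA-1.5 uses a two-layer MLP.)
projector = nn.Sequential(
    nn.Linear(d_vision, d_model),
    nn.GELU(),
    nn.Linear(d_model, d_model),
)

# The optimizer only sees the projector's parameters, so only they get updated;
# gradients still flow *through* the frozen decoder back into the projector.
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)
```

In this setup the loss itself is still the next-token cross entropy from the snippet above, not a contrastive objective; freezing simply restricts which parameters the optimizer updates.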