AviSoori1x / seemore

From scratch implementation of a vision language model in pure PyTorch
MIT License

Loss function during pre-training the Projector #1

Closed dame-cell closed 4 months ago

dame-cell commented 4 months ago

So, this repo helped me a lot. I even read your blog, including this part.

In this case we only pre-train the projector, right? But my question is: do we have a loss function when back-propagating through the projector or not?

Is it a contrastive loss, or something else?

AviSoori1x commented 4 months ago

We use cross entropy loss, just like when training a regular decoder-only model. This works because, at the end of the day, it's next-token prediction in an autoregressive manner. So once we abstract away the following: vision encoder + projector + decoder, and consider the entire thing to be the 'model' in the training loop, this makes sense. Also read the text corresponding to Stage 1 on page 5 of the LLaVA paper: https://arxiv.org/pdf/2304.08485
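To make the answer concrete, here is a minimal sketch of that idea in PyTorch: the vision encoder, projector, and decoder are treated as one model, the encoder and decoder are frozen, and plain next-token cross entropy drives gradients back through the frozen decoder into the projector. The `nn.Linear` modules and dimensions are hypothetical stand-ins, not the actual architecture used in seemore.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the real modules (dimensions are made up).
vocab_size, d_vision, d_model = 100, 64, 32

vision_encoder = nn.Linear(d_vision, d_vision)  # frozen during projector pre-training
projector = nn.Linear(d_vision, d_model)        # the ONLY trainable component here
decoder = nn.Linear(d_model, vocab_size)        # frozen decoder / LM head stand-in

# Freeze everything except the projector.
for p in vision_encoder.parameters():
    p.requires_grad = False
for p in decoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()  # ordinary next-token cross entropy, not contrastive

# Dummy batch: image features and the target token ids they should predict.
img_feats = torch.randn(4, d_vision)
targets = torch.randint(0, vocab_size, (4,))

# The whole pipeline acts as one "model" in the training loop.
logits = decoder(projector(vision_encoder(img_feats)))
loss = loss_fn(logits, targets)
loss.backward()   # gradients flow back through the frozen decoder into the projector
optimizer.step()  # only projector weights are updated
```

The key point the answer makes: nothing about the loss changes; because only the projector's parameters are in the optimizer (and the rest are frozen), back-propagating standard cross entropy through the full stack updates just the projector.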