AviSoori1x / seemore

From scratch implementation of a vision language model in pure PyTorch
MIT License

current_output gets positional embeddings added to it multiple times in LM generate()? #3

Open thuann2cats opened 3 months ago

thuann2cats commented 3 months ago

Hi Avi, thanks for your tutorial on building a VLM from scratch. It was fascinating and helped me learn a lot about VLMs. I have one question about this part of the decoder language model's generate() function:

current_output = torch.cat((current_output, idx_next_emb), dim=1) https://github.com/AviSoori1x/seemore/blob/main/seemore_from_Scratch.ipynb

On this line, "current_output" gets concatenated with the embedding of the next generated token. Then "current_output" gets passed through the whole network again, including the positional embedding addition. But wouldn't that mean positional embeddings get added to "current_output" multiple times? Is that what you intended?

I would think we need to concatenate "idx_next_emb" onto a separate tensor that we keep track of from BEFORE the positional embeddings are added. That way, we don't accumulate positional embedding additions from one generation iteration to the next. What do you think?
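To make the concern concrete, here is a toy sketch (plain Python with scalar "embeddings" instead of tensors; all names are illustrative and not from the notebook) contrasting the pattern described above with the proposed fix. In the first loop, the sequence that already contains positional terms is fed back in, so earlier positions accumulate their positional embedding once per iteration; in the second, a pre-positional buffer is kept and positions are added only for the current forward pass:

```python
# Toy model: each "embedding" is a single number, and pos_emb(i) is large
# so accumulated positional terms are easy to spot.
def pos_emb(position):
    return 100.0 * (position + 1)

def buggy_generate(token_embs, steps):
    # Mirrors the pattern in question: current_output already contains
    # positional terms, yet each iteration re-adds them to every position.
    current_output = list(token_embs)
    for _ in range(steps):
        # "forward pass": add positional embeddings to the whole sequence
        current_output = [x + pos_emb(i) for i, x in enumerate(current_output)]
        idx_next_emb = 0.0  # pretend the next token's embedding is 0
        current_output = current_output + [idx_next_emb]
    return current_output

def fixed_generate(token_embs, steps):
    # Keep a buffer of raw token embeddings; positions are added only to
    # the model input, never fed back into the buffer.
    current = list(token_embs)
    for _ in range(steps):
        model_input = [x + pos_emb(i) for i, x in enumerate(current)]
        idx_next_emb = 0.0  # next token's embedding (from the model output)
        current = current + [idx_next_emb]
    return current

print(buggy_generate([1.0], steps=2)[0])  # 201.0 — pos_emb(0) added twice
print(fixed_generate([1.0], steps=2)[0])  # 1.0 — raw embedding preserved
```

After two iterations, token 0 in the buggy loop carries pos_emb(0) twice (1 + 100 + 100 = 201), while the fixed loop leaves the stored embedding untouched.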