Hi Avi, thanks for your tutorial on building a VLM from scratch. It was fascinating and helped me learn a lot about VLMs. One question about this part of the decoder language model's generate() function:
current_output = torch.cat((current_output, idx_next_emb), dim=1)
(from https://github.com/AviSoori1x/seemore/blob/main/seemore_from_Scratch.ipynb)
On this line, "current_output" gets concatenated with the embedding of the next generated token. That "current_output" is then passed through the whole network again, including the positional embedding addition. Doesn't that mean "current_output" has the positional embeddings added to it multiple times? Is that what you intended?
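To make the concern concrete, here is a tiny standalone illustration (made-up shapes, not the notebook's actual code) of how feeding the post-positional-embedding output back through a "tok + pos" step stacks the positional term:

```python
import torch

# Hypothetical illustration: if the tensor that already contains positional
# embeddings is fed back through the "add positional embedding" step on the
# next generation iteration, the positional term accumulates.
pos_emb = torch.randn(1, 4, 8)   # (batch, seq_len, d_model), made-up sizes
x = torch.randn(1, 4, 8)         # pretend these are raw token embeddings

step1 = x + pos_emb              # first forward pass
step2 = step1 + pos_emb          # feeding the output back in re-adds pos_emb
print(torch.allclose(step2, x + 2 * pos_emb))  # True: positions counted twice
```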
I would think we need to concatenate "idx_next_emb" to a "current" tensor that we keep track of BEFORE the positional embedding is added. That way, positional embedding additions don't accumulate from one generation iteration to the next (see the sketch below). What do you think?
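Something along these lines is what I had in mind. This is just a minimal sketch with assumed names (TinyDecoder, tok_emb, pos_emb, lm_head, forward_from_embeddings are my placeholders, not the notebook's actual attributes):

```python
import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    def __init__(self, vocab_size=100, d_model=8, max_len=32):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward_from_embeddings(self, emb):
        # Positional embeddings are added exactly once, here, per forward pass.
        T = emb.size(1)
        pos = self.pos_emb(torch.arange(T, device=emb.device))
        return self.lm_head(emb + pos)

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        # "current" holds the raw token embeddings BEFORE positions are added.
        current = self.tok_emb(idx)
        for _ in range(max_new_tokens):
            logits = self.forward_from_embeddings(current)
            idx_next = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
            idx_next_emb = self.tok_emb(idx_next)
            # Concatenate the raw embedding, so positions never accumulate.
            current = torch.cat((current, idx_next_emb), dim=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx

model = TinyDecoder()
out = model.generate(torch.tensor([[1, 2, 3]]), max_new_tokens=5)
print(out.shape)  # torch.Size([1, 8])
```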