Hi Avi, thanks for your tutorial on building a VLM from scratch. It was fascinating and helped me learn a lot about VLMs. One question about this part of the decoder language model's generate() function:
current_output = torch.cat((current_output, idx_next_emb), dim=1)
(from https://github.com/AviSoori1x/seemore/blob/main/seemore_from_Scratch.ipynb)
On this line, "current_output" gets concatenated with the embedding of the next generated token. That "current_output" is then passed through the whole network again, including the positional embedding addition. Doesn't that mean "current_output" has the positional embeddings added to it multiple times? Is that what you intended?
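To make the concern concrete, here is a tiny standalone illustration (made-up shapes, not the notebook's actual code) of how feeding the post-positional-embedding output back through a "tok + pos" step stacks the positional term:

```python
import torch

# Hypothetical illustration: if the tensor that already contains positional
# embeddings is fed back through the "add positional embedding" step on the
# next generation iteration, the positional term accumulates.
pos_emb = torch.randn(1, 4, 8)   # (batch, seq_len, d_model), made-up sizes
x = torch.randn(1, 4, 8)         # pretend these are raw token embeddings

step1 = x + pos_emb              # first forward pass
step2 = step1 + pos_emb          # feeding the output back in re-adds pos_emb
print(torch.allclose(step2, x + 2 * pos_emb))  # True: positions counted twice
```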
I would think we need to concatenate "idx_next_emb" to a "current" tensor that we keep track of BEFORE the positional embedding is added. That way, positional embedding additions don't accumulate from one generation iteration to the next (see the sketch below). What do you think?
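Something along these lines is what I had in mind. This is just a minimal sketch with assumed names (TinyDecoder, tok_emb, pos_emb, lm_head, forward_from_embeddings are my placeholders, not the notebook's actual attributes):

```python
import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    def __init__(self, vocab_size=100, d_model=8, max_len=32):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward_from_embeddings(self, emb):
        # Positional embeddings are added exactly once, here, per forward pass.
        T = emb.size(1)
        pos = self.pos_emb(torch.arange(T, device=emb.device))
        return self.lm_head(emb + pos)

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        # "current" holds the raw token embeddings BEFORE positions are added.
        current = self.tok_emb(idx)
        for _ in range(max_new_tokens):
            logits = self.forward_from_embeddings(current)
            idx_next = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
            idx_next_emb = self.tok_emb(idx_next)
            # Concatenate the raw embedding, so positions never accumulate.
            current = torch.cat((current, idx_next_emb), dim=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx

model = TinyDecoder()
out = model.generate(torch.tensor([[1, 2, 3]]), max_new_tokens=5)
print(out.shape)  # torch.Size([1, 8])
```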