Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

Input tokens for the generate function #68

Closed jblamare closed 11 months ago

jblamare commented 12 months ago

Hello and thank you for your work on this repo!

I had a question about the generate function as implemented in the MetaModel. In particular, on this line where the logits are computed:

```python
logits = self.llma.forward_inference(tokens[:, prev_pos:cur_pos], prev_pos, images if prev_pos == 0 else None)
```
Assuming I have a single prompt with 10 tokens (start_pos = min_prompt_size = 10), the values of prev_pos and cur_pos over the first few iterations will be:

| prev_pos | cur_pos |
|----------|---------|
| 0        | 10      |
| 10       | 11      |
| 11       | 12      |

Meaning that apart from the first step, tokens[:, prev_pos:cur_pos] will always be a single token. Doesn't that mean the model does not have access to the full prompt and history when generating a new token? Shouldn't it be tokens[:, :cur_pos]?
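For concreteness, this is roughly the loop structure I have in mind (a heavily simplified sketch, not the actual generate code; I'm leaving out sampling, the images argument, and stop conditions, and assuming forward_inference returns logits for the last position only):

```python
import torch

def greedy_decode(model, tokens: torch.Tensor, start_pos: int, total_len: int) -> torch.Tensor:
    # Toy stand-in for the generation loop: greedy decoding only.
    prev_pos = 0
    for cur_pos in range(start_pos, total_len):
        # Only the tokens added since the previous step are passed to the model:
        # the whole prompt on the first iteration, then one new token at a time.
        logits = model.forward_inference(tokens[:, prev_pos:cur_pos], prev_pos)
        next_token = torch.argmax(logits, dim=-1)  # assumed shape (batch, vocab)
        tokens[:, cur_pos] = next_token
        prev_pos = cur_pos
    return tokens
```

So on every iteration after the first, the model only ever receives the single most recent token as input.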

I think I'm probably missing something since the generation actually works fine, but I can't see what.

ChrisLiu6 commented 11 months ago

Hi, a cache mechanism exists in the implementation of every model in accessory/model/LLM. For example, in llama.py, you can find the cache here.

Briefly, each attention layer caches the keys and values of past tokens. When new tokens arrive together with the start_pos argument, their keys and values are written into the cache starting at start_pos, and the new tokens attend to all cached keys and values before start_pos as well as to themselves.
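Roughly, the mechanism looks like the sketch below (a simplified single-head illustration, not the actual code in llama.py, which is multi-head and also handles things like rotary embeddings and causal masking for the prefill step):

```python
import torch
import torch.nn as nn

class CachedAttention(nn.Module):
    """Minimal single-head attention with a KV cache, for illustration only."""

    def __init__(self, dim: int, max_batch: int, max_seq_len: int):
        super().__init__()
        self.dim = dim
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        # Buffers holding the keys/values of every position processed so far.
        self.register_buffer("cache_k", torch.zeros(max_batch, max_seq_len, dim))
        self.register_buffer("cache_v", torch.zeros(max_batch, max_seq_len, dim))

    def forward(self, x: torch.Tensor, start_pos: int) -> torch.Tensor:
        bsz, seqlen, _ = x.shape
        xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)

        # Write the new keys/values into the cache at [start_pos, start_pos + seqlen).
        self.cache_k[:bsz, start_pos:start_pos + seqlen] = xk
        self.cache_v[:bsz, start_pos:start_pos + seqlen] = xv

        # Attend over everything cached so far: the positions before start_pos
        # plus the tokens just written. A single new token therefore still sees
        # the full prompt and history. (Causal masking for the multi-token
        # prefill step is omitted for brevity.)
        keys = self.cache_k[:bsz, :start_pos + seqlen]
        values = self.cache_v[:bsz, :start_pos + seqlen]

        scores = torch.matmul(xq, keys.transpose(1, 2)) / (self.dim ** 0.5)
        return torch.matmul(torch.softmax(scores, dim=-1), values)
```

Usage mirrors the prev_pos/cur_pos schedule from your table (inference only, so under torch.no_grad):

```python
attn = CachedAttention(dim=32, max_batch=1, max_seq_len=64).eval()
with torch.no_grad():
    out = attn(torch.randn(1, 10, 32), start_pos=0)   # prefill: the 10-token prompt
    out = attn(torch.randn(1, 1, 32), start_pos=10)   # decode: one new token, full context via the cache
```

So even though forward_inference only receives tokens[:, prev_pos:cur_pos], the earlier context is already sitting in the per-layer caches.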

jblamare commented 11 months ago

Ah, that makes more sense now, thank you for your answer!