Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

Input tokens for the generate function #68

Closed jblamare closed 11 months ago

jblamare commented 12 months ago

Hello and thank you for your work on this repo!

I had a question about the generate function as implemented in the MetaModel. In particular, on this line where the logits are computed:

```python
logits = self.llma.forward_inference(tokens[:, prev_pos:cur_pos], prev_pos, images if prev_pos == 0 else None)
```
Assuming I have a single prompt with 10 tokens (start_pos = min_prompt_size = 10), the values of prev_pos and cur_pos over the first few iterations will be:

| prev_pos | cur_pos |
|----------|---------|
| 0        | 10      |
| 10       | 11      |
| 11       | 12      |

Meaning that apart from the first step, tokens[:, prev_pos:cur_pos] will always be a single token. Doesn't that mean the model does not have access to the full prompt and history when generating a new token? Shouldn't it be tokens[:, :cur_pos]?
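For concreteness, this is roughly the loop structure I have in mind (a heavily simplified sketch, not the actual generate code; I'm leaving out sampling, the images argument, and stop conditions, and assuming forward_inference returns logits for the last position only):

```python
import torch

def greedy_decode(model, tokens: torch.Tensor, start_pos: int, total_len: int) -> torch.Tensor:
    # Toy stand-in for the generation loop: greedy decoding only.
    prev_pos = 0
    for cur_pos in range(start_pos, total_len):
        # Only the tokens added since the previous step are passed to the model:
        # the whole prompt on the first iteration, then one new token at a time.
        logits = model.forward_inference(tokens[:, prev_pos:cur_pos], prev_pos)
        next_token = torch.argmax(logits, dim=-1)  # assumed shape (batch, vocab)
        tokens[:, cur_pos] = next_token
        prev_pos = cur_pos
    return tokens
```

So on every iteration after the first, the model only ever receives the single most recent token as input.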

I think I'm probably missing something since the generation actually works fine, but I can't see what.

ChrisLiu6 commented 11 months ago

Hi, a cache mechanism exists in the implementation of every model in accessory/model/LLM. For example, in llama.py, you can find the cache here.

Briefly, each attention layer caches the keys and values of past tokens. When new tokens arrive together with the start_pos argument, their keys and values are written into the cache starting at start_pos, and the new tokens attend to all cached keys and values before start_pos as well as to themselves.
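Roughly, the mechanism looks like the sketch below (a simplified single-head illustration, not the actual code in llama.py, which is multi-head and also handles things like rotary embeddings and causal masking for the prefill step):

```python
import torch
import torch.nn as nn

class CachedAttention(nn.Module):
    """Minimal single-head attention with a KV cache, for illustration only."""

    def __init__(self, dim: int, max_batch: int, max_seq_len: int):
        super().__init__()
        self.dim = dim
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        # Buffers holding the keys/values of every position processed so far.
        self.register_buffer("cache_k", torch.zeros(max_batch, max_seq_len, dim))
        self.register_buffer("cache_v", torch.zeros(max_batch, max_seq_len, dim))

    def forward(self, x: torch.Tensor, start_pos: int) -> torch.Tensor:
        bsz, seqlen, _ = x.shape
        xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)

        # Write the new keys/values into the cache at [start_pos, start_pos + seqlen).
        self.cache_k[:bsz, start_pos:start_pos + seqlen] = xk
        self.cache_v[:bsz, start_pos:start_pos + seqlen] = xv

        # Attend over everything cached so far: the positions before start_pos
        # plus the tokens just written. A single new token therefore still sees
        # the full prompt and history. (Causal masking for the multi-token
        # prefill step is omitted for brevity.)
        keys = self.cache_k[:bsz, :start_pos + seqlen]
        values = self.cache_v[:bsz, :start_pos + seqlen]

        scores = torch.matmul(xq, keys.transpose(1, 2)) / (self.dim ** 0.5)
        return torch.matmul(torch.softmax(scores, dim=-1), values)
```

Usage mirrors the prev_pos/cur_pos schedule from your table (inference only, so under torch.no_grad):

```python
attn = CachedAttention(dim=32, max_batch=1, max_seq_len=64).eval()
with torch.no_grad():
    out = attn(torch.randn(1, 10, 32), start_pos=0)   # prefill: the 10-token prompt
    out = attn(torch.randn(1, 1, 32), start_pos=10)   # decode: one new token, full context via the cache
```

So even though forward_inference only receives tokens[:, prev_pos:cur_pos], the earlier context is already sitting in the per-layer caches.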

jblamare commented 11 months ago

Ah, that makes more sense now, thank you for your answer!