baihuajun24 opened 1 week ago
Hello Eagle Team! I noticed that at https://github.com/SafeAILab/EAGLE/blob/667ba930db7ea0075421f3c7df94ffbc10b93805/eagle/model/modeling_llama_kv.py#L594 you set past_key_value to None in the forward function, compared with the source code at https://github.com/huggingface/transformers/blob/e51d7ac70ab8f3e69d3659226aa838308a668238/src/transformers/models/llama/modeling_llama.py#L324. Could you provide some insight into why you made this change? I am trying to generate responses with code-llama-7b using EAGLE's KVLlamaForCausalLM class, but the results are of much lower quality than those I get with the default AutoModelForCausalLM class. I suspect the KV cache affects the generation.

This modification is due to the use of a pre-allocated KV cache to optimize the efficiency of the base model (this part of the code follows Medusa). In the cat operation at https://github.com/SafeAILab/EAGLE/blob/667ba930db7ea0075421f3c7df94ffbc10b93805/eagle/model/modeling_llama_kv.py#L591-L592, the key and value of the current token have already been written into past_key_value, so there is no need to return them for operations outside the model. The modification itself does not affect model performance, but if you do not reset the length attribute of the KV cache after a generation, subsequent generations will be abnormal.
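The failure mode described above can be illustrated with a minimal sketch. This is a hypothetical simplification, not EAGLE's actual implementation: a fixed-size buffer is filled in place (the analogue of the in-place cat), the forward pass no longer needs to return the current token's key/value, and a `current_length` counter marks how much of the buffer is live. Forgetting to reset that counter between generations makes the next generation treat the previous request's entries as valid context.

```python
class PreallocatedKVCache:
    """Hypothetical sketch of a pre-allocated KV cache (not EAGLE's code).

    Keys/values are written in place into a fixed-size buffer, so the
    model can return None instead of the current token's key/value. The
    caller must reset current_length after each generation; otherwise
    stale entries from the previous request remain part of the context.
    """

    def __init__(self, max_len):
        self.max_len = max_len
        self.buffer = [None] * max_len  # pre-allocated slots
        self.current_length = 0         # number of live entries

    def append(self, kv):
        # In-place analogue of the cat operation: write the current
        # token's key/value into the next free slot.
        self.buffer[self.current_length] = kv
        self.current_length += 1

    def context(self):
        # Only the first current_length entries are the live context.
        return self.buffer[:self.current_length]

    def reset(self):
        # Must be called between generations. Note that the old entries
        # are NOT erased; they are merely marked as dead. Skipping this
        # reset is what produces the "abnormal generation" symptom.
        self.current_length = 0
```

In this sketch the cost of `reset` is O(1) because only the length counter changes; the stale data stays in the buffer, which is exactly why an unreset cache silently poisons the next generation rather than crashing.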