jamesoneill12 opened this issue 7 months ago
Hey James, Llama actually already supports static key-value caching natively within transformers. I'll put up a fix in the next few days so that models with native static key-value caching can also integrate into GPTFast.
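For reference, the native static KV cache in transformers is enabled through the generation config. Below is a minimal sketch assuming a recent transformers release with StaticCache support; the model name and generation settings are purely illustrative:

```python
# Minimal sketch of native static KV caching in Hugging Face transformers
# (assumes a recent release, roughly 4.38+). Model name and settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Use the static (fixed-size) KV cache instead of the default dynamic cache.
model.generation_config.cache_implementation = "static"

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```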
Oh that's awesome! Not completely related, but I've noticed meta-llama/LlamaGuard-7b is super fast out of the box for guardrailing (0.09-0.13 seconds of inference for 100 max new tokens with an input length of 400 tokens, for a single sample on an A100 80GB GPU with bfloat16 dtype), but I'm not seeing the same on other Llama architectures such as Llama-2-7b-chat-hf. Do you know if some of the Llama architectures have some inference optimization behind the scenes apart from KV caching?
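A rough sketch of how such a latency measurement could be reproduced (the prompt and settings are placeholders, not the exact benchmark that was run):

```python
# Rough latency measurement for generation; prompt and token counts are placeholders.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
)

prompt = "some prompt padded to roughly 400 input tokens ..."  # placeholder
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Warm-up run so CUDA kernels and caches are initialized before timing.
model.generate(**inputs, max_new_tokens=100)

torch.cuda.synchronize()
start = time.perf_counter()
model.generate(**inputs, max_new_tokens=100)
torch.cuda.synchronize()
print(f"generation took {time.perf_counter() - start:.3f}s")
```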
Hey, apologies for the late response. That is very interesting indeed! I would have to investigate how LlamaGuard-7b works under the hood to answer :)
No problem! That would be great actually, even if it's just what's supported in Transformers.
Hey James, this week is incredibly busy for me. I will do my best to have a new branch with the fixes up this weekend; if not, early next week.
No problem at all, can't wait for the release!
Hey James, I just pushed up my changes on the branch LlamaIntegration. The example for how it works with TinyLlama is under Examples.llama, but I don't have the GPU bandwidth to test on larger models. Let me know if my changes work with the specific Llama model that you had in mind, and I'll fix it asap if not. Thanks once again for pointing this out to me :)
Fantastic @MDK8888!! Can't wait to try this out. I'll let you know if there's anything to report on the larger Llama-based architectures.
Hi there,
Thanks for creating this repo. I wanted to know what the config should be for Llama-2-7b-chat-hf, if it's the one below for the GPT and OPT architectures?
I think it's something close to
but I'm running into the following error:
File "/GPTFast/Helpers/Class/add_str_as_func.py", line 9, in add_str_as_func func_code = compile(complete_func_str, "", "exec")
File "", line 19
input_pos: Optional[torch.Tensor] = None
So the parsing of the code string is somehow getting matched incorrectly at "decoder_layer". Any help in getting this working on the Llama architectures with this code would be appreciated.
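For context on where the error surfaces: the add_str_as_func helper compiles a generated function source string and binds it to the model, so if the string built around the decoder_layer substitution has mismatched indentation, compile() raises a SyntaxError whose line number points into the compiled "<string>". Below is a minimal sketch of that general string-to-method pattern (an assumption about the shape of the helper, not GPTFast's exact code):

```python
# Sketch of turning a source-code string into a bound method via compile()/exec();
# this illustrates the general pattern only, not GPTFast's exact helper.
import types

def add_str_as_func(obj, func_name, func_str, globals_dict=None):
    # Mismatched indentation anywhere in func_str raises SyntaxError here,
    # with line numbers referring to the compiled "<string>".
    code = compile(func_str, "<string>", "exec")
    namespace = dict(globals_dict or {})
    exec(code, namespace)
    # Bind the resulting function to obj as a method.
    setattr(obj, func_name, types.MethodType(namespace[func_name], obj))
    return obj

# Hypothetical usage:
# src = "def greet(self, name):\n    return f'hello {name}'"
# obj = add_str_as_func(obj, "greet", src)
```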