jamesoneill12 opened this issue 7 months ago
Hey James, Llama actually already supports static key-value caching natively within transformers. I'll put up a fix in the next few days so that models with native static key-value caching can also integrate into GPTFast.
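For reference, the native static KV cache in transformers is enabled through the generation config. Below is a minimal sketch assuming a recent transformers release with StaticCache support; the model name and generation settings are purely illustrative:

```python
# Minimal sketch of native static KV caching in Hugging Face transformers
# (assumes a recent release, roughly 4.38+). Model name and settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Use the static (fixed-size) KV cache instead of the default dynamic cache.
model.generation_config.cache_implementation = "static"

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```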
Oh that's awesome! Not completely related, but I've noticed meta-llama/LlamaGuard-7b is super fast out of the box for guardrailing (0.09-0.13 seconds of inference for 100 max new tokens with an input length of 400 tokens, for a single sample on an A100 80GB GPU with bfloat16 dtype), but I'm not seeing the same on other Llama architectures such as Llama-2-7b-chat-hf. Do you know if some of the Llama architectures have some inference optimization behind the scenes apart from KV caching?
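A rough sketch of how such a latency measurement could be reproduced (the prompt and settings are placeholders, not the exact benchmark that was run):

```python
# Rough latency measurement for generation; prompt and token counts are placeholders.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
)

prompt = "some prompt padded to roughly 400 input tokens ..."  # placeholder
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Warm-up run so CUDA kernels and caches are initialized before timing.
model.generate(**inputs, max_new_tokens=100)

torch.cuda.synchronize()
start = time.perf_counter()
model.generate(**inputs, max_new_tokens=100)
torch.cuda.synchronize()
print(f"generation took {time.perf_counter() - start:.3f}s")
```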
Hey, apologies for the late response. That is very interesting indeed! I would have to investigate how LlamaGuard-7b works under the hood to answer :)
No problem! That would be great actually, even if it's just what's supported in Transformers.
Hey James, this week is incredibly busy for me. I will do my best to have a new branch with the fixes up this weekend; if not, early next week.
No problem at all, can't wait for the release!
Hey James, I just pushed up my changes on the branch LlamaIntegration. The example for how it works with TinyLlama is under Examples.llama, but I don't have the GPU bandwidth to test on larger models. Let me know if my changes work with the specific Llama model that you had in mind, and I'll fix it asap if not. Thanks once again for pointing this out to me :)
Fantastic @MDK8888!! Can't wait to try this out. I'll let you know if there's anything to report on the larger Llama-based architectures.
Hi there,
Thanks for creating this repo. I wanted to know what the config should be for Llama-2-7b-chat-hf, if it's the one below for the GPT and OPT architectures?
I think it's something close to
but I'm running into the following error:
File "/GPTFast/Helpers/Class/add_str_as_func.py", line 9, in add_str_as_func func_code = compile(complete_func_str, "", "exec")
File "", line 19
input_pos: Optional[torch.Tensor] = None
So the parsing of the code string is somehow getting matched incorrectly at "decoder_layer". Any help in getting this working on the Llama architectures with this code would be appreciated.
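For context on where the error surfaces: the add_str_as_func helper compiles a generated function source string and binds it to the model, so if the string built around the decoder_layer substitution has mismatched indentation, compile() raises a SyntaxError whose line number points into the compiled "<string>". Below is a minimal sketch of that general string-to-method pattern (an assumption about the shape of the helper, not GPTFast's exact code):

```python
# Sketch of turning a source-code string into a bound method via compile()/exec();
# this illustrates the general pattern only, not GPTFast's exact helper.
import types

def add_str_as_func(obj, func_name, func_str, globals_dict=None):
    # Mismatched indentation anywhere in func_str raises SyntaxError here,
    # with line numbers referring to the compiled "<string>".
    code = compile(func_str, "<string>", "exec")
    namespace = dict(globals_dict or {})
    exec(code, namespace)
    # Bind the resulting function to obj as a method.
    setattr(obj, func_name, types.MethodType(namespace[func_name], obj))
    return obj

# Hypothetical usage:
# src = "def greet(self, name):\n    return f'hello {name}'"
# obj = add_str_as_func(obj, "greet", src)
```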