uebian closed this issue 2 months ago.
Hi! Thanks for your interest!
Could you please try printing out the loaded model's architecture? The modification fails when the target module cannot be found. In your case, for Gemma-2, Gemma2ForCausalLM should be modified rather than GemmaForCausalLM; they are different classes in Hugging Face Transformers. We haven't implemented SelfExtend for Gemma-2 yet.
Another possible problem: almost every release of Hugging Face Transformers makes some changes to the {Model_name}ForCausalLM classes. We will check the newest Gemma implementation in Transformers and release an update if needed.
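For example, a quick way to check which architecture was actually loaded is to print the model's class name and module tree (a minimal sketch; the checkpoint name is just an example):
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")
print(type(model).__name__)  # 'GemmaForCausalLM' for Gemma 1, 'Gemma2ForCausalLM' for Gemma 2
print(model)                 # full module tree, including each layer's attention class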
Hi, thank you for getting back. I'm using gemma-2b-it, which is a Gemma 1 model rather than Gemma 2. The model can be downloaded from https://huggingface.co/google/gemma-2b-it
Sorry for the oversight. Could you please share the output of print(loaded_model)? This should print the names of all modules in the loaded model.
Sorry for the delayed response. I have printed some information that might help figure out the issue.
Code:
import warnings
warnings.filterwarnings("ignore")
import torch
import json
import time
from transformers.models.llama.modeling_llama import LlamaAttention
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
import SelfExtend
window_size = 1024
group_size = 32
model_id = '/tmp/gemma-2b-it/'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
print(model)
Output:
GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x GemmaDecoderLayer(
        (self_attn): GemmaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear(in_features=2048, out_features=16384, bias=False)
          (up_proj): Linear(in_features=2048, out_features=16384, bias=False)
          (down_proj): Linear(in_features=16384, out_features=2048, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): GemmaRMSNorm()
        (post_attention_layernorm): GemmaRMSNorm()
      )
    )
    (norm): GemmaRMSNorm()
  )
  (lm_head): Linear(in_features=2048, out_features=256000, bias=False)
)
I also found that CodeLlama cannot be loaded. Model structure:
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32016, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=32016, bias=False)
)
Error log:
Traceback (most recent call last):
  File "/home/ubuntu/LongLM/test.py", line 27, in <module>
    SelfExtend.apply(model, group_size, window_size)
  File "/home/ubuntu/LongLM/SelfExtend.py", line 123, in apply
    raise Exception(f"Failed to modify the attention method of {arch_name}")
Exception: Failed to modify the attention method of LlamaForCausalLM
It seems the modification failure is caused by a change of the default attention module. The modification function assumes the default attention class is "LlamaAttention"/"GemmaAttention", but it is actually "LlamaSdpaAttention"/"GemmaSdpaAttention". You may refer to: https://github.com/datamllab/LongLM/issues/23#issuecomment-1986716092
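To illustrate the point, here is a minimal sketch of what a name-based instance patcher like modify_method_of_instance can look like (an assumption for illustration only; the actual code in SelfExtend.py may differ):
import types

def modify_method_of_instance(loaded_model, target_cls_name, method_name, new_method):
    # Walk the module tree and rebind `method_name` on every module whose
    # class name matches `target_cls_name` exactly.
    modified = False
    for module in loaded_model.modules():
        if module.__class__.__name__ == target_cls_name:
            setattr(module, method_name, types.MethodType(new_method, module))
            modified = True
    return modified
With an exact name match, "LlamaAttention" never matches the LlamaSdpaAttention instances that recent Transformers versions create by default, so the function returns False and apply() raises the exception shown above; passing "LlamaSdpaAttention" (as in the patch below) makes it match.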
Yes, by replacing "LlamaAttention" with "LlamaSdpaAttention", it works. Thank you very much.
FYI: below is the patch I applied:
diff --git a/SelfExtend.py b/SelfExtend.py
index 8f294fa..2aee66d 100644
--- a/SelfExtend.py
+++ b/SelfExtend.py
@@ -116,9 +116,9 @@ def apply(loaded_model, group_size, window_size, enable_flash_attention=False, s
group_size_1=group_size,
group_size_2=window_size,
scale_base=scale_base)
- # after the default version of attention in 4.36 is LlamaSpdaAttention, but in before 4,36 or in 4.38, it is LlamaAttention
+ # after the default version of attention in 4.36 is LlamaSdpaAttention, but in before 4,36 or in 4.38, it is LlamaAttention
# print("loaded_model", loaded_model)
- modifed_2 = modify_method_of_instance(loaded_model, "LlamaAttention", "forward", self_extend_attention_forward)
+ modifed_2 = modify_method_of_instance(loaded_model, "LlamaSdpaAttention", "forward", self_extend_attention_forward)
if not modifed_2:
raise Exception(f"Failed to modify the attention method of {arch_name}")
elif 'Mistral' in arch_name:
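A possible alternative (untested here, and dependent on your Transformers version; the attn_implementation argument is available from 4.36 on) is to force the eager attention implementation at load time, so the modules are LlamaAttention/GemmaAttention again and the original patch target matches without editing SelfExtend.py:
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",  # loads LlamaAttention instead of LlamaSdpaAttention
)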
For CPU users, use the fork from #25.
It worked for me.
I found that the current version of LongLM cannot load Gemma 1 or Gemma 2 models successfully. I wrote a minimal test to help reproduce the issue:
While trying to load the model, it fails with the error message below:
I found that it fails in the duplicate check at line 24 of SelfExtend.py; when it fails, instance = False. Below is a conda env export dump with the package details of my Python environment: