relic-yuexi closed this issue 3 weeks ago
Sorry! It should be 32768, and I have corrected this. Setting it to a smaller value such as 32384 will not affect performance.
I see that you use modeling_qwen2, but I get an error when I use Qwen2-0.5B.
The error seems to come from this line. Is something wrong here?
flash_attn_interface.py:51, in _flash_attn_forward(q, k, v, dropout_p, softmax_scale, causal, window_size, alibi_slopes, return_softmax)
49 maybe_contiguous = lambda x: x.contiguous() if x.stride(-1) != 1 else x
50 q, k, v = [maybe_contiguous(x) for x in (q, k, v)]
---> 51 out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.fwd(
52 q,
53 k,
54 v,
55 None,
56 alibi_slopes,
57 dropout_p,
58 softmax_scale,
59 causal,
60 window_size[0],
61 window_size[1],
62 return_softmax,
63 None,
64 )
65 return out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state
RuntimeError: Number of heads in key/value must divide number of heads in query
Thank you for letting me know! The code was tested on Qwen2-7B.
Can you provide the shapes of (q_states_intra, k_states_prev, v_states_prev) for me?
Sorry, could you tell me your transformers version? I upgraded it and now get an error.
Maybe you can try the following code:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "4"
from transformers import AutoTokenizer, AutoModelForCausalLM
from chunkqwen_attn_replace import replace_with_chunkqwen
import torch
replace_with_chunkqwen(pretraining_length=131072)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B", attn_implementation="flash_attention_2", trust_remote_code=True, torch_dtype=torch.bfloat16).to("cuda")
inputs = tokenizer("Long...docs\n Q: How to extend the context window of LLMs? ", return_tensors="pt").to(model.device)
print(f"Length of input: {inputs['input_ids'].shape[1]}")
output_ids = model.generate(**inputs, max_length=128)[0]
print(tokenizer.decode(output_ids))
python test_ppl.py --seq_len 16384 --scale 7b --data_path pg19_llama2.validation.bin
This fails with the following error:
if seq_len > self.max_seq_len:
RuntimeError: Boolean value of Tensor with more than one value is ambiguous
print(seq_len)
print(self.max_seq_len)
tensor([[ 0, 1, 2, ..., 16381, 16382, 16383]], device='cuda:0')
4096
OK, I think this is caused by the transformers version. I use transformers==4.37.2.
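If pinning the version is not an option, here is a minimal sketch of one possible workaround. It is not the fix used in this repo. It assumes that the value arriving as seq_len is a tensor of position ids (0 to n-1), as the printed output above suggests, and the helper name to_seq_len is hypothetical:

```python
def to_seq_len(seq_len_or_position_ids):
    """Return an integer sequence length.

    Accepts either a plain int or a tensor-like object holding position ids
    (0, 1, ..., n-1), as in the printed value above. Hypothetical helper.
    """
    if isinstance(seq_len_or_position_ids, int):
        return seq_len_or_position_ids
    # Position ids start at 0, so the length is the largest id plus one.
    return int(seq_len_or_position_ids.max()) + 1


max_seq_len = 4096
seq_len = to_seq_len(16384)
if seq_len > max_seq_len:  # a plain int comparison, so no ambiguity error
    print(f"need to extend cache from {max_seq_len} to {seq_len}")
```

With a position-id tensor like the one printed above, the same call would return 16384, and the comparison with self.max_seq_len becomes an ordinary integer comparison.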
The number of K/V heads seems to be repeated incorrectly.
Please comment out the repeat_kv calls (lines 130 and 131).
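For context, flash-attn handles grouped-query attention itself, so K and V can be passed with fewer heads than Q as long as the K/V head count divides the Q head count. Here is a minimal sketch of that check in plain Python. The function name kv_repeat_factor and the head counts are illustrative assumptions, not values taken from this repo:

```python
def kv_repeat_factor(num_q_heads: int, num_kv_heads: int) -> int:
    """Return how many query heads share each K/V head.

    Raises ValueError when the K/V head count does not divide the query
    head count, which is the condition the RuntimeError above reports.
    """
    if num_kv_heads <= 0 or num_q_heads % num_kv_heads != 0:
        raise ValueError(
            f"{num_kv_heads} K/V heads do not divide {num_q_heads} query heads"
        )
    return num_q_heads // num_kv_heads


# Illustrative head counts (assumed, check your model's config.json).
print(kv_repeat_factor(14, 2))  # 7

# If K/V were already repeated by a wrong factor (say 2 heads -> 8),
# the divisibility check fails.
try:
    kv_repeat_factor(14, 8)
except ValueError as err:
    print(f"invalid: {err}")
```

Printing q.shape, k.shape and v.shape just before the flash-attn call and running the head dimensions through a check like this shows quickly whether an extra repeat_kv is the cause.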
Thank you for your contributions. I have a question regarding why the pretraining_length is 32384, while in https://huggingface.co/Qwen/Qwen1.5-14B-Chat/blob/main/config.json, the "max_position_embeddings" is 32768. Is there something I'm missing?