HKUNLP / ChunkLlama

[ICML'24] Data and code for our paper "Training-Free Long-Context Scaling of Large Language Models"
Apache License 2.0

Why the pretraining_length = 32384 #20

Closed: relic-yuexi closed this issue 3 weeks ago

relic-yuexi commented 1 month ago

Thank you for your contributions. I have a question regarding why the pretraining_length is 32384, while in https://huggingface.co/Qwen/Qwen1.5-14B-Chat/blob/main/config.json, the "max_position_embeddings" is 32768. Is there something I'm missing?

ChenxinAn-fdu commented 1 month ago

Sorry!! It should be 32768 and I have corrected this. But setting it to a smaller value like 32384 will not affect performance.
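
For reference, a minimal sketch of the corrected call (using replace_with_chunkqwen from chunkqwen_attn_replace.py):

from chunkqwen_attn_replace import replace_with_chunkqwen

# Match the model's max_position_embeddings (32768 for Qwen1.5-14B-Chat).
# A slightly smaller value such as 32384 also works without hurting performance.
replace_with_chunkqwen(pretraining_length=32768)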

relic-yuexi commented 1 month ago

I see that you have used modeling_qwen2, but I get an error when I use Qwen2-0.5B.

The error seems to come from this line. Is something wrong here?

https://github.com/HKUNLP/ChunkLlama/blob/e2500d4251f201f2ca26e8ca3ed8a46145e10119/chunkqwen_attn_replace.py#L158

flash_attn_interface.py:51, in _flash_attn_forward(q, k, v, dropout_p, softmax_scale, causal, window_size, alibi_slopes, return_softmax)
     49 maybe_contiguous = lambda x: x.contiguous() if x.stride(-1) != 1 else x
     50 q, k, v = [maybe_contiguous(x) for x in (q, k, v)]
---> 51 out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.fwd(
     52     q,
     53     k,
     54     v,
     55     None,
     56     alibi_slopes,
     57     dropout_p,
     58     softmax_scale,
     59     causal,
     60     window_size[0],
     61     window_size[1],
     62     return_softmax,
     63     None,
     64 )
     65 return out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state

RuntimeError: Number of heads in key/value must divide number of heads in query

ChenxinAn-fdu commented 1 month ago

Thank you for letting me know! The code is tested on Qwen2-7B. Can you provide the shapes of (q_states_intra, k_states_prev, v_states_prev) for me?
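
For example, a temporary print just before the flash-attention call (around the line linked above) should be enough; the variable names here are assumed to match chunkqwen_attn_replace.py:

# temporary debug print: report the query/key/value shapes before flash attention
print("q_states_intra:", tuple(q_states_intra.shape))
print("k_states_prev: ", tuple(k_states_prev.shape))
print("v_states_prev: ", tuple(v_states_prev.shape))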

relic-yuexi commented 1 month ago

Sorry, can you give me your transformers version? I upgraded mine and got an error.
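
For reference, the installed version can be printed with:

python -c "import transformers; print(transformers.__version__)"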

relic-yuexi commented 1 month ago

Maybe you can try the following code:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "4"

from transformers import AutoTokenizer, AutoModelForCausalLM
from chunkqwen_attn_replace import replace_with_chunkqwen
import torch

# patch Qwen2 attention with the chunked attention before loading the model
replace_with_chunkqwen(pretraining_length=131072)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B", attn_implementation="flash_attention_2", trust_remote_code=True, torch_dtype=torch.bfloat16).to("cuda")
inputs = tokenizer("Long...docs\n Q: How to extend the context window of LLMs? ", return_tensors="pt").to(model.device)
print(f"Length of input: {inputs['input_ids'].shape[1]}")
output_ids = model.generate(**inputs, max_length=128)[0]
print(tokenizer.decode(output_ids))

relic-yuexi commented 1 month ago

python test_ppl.py --seq_len 16384 --scale 7b --data_path pg19_llama2.validation.bin

I get this error:

    if seq_len > self.max_seq_len:
RuntimeError: Boolean value of Tensor with more than one value is ambiguous

Adding print(seq_len) and print(self.max_seq_len) for debugging shows:

tensor([[    0,     1,     2,  ..., 16381, 16382, 16383]], device='cuda:0')
4096

relic-yuexi commented 1 month ago

[screenshot attached]

ChenxinAn-fdu commented 1 month ago

OK, I think this is caused by the transformers version. I use transformers==4.37.2. The number of K/V heads seems to be wrongly repeated. Please comment out the repeat_kv calls (lines 130 and 131).
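
A rough sketch of that change, assuming the file mirrors the Hugging Face Qwen2 attention code (flash attention repeats the K/V heads internally, so they should keep num_key_value_heads and not be expanded here):

# chunkqwen_attn_replace.py, around lines 130-131:
# key_states = repeat_kv(key_states, self.num_key_value_groups)
# value_states = repeat_kv(value_states, self.num_key_value_groups)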