Closed by JianxinMa 6 months ago
I have added support for StreamingLLM for Qwen in the attention_sinks project; you can refer to it here: https://github.com/tomaarsen/attention_sinks/pull/15
StreamingLLM can indeed reduce VRAM usage, but it cannot extend the model's own context window (unlike rope_scaling). The StreamingLLM author gives a more detailed explanation in the FAQ: https://github.com/mit-han-lab/streaming-llm#faq
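To make that concrete, here is a rough, illustrative sketch (not the actual attention_sinks code) of the cache-eviction rule StreamingLLM applies: the first few "sink" tokens plus a recent window are kept, which bounds VRAM but means older context is simply dropped rather than made attendable.

```python
# Illustrative sketch only -- not the real attention_sinks implementation.
# StreamingLLM keeps the first `attention_sink_size` positions plus the most
# recent `window_size` positions in the KV cache; everything in between is
# evicted, so memory stays bounded but old context is not actually retained.
def kept_cache_positions(cache_len: int, attention_sink_size: int = 4,
                         window_size: int = 252) -> list[int]:
    if cache_len <= attention_sink_size + window_size:
        return list(range(cache_len))
    sinks = list(range(attention_sink_size))
    recent = list(range(cache_len - window_size, cache_len))
    return sinks + recent
```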
@Sanster Hello, I used the attention_sinks example code, but I still get OOM when feeding in long text. Am I using it incorrectly? Looking forward to your reply. The model used is Qwen-14B-Chat-Int4.

```python
from transformers import AutoTokenizer, TextStreamer, BertModel, BertTokenizer
from attention_sinks import AutoModelForCausalLM

device_map = "cuda:1"
model = AutoModelForCausalLM.from_pretrained(
    args.checkpoint_path,
    device_map=device_map,
    trust_remote_code=True,
    resume_download=True,
    attention_sink_size=4,
    attention_sink_window_size=252,
).eval()
tokenizer = AutoTokenizer.from_pretrained(args.checkpoint_path, trust_remote_code=True)
tokenizer.pad_token_id = tokenizer.eos_token_id
```
The text-generation part of the `/v1/chat/completions` endpoint was modified, following the example code, to the following:

```python
import torch
from transformers import GenerationConfig

input_ids = tokenizer.encode(query, return_tensors="pt").to(model.device)

with torch.no_grad():
    # A TextStreamer prints tokens as they're being generated
    streamer = TextStreamer(tokenizer)
    generated_tokens = model.generate(
        input_ids,
        generation_config=GenerationConfig(
            # use_cache=True is required, the rest can be changed up.
            use_cache=True,
            min_new_tokens=100_000,
            max_new_tokens=1_000_000,
            penalty_alpha=0.6,
            top_k=5,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        ),
        streamer=streamer,
    )

output_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)
```
The error log is as follows:
Qwen2 will support GQA and sliding window attention to address the memory requirements of long sequences. There are currently no plans to backport them to Qwen (1.0).
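For a rough sense of why GQA helps, here is a back-of-the-envelope sketch (the shapes below are illustrative, not the official Qwen2 configurations): the KV cache only stores `num_key_value_heads` key/value heads per layer instead of one pair per attention head.

```python
# Back-of-the-envelope sketch with illustrative numbers (not official configs):
# GQA shrinks the KV cache because only num_kv_heads K/V heads are cached.
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2):  # fp16/bf16
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem  # 2x for K and V

mha = kv_cache_bytes(seq_len=32_768, num_layers=32, num_kv_heads=32, head_dim=128)
gqa = kv_cache_bytes(seq_len=32_768, num_layers=32, num_kv_heads=4, head_dim=128)
print(f"MHA: {mha / 1024**3:.1f} GiB, GQA: {gqa / 1024**3:.1f} GiB")  # 16.0 GiB vs 2.0 GiB
```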
@jklj077 Although Qwen2 has implemented sliding window attention, the default configs of both Qwen2-7B and Qwen2-72B set the parameter use_sliding_window to False, so SWA is not used by default. I'm wondering whether enabling sliding window attention could cause an accuracy drop.
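If one wants to experiment with it anyway, a minimal sketch (assuming the standard `Qwen2Config` fields in transformers; whether accuracy holds is exactly the open question above) would be:

```python
# Sketch: turning on sliding window attention for a Qwen2 checkpoint.
# Field names follow transformers' Qwen2Config; released configs ship with
# use_sliding_window=False, so this is an experiment, not a recommendation.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("Qwen/Qwen2-7B")
config.use_sliding_window = True   # off by default
config.sliding_window = 4096       # attend only to the most recent 4096 tokens
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B", config=config)
```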
起始日期 | Start Date
No response
实现PR | Implementation PR
I'm opening this issue here so that we can track progress on the long-context extension with minimal VRAM requirements. Many users have been experiencing out-of-memory (OOM) issues when dealing with long documents.
Please let me know if you need help. I will see if I can be of any use.
相关Issues | Reference Issues
No response
摘要 | Summary
OOM when handling long sequences.
基本示例 | Basic Example
https://github.com/QwenLM/Qwen-Agent/issues/22#issuecomment-1751868858 https://github.com/QwenLM/Qwen-Agent/issues/3#issue-1914489384
缺陷 | Drawbacks
It may involve writing CUDA kernels and can be challenging to implement efficiently.
未解决问题 | Unresolved questions
No response