QwenLM / Qwen

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
Apache License 2.0

💡 [REQUEST] - Streaming LLM Support, or Any Better Solution? #421

Closed · JianxinMa closed this 6 months ago

JianxinMa commented 12 months ago

Start Date

No response

Implementation PR

I'm opening this issue here so that we can track progress on the long-context extension with minimal VRAM requirements. Many users have been experiencing out-of-memory (OOM) issues when dealing with long documents.

Please let me know if you need help. I will see if I can be of any use.

Reference Issues

No response

Summary

OOM when handling long sequences.

Basic Example

https://github.com/QwenLM/Qwen-Agent/issues/22#issuecomment-1751868858
https://github.com/QwenLM/Qwen-Agent/issues/3#issue-1914489384

Drawbacks

It may involve writing CUDA kernels and can be challenging to implement efficiently.

Unresolved questions

No response

Sanster commented 12 months ago

I have added StreamingLLM support for Qwen in the attention_sinks project; you can refer to https://github.com/tomaarsen/attention_sinks/pull/15


StreamingLLM can indeed reduce VRAM usage, but it cannot extend the model's context window itself (unlike rope_scaling). The author of StreamingLLM gives a more detailed explanation in the FAQ: https://github.com/mit-han-lab/streaming-llm#faq
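
For intuition, here is a minimal sketch of the cache policy the StreamingLLM / attention-sinks approach applies: keep a few initial "sink" tokens plus a sliding window of the most recent tokens, and evict everything in between. The function name and sizes below are illustrative only, not the actual attention_sinks implementation.

# Illustrative sketch of the StreamingLLM-style KV-cache eviction policy:
# keep `sink_size` initial tokens plus the most recent `window_size` tokens.
# Mirrors the idea behind attention_sink_size / attention_sink_window_size,
# but is NOT the real attention_sinks code.

def evict_kv_cache(cache_positions, sink_size=4, window_size=252):
    """Return the token positions that remain in the KV cache."""
    if len(cache_positions) <= sink_size + window_size:
        return list(cache_positions)            # nothing to evict yet
    sinks = cache_positions[:sink_size]         # always keep the first tokens
    recent = cache_positions[-window_size:]     # plus a window of recent tokens
    return list(sinks) + list(recent)           # middle tokens are dropped

# After 10_000 tokens the cache holds only 4 + 252 = 256 entries, so VRAM stays
# bounded -- but the model can no longer attend to the evicted middle, which is
# why the usable context window is not actually extended.
print(len(evict_kv_cache(range(10_000))))  # 256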

panjican commented 11 months ago

@Sanster Hello, I used the attention_sinks example code, but I still get OOM when the input is a long text. Is there something wrong with my usage? Looking forward to your reply. The model used is Qwen-14B-Chat-Int4.

import torch
from transformers import AutoTokenizer, TextStreamer, GenerationConfig
from attention_sinks import AutoModelForCausalLM

device_map = "cuda:1"

# args.checkpoint_path comes from the script's argument parser (not shown)
model = AutoModelForCausalLM.from_pretrained(
    args.checkpoint_path,
    device_map=device_map,
    trust_remote_code=True,
    resume_download=True,
    attention_sink_size=4,
    attention_sink_window_size=252,
).eval()
tokenizer = AutoTokenizer.from_pretrained(args.checkpoint_path, trust_remote_code=True)
tokenizer.pad_token_id = tokenizer.eos_token_id

# The text-generation part of the /v1/chat/completions endpoint was modified
# to the following, based on the example code:
input_ids = tokenizer.encode(query, return_tensors="pt").to(model.device)
with torch.no_grad():
    # A TextStreamer prints tokens as they're being generated
    streamer = TextStreamer(tokenizer)
    generated_tokens = model.generate(
        input_ids,
        generation_config=GenerationConfig(
            # use_cache=True is required, the rest can be changed up.
            use_cache=True,
            min_new_tokens=100_000,
            max_new_tokens=1_000_000,
            penalty_alpha=0.6,
            top_k=5,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        ),
        streamer=streamer,
    )
    output_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)

The error log is as follows:

[screenshot of the error log]

jklj077 commented 6 months ago

Qwen2 will support GQA and sliding window attention to address the memory requirements of long sequences. There are currently no plans to backport them to Qwen (1.0).
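
As a rough illustration of why GQA helps here: the KV cache grows with the number of key/value heads, so grouping them shrinks it proportionally. The dimensions below are hypothetical, chosen only for a back-of-the-envelope comparison, and are not Qwen2's actual configuration.

# Back-of-the-envelope KV-cache size:
#   2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes_per_element.
# Hypothetical dimensions, only to show how fewer KV heads (GQA) shrink the cache.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

seq_len = 32_000
mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=seq_len)  # every head keeps KV
gqa = kv_cache_bytes(layers=32, kv_heads=4,  head_dim=128, seq_len=seq_len)  # grouped KV heads

print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB")
# With 32 vs 4 KV heads, the GQA cache is 8x smaller at the same sequence length.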

ehuaa commented 3 months ago

> Qwen2 will support GQA and sliding window attention to address the memory requirements of long sequences. There are currently no plans to backport them to Qwen (1.0).

@jklj077 Although Qwen2 has implemented sliding window attention, the default configs of Qwen2-7B and Qwen2-72B both set use_sliding_window to False, so SWA is not used by default. I'm wondering whether enabling sliding window attention may cause an accuracy drop.
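
For reference, the flag in question lives in the model's config. A minimal sketch of toggling it at load time (field names follow the Hugging Face Qwen2 config; the default values and whether enabling SWA affects accuracy should be checked against the shipped config.json and the question above):

from transformers import AutoConfig, AutoModelForCausalLM

# Sketch: enabling sliding window attention via the Qwen2 config.
config = AutoConfig.from_pretrained("Qwen/Qwen2-7B")
print(config.use_sliding_window)   # False by default, per the shipped config
config.use_sliding_window = True   # turn on SWA
# config.sliding_window and config.max_window_layers control the window size
# and which layers use it; see the model's config.json for shipped values.
# Note: depending on the transformers version, SWA may only take effect with
# certain attention backends (e.g. flash_attention_2).

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B", config=config)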