Checklist
[X] 1. I have searched related issues but cannot get the expected help.
[X] 2. The bug has not been fixed in the latest version.
[X] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
As shown in the screenshots, I send requests concurrently from five threads (1001-1005). Each request has different content, each has roughly 80,000 input tokens, and prefix caching is enabled. On the first run none of the requests hit the cache; the first screenshot shows the response times of the five requests.
On the second run I changed the input of the first request (1001) so that it could no longer hit the cache. In theory only that request should miss the cache and need a long processing time. In practice, only 1002 received its first frame quickly; for 1003-1005 the first frame and the last frame arrived at almost the same time, at around 30 s. The other four requests are evidently not being processed normally.
Note: the model is an AWQ-quantized Qwen2.5 14B deployed on a single A800, with prefix caching and KV cache quantization enabled.
Reproduction
First, start the model with the following command:
CUDA_VISIBLE_DEVICES=5 lmdeploy serve api_server /mnt/qwen2.5/qwen14bInt/Qwen/Qwen2___5-14B-Instruct-AWQ --backend turbomind --server-port 35553 --model-name qwenInt4 --model-format awq --session-len 100000 --cache-block-seq-len 512 --max-batch-size 512 --enable-prefix-caching --log-level INFO --cache-max-entry-count 0.8 --quant-policy=4 >> /mnt/qwen2.5/qwenInt4/qwen14btmp1.txt 2>&1
Then test with this script:
```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

url = "http://localhost:35553/v1/chat/completions"
headers = {'Content-Type': 'application/json'}


def make_request(thread_id, content):
    data = {
        "model": "qwenInt4",
        "messages": [{"role": "user", "content": content}],
        "temperature": 0.1,
        "top_p": 1,
        "max_tokens": 2000,
        "stream": True,
    }
    start_time = time.time()
    with requests.post(url, headers=headers, json=data, stream=True) as response:
        first_frame_time = None
        for line in response.iter_lines(decode_unicode=True):
            if first_frame_time is None:
                # Record how long the first frame took to arrive
                first_frame_time = time.time() - start_time
            # Print each line, or run other logic
            # print(f"Thread-{thread_id}: {line}")
    end_time = time.time()
    return (thread_id, first_frame_time, end_time - start_time)


def run_threads(n_threads, contents, start_id):
    with ThreadPoolExecutor(max_workers=n_threads) as executor:
        futures = [executor.submit(make_request, start_id + i, contents[i]) for i in range(n_threads)]
        for future in as_completed(futures):
            thread_id, first_frame_time, total_time = future.result()
            print(f"Thread-{thread_id} first frame: {first_frame_time:.3f} s, total time: {total_time:.3f} s")


if __name__ == '__main__':
    contents = ["", "", "", "", ""]  # each thread sends different content
    n_threads = len(contents)
    start_id = 1001  # first thread id
    print(f"\nRunning with {n_threads} threads:")
    run_threads(n_threads, contents, start_id)
```
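Finally, put five different strings into `contents` and run the script to get the results of the first run. Then change the content of `contents[0]` and run it again to get the results of the second run.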
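For anyone reproducing this without suitable long documents at hand, here is a minimal sketch of one way to fill `contents` with five long strings that differ from the very first word, so that no two requests should share a cacheable prefix. The helper `make_prompt`, the seed names, and the hash-derived filler text are all hypothetical and are not the inputs used in the runs described above. `n_words` is only a rough stand-in for prompt length and needs tuning so that the tokenized prompt approaches the intended size while staying below `--session-len`.

```python
import hashlib


def make_prompt(seed: str, n_words: int = 10000) -> str:
    """Build a deterministic long prompt that is unique to `seed` (hypothetical helper)."""
    words = []
    counter = 0
    while len(words) < n_words:
        # Derive filler from the seed so that different seeds share no text.
        digest = hashlib.sha256(f"{seed}-{counter}".encode()).hexdigest()
        # Split the 64-character hex digest into eight 8-character "words".
        words.extend(digest[i:i + 8] for i in range(0, 64, 8))
        counter += 1
    # The seed tag comes first, so prompts already differ at their first characters.
    return f"[{seed}] " + " ".join(words[:n_words]) + "\nSummarize the text above."


# First run: five distinct prompts.
contents = [make_prompt(f"doc-{i}") for i in range(5)]

# Second run: only the first prompt changes; the other four stay identical.
second_run = [make_prompt("doc-0-changed")] + contents[1:]
```

Because the prompts are deterministic, the four unchanged entries in `second_run` are byte-for-byte identical to the first run, which is the condition under which they would be expected to hit the prefix cache.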
Environment
Error traceback