OpenBMB / MiniCPM

MiniCPM3-4B: An edge-side LLM that surpasses GPT-3.5-Turbo.
Apache License 2.0

[Bad Case]: Why is inference so much slower than even 9B models? #191

Closed lixiaoyuan1029 closed 2 months ago

lixiaoyuan1029 commented 2 months ago

Description

Deployed with vLLM: python -m vllm.entrypoints.openai.api_server --model /data2/MiniCPM --host 0.0.0.0 --port 10999 --max-model-len 2048 --served-model-name minicpm --trust_remote_code
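To see where the time goes, one quick check is to time a single streaming request against this server. Below is a minimal sketch, assuming the server launched by the command above is reachable at http://localhost:10999/v1 with the served model name minicpm (both taken from that command) and that the openai>=1.0 Python client is installed; it separates time to first token (queueing plus prefill) from total generation time.

```python
# Minimal latency probe for the vLLM OpenAI-compatible server started above.
# Assumptions: server at http://localhost:10999/v1, served model name "minicpm"
# (both from the launch command); openai>=1.0 Python client installed.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:10999/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_time = None
chunks = 0

# Stream the response so time-to-first-token can be measured separately
# from the total end-to-end time.
stream = client.chat.completions.create(
    model="minicpm",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.perf_counter() - start
        chunks += 1
total_time = time.perf_counter() - start

if first_token_time is not None:
    print(f"time to first token: {first_token_time:.3f}s")
print(f"total time: {total_time:.3f}s ({chunks} streamed chunks)")
```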

Ran a concurrency benchmark with evol-scope; the result is very slow. Results at 100 concurrency:

Benchmarking summary:
Time taken for tests: 384.675 seconds
Expected number of requests: 0
Number of concurrency: 100
Total requests: 1000
Succeed requests: 1000
Failed requests: 0
Average QPS: 2.600
Average latency: 36.727
Throughput (average output tokens per second): 1225.085
Average time to first token: 36.727
Average input tokens per request: 911.000
Average output tokens per request: 471.259
Average time per output token: 0.00082
Average package per request: 1.000
Average package latency: 36.727
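As a sanity check on these figures (arithmetic only, all inputs copied from the summary above), the reported numbers are internally consistent, and the per-request decode speed works out to roughly 13 output tokens per second, which is why each request feels slow:

```python
# Cross-checking the benchmark summary above; all inputs are copied from it.
total_requests = 1000
wall_time_s = 384.675
avg_output_tokens = 471.259
avg_latency_s = 36.727

qps = total_requests / wall_time_s                            # ~2.600, matches "Average QPS"
agg_tokens_per_s = qps * avg_output_tokens                    # ~1225.1, matches "Throughput"
per_request_tokens_per_s = avg_output_tokens / avg_latency_s  # ~12.8 tokens/s per request

print(f"QPS: {qps:.3f}")
print(f"aggregate output tokens/s: {agg_tokens_per_s:.1f}")
print(f"per-request decode speed: {per_request_tokens_per_s:.1f} tokens/s")
```

Note also that the average time to first token equals the average latency and each request returned a single package, which suggests responses were not streamed in this run, so the TTFT figure here is really end-to-end latency.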

Why is it still this slow with such a small parameter count? Did I get something wrong?

Case Explanation

No response

LDLINGLINGLING commented 2 months ago

Hi, was this run at 100 concurrency? I'd guess the 100 runs I did earlier at concurrency 1 were faster than this.

yaleimeng commented 2 months ago

Don't even get me started... with the default settings it's probably slower than a 200B model.