Description / 描述
Deployed with vLLM: python -m vllm.entrypoints.openai.api_server --model /data2/MiniCPM --host 0.0.0.0 --port 10999 --max-model-len 2048 --served-model-name minicpm --trust_remote_code
Ran a concurrency test with evalscope; it is very slow. Results at 100 concurrency:

Benchmarking summary:
Time taken for tests: 384.675 seconds
Expected number of requests: 0
Number of concurrency: 100
Total requests: 1000
Succeed requests: 1000
Failed requests: 0
Average QPS: 2.600
Average latency: 36.727
Throughput(average output tokens per second): 1225.085
Average time to first token: 36.727
Average input tokens per request: 911.000
Average output tokens per request: 471.259
Average time per output token: 0.00082
Average package per request: 1.000
Average package latency: 36.727
Why is it still this slow when the model has so few parameters? Did I misconfigure something?
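For anyone trying to reproduce this without evalscope, here is a rough client-side sketch of the same 100-way concurrency test against the OpenAI-compatible endpoint started above (1000 requests total, matching the summary); the URL assumes the client runs on the same host, and the prompt and max_tokens are placeholders:

```python
# Crude concurrency test against the OpenAI-compatible
# /v1/chat/completions endpoint started above. The prompt and
# max_tokens are placeholders, not the evalscope dataset.
import asyncio
import time

import aiohttp

URL = "http://localhost:10999/v1/chat/completions"
PAYLOAD = {
    "model": "minicpm",
    "messages": [{"role": "user", "content": "Introduce yourself."}],
    "max_tokens": 512,
}

async def one_request(session: aiohttp.ClientSession) -> float:
    start = time.perf_counter()
    async with session.post(URL, json=PAYLOAD) as resp:
        await resp.json()
    return time.perf_counter() - start

async def main(total: int = 1000, concurrency: int = 100) -> None:
    sem = asyncio.Semaphore(concurrency)

    async def bounded(session: aiohttp.ClientSession) -> float:
        async with sem:
            return await one_request(session)

    start = time.perf_counter()
    async with aiohttp.ClientSession() as session:
        latencies = await asyncio.gather(
            *(bounded(session) for _ in range(total))
        )
    wall = time.perf_counter() - start
    print(f"QPS: {total / wall:.2f}, "
          f"average latency: {sum(latencies) / total:.2f}s")

asyncio.run(main())
```

Note that this measures whole-response latency only; the summary above reports time to first token equal to average latency with one package per request, which suggests the benchmark ran without streaming.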
Case Explanation / 案例解释
No response