grimoire closed this pull request 3 months ago.
@zhulinJulia24
This PR brings in a minor reduction in inference speed. Could you help quantify it? The candidate models are llama3-8b, mixtral-moe-8x7b, internlm2-20b and llama3-70b.
The models' evaluation has to be performed, too. @zhulinJulia24
Do we really need to run a full evaluation? For example, could we verify basic correctness by comparing the outputs at temperature 0 with the results from transformers, and only run the evaluation when necessary? Of course, this is just my suggestion; if resources are sufficient, quickly running an evaluation task should not be a problem.
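To make the suggestion concrete, here is a minimal sketch of such a temperature-0 cross-check. It assumes lmdeploy's pipeline API with the PyTorch backend and uses transformers greedy decoding as the reference; the model path, prompts and token budget are placeholders, and chat-template differences between the two stacks are ignored.

```python
# Minimal sketch of a greedy (temperature -> 0) cross-check against transformers.
# Model path, prompts and max_new_tokens are placeholders; chat templating is ignored.
from transformers import AutoModelForCausalLM, AutoTokenizer

from lmdeploy import GenerationConfig, PytorchEngineConfig, pipeline

model_path = 'internlm/internlm2-chat-20b'               # placeholder
prompts = ['Briefly explain what paged attention is.']   # placeholder

# lmdeploy PyTorch engine, deterministic decoding (top_k=1 is effectively greedy).
pipe = pipeline(model_path, backend_config=PytorchEngineConfig(tp=1))
lm_outputs = pipe(prompts, gen_config=GenerationConfig(max_new_tokens=64, top_k=1))

# transformers reference, also greedy.
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
ref_model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map='auto', trust_remote_code=True)
inputs = tokenizer(prompts, return_tensors='pt').to(ref_model.device)
ref_ids = ref_model.generate(**inputs, max_new_tokens=64, do_sample=False)
ref_texts = tokenizer.batch_decode(
    ref_ids[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)

# Exact string equality is usually too strict (tiny kernel-level numeric
# differences can flip a late token), so compare prefixes or token overlap.
for out, ref in zip(lm_outputs, ref_texts):
    print('match' if out.text.strip()[:64] == ref.strip()[:64] else 'mismatch')
```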
https://github.com/InternLM/lmdeploy/actions/runs/9600920652
@grimoire @lvhan028 llama3 loses 10+ points in the precision evaluation, and internlm2-chat-20b loses 5+ points compared to HF transformers' precision.
All precision scores are improved compared to https://github.com/zhulinJulia24/lmdeploy/actions/runs/9240064913, which is the 0.4.2 version's precision.
lmdeploy serve api_server /mnt/models-new/llm_models/models--meta-llama--Meta-Llama-3-70B-Instruct/snapshots/0cac6d727e4cdf117e1bde11e4c7badd8b963919 --server-port 24555 --tp 4 --backend pytorch
concurrency: 256
elapsed_time: 483.240s
first_token latency(min, max, ave): 2.923s, 371.669s, 29.348s
number of prompt tokens: 447592
number of completion tokens: 404681
token throughput (completion token): 837.433 token/s
token throughput (prompt + completion token): 1763.666 token/s
RPS (request per second): 4.139 req/s
RPM (request per minute): 248.324 req/min
concurrency: 128
elapsed_time: 489.771s
first_token latency(min, max, ave): 0.465s, 17.343s, 3.991s
number of prompt tokens: 447592
number of completion tokens: 404681
token throughput (completion token): 826.265 token/s
token throughput (prompt + completion token): 1740.145 token/s
RPS (request per second): 4.084 req/s
RPM (request per minute): 245.012 req/min
lmdeploy serve api_server /nvme/qa_test_models/mistralai/Mixtral-8x7B-Instruct-v0.1 --server-port 24555 --tp 2 --backend pytorch
concurrency: 128
elapsed_time: 345.898s
first_token latency(min, max, ave): 1.912s, 20.649s, 3.140s
number of prompt tokens: 491513
number of completion tokens: 474800
token throughput (completion token): 1372.658 token/s
token throughput (prompt + completion token): 2793.634 token/s
RPS (request per second): 5.782 req/s
RPM (request per minute): 346.923 req/min
concurrency: 256
elapsed_time: 324.822s
first_token latency(min, max, ave): 0.248s, 236.570s, 19.633s
number of prompt tokens: 491513
number of completion tokens: 474800
token throughput (completion token): 1461.722 token/s
token throughput (prompt + completion token): 2974.898 token/s
RPS (request per second): 6.157 req/s
RPM (request per minute): 369.433 req/min
   batch  num_prompts     RPS      RPM  FTL(ave)(s)  FTL(min)(s)  FTL(max)(s)  throughput(out tok/s)  throughput(total tok/s)
0    128       5000.0   7.328  439.652        2.406        1.746       13.746               1501.208                 3206.576
1    256       5000.0   7.530  451.825       17.863        0.299      402.773               1542.775                 3295.362
   batch  num_prompts     RPS      RPM  FTL(ave)(s)  FTL(min)(s)  FTL(max)(s)  throughput(out tok/s)  throughput(total tok/s)
0    128       5000.0  12.094  725.610        1.512        1.149        6.681               2428.294                 5176.386
1    256       5000.0  11.937  716.228       11.475        0.177       96.754               2396.895                 5109.453
internlm2-chat-20b and Meta-Llama-3-8B-Instruct are consistent with the 0.4.2 version's baseline.
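The FTL and throughput figures above presumably come from lmdeploy's own benchmark scripts. As an informal cross-check against a running api_server, first-token latency can also be sampled through the OpenAI-compatible /v1 endpoint; the sketch below assumes that endpoint and the `openai` Python client, uses placeholder values for the port, prompt and token budget, and treats one streamed chunk as roughly one token.

```python
# Rough first-token-latency probe against a running `lmdeploy serve api_server`.
# Assumes the OpenAI-compatible /v1 endpoint; port, prompt and max_tokens are placeholders.
import time

from openai import OpenAI

client = OpenAI(base_url='http://0.0.0.0:24555/v1', api_key='dummy')
model_name = client.models.list().data[0].id

start = time.perf_counter()
first_token_at = None
n_chunks = 0
stream = client.chat.completions.create(
    model=model_name,
    messages=[{'role': 'user', 'content': 'Write a short poem about GPUs.'}],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content or ''
    if delta:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_chunks += 1

elapsed = time.perf_counter() - start
print(f'first token latency: {first_token_at - start:.3f}s')
print(f'~{n_chunks} tokens in {elapsed:.3f}s, ~{n_chunks / elapsed:.1f} tok/s (single request)')
```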
Compared to the previous torch engine, as shown in https://github.com/zhulinJulia24/lmdeploy/actions/runs/9240064913, the evaluation accuracy doesn't degrade, does it?
After internal discussion: this PR didn't cause accuracy degradation compared to the previous version. We'll check whether something is wrong with the evaluation config.
pytorch/kernels/<device name>
pytorch/engine/devices/<device name>
pytorch/models/module_map.py
XXX_MODULE_MAP
PytorchEngineConfig
requirements
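Taken together, the items above sketch where a device backend plugs in: device-specific kernels, an engine device layer, a per-device module map, engine-config selection, and the matching requirements. As a purely hypothetical illustration (the map names and rewrite-module paths below are invented for this sketch, not lmdeploy's actual symbols), a device-specific module map might extend the default one like this:

```python
# Hypothetical illustration only: CUDA_MODULE_MAP / NEW_DEVICE_MODULE_MAP and the
# rewrite-module paths are placeholders, not lmdeploy's real names.

# pytorch/models/module_map.py: map original transformers modules to rewritten
# implementations; each device keeps its own XXX_MODULE_MAP.
CUDA_MODULE_MAP = {
    'transformers.models.llama.modeling_llama.LlamaAttention':
    'lmdeploy.pytorch.models.llama.PatchedLlamaAttention',
}

# A new device would add kernels under pytorch/kernels/<device name>, a backend
# under pytorch/engine/devices/<device name>, and override only the modules whose
# implementation differs on that device.
NEW_DEVICE_MODULE_MAP = dict(CUDA_MODULE_MAP)
NEW_DEVICE_MODULE_MAP.update({
    'transformers.models.llama.modeling_llama.LlamaAttention':
    'lmdeploy.pytorch.models.llama_new_device.PatchedLlamaAttention',
})
```

On the user side, the target device would presumably be selected through PytorchEngineConfig, with device-specific dependencies tracked in the corresponding requirements file.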