grimoire closed this pull request 3 months ago.
@zhulinJulia24
This PR brings in a minor reduction in inference speed. Could you help quantify it? The candidate models are llama3-8b, mixtral-moe-8x7b, internlm2-20b and llama3-70b.
The models' evaluation has to be performed, too. @zhulinJulia24
Do we really need to run a full evaluation? For example, could we verify basic correctness by comparing the outputs at temperature 0 with the results from transformers, and only run the evaluation when necessary? Of course, this is just my suggestion; if resources are sufficient, quickly running an evaluation task should not be a problem.
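To make the suggestion concrete, here is a minimal sketch of such a temperature-0 cross-check. It assumes lmdeploy's pipeline API with the PyTorch backend and uses transformers greedy decoding as the reference; the model path, prompts and token budget are placeholders, and chat-template differences between the two stacks are ignored.

```python
# Minimal sketch of a greedy (temperature -> 0) cross-check against transformers.
# Model path, prompts and max_new_tokens are placeholders; chat templating is ignored.
from transformers import AutoModelForCausalLM, AutoTokenizer

from lmdeploy import GenerationConfig, PytorchEngineConfig, pipeline

model_path = 'internlm/internlm2-chat-20b'               # placeholder
prompts = ['Briefly explain what paged attention is.']   # placeholder

# lmdeploy PyTorch engine, deterministic decoding (top_k=1 is effectively greedy).
pipe = pipeline(model_path, backend_config=PytorchEngineConfig(tp=1))
lm_outputs = pipe(prompts, gen_config=GenerationConfig(max_new_tokens=64, top_k=1))

# transformers reference, also greedy.
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
ref_model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map='auto', trust_remote_code=True)
inputs = tokenizer(prompts, return_tensors='pt').to(ref_model.device)
ref_ids = ref_model.generate(**inputs, max_new_tokens=64, do_sample=False)
ref_texts = tokenizer.batch_decode(
    ref_ids[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)

# Exact string equality is usually too strict (tiny kernel-level numeric
# differences can flip a late token), so compare prefixes or token overlap.
for out, ref in zip(lm_outputs, ref_texts):
    print('match' if out.text.strip()[:64] == ref.strip()[:64] else 'mismatch')
```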
https://github.com/InternLM/lmdeploy/actions/runs/9600920652
@grimoire @lvhan028 llama3 loses 10+ points in the precision evaluation, and internlm2-chat-20b loses 5+ points compared to HF transformers' precision.
All precision scores are improved compared to https://github.com/zhulinJulia24/lmdeploy/actions/runs/9240064913, which is the 0.4.2 version's precision.
lmdeploy serve api_server /mnt/models-new/llm_models/models--meta-llama--Meta-Llama-3-70B-Instruct/snapshots/0cac6d727e4cdf117e1bde11e4c7badd8b963919 --server-port 24555 --tp 4 --backend pytorch
concurrency: 256
elapsed_time: 483.240s
first_token latency(min, max, ave): 2.923s, 371.669s, 29.348s
number of prompt tokens: 447592
number of completion tokens: 404681
token throughput (completion token): 837.433 token/s
token throughput (prompt + completion token): 1763.666 token/s
RPS (request per second): 4.139 req/s
RPM (request per minute): 248.324 req/min
concurrency: 128
elapsed_time: 489.771s
first_token latency(min, max, ave): 0.465s, 17.343s, 3.991s
number of prompt tokens: 447592
number of completion tokens: 404681
token throughput (completion token): 826.265 token/s
token throughput (prompt + completion token): 1740.145 token/s
RPS (request per second): 4.084 req/s
RPM (request per minute): 245.012 req/min
lmdeploy serve api_server /nvme/qa_test_models/mistralai/Mixtral-8x7B-Instruct-v0.1 --server-port 24555 --tp 2 --backend pytorch
concurrency: 128
elapsed_time: 345.898s
first_token latency(min, max, ave): 1.912s, 20.649s, 3.140s
number of prompt tokens: 491513
number of completion tokens: 474800
token throughput (completion token): 1372.658 token/s
token throughput (prompt + completion token): 2793.634 token/s
RPS (request per second): 5.782 req/s
RPM (request per minute): 346.923 req/min
concurrency: 256
elapsed_time: 324.822s
first_token latency(min, max, ave): 0.248s, 236.570s, 19.633s
number of prompt tokens: 491513
number of completion tokens: 474800
token throughput (completion token): 1461.722 token/s
token throughput (prompt + completion token): 2974.898 token/s
RPS (request per second): 6.157 req/s
RPM (request per minute): 369.433 req/min
   batch  num_prompts     RPS      RPM  FTL(ave)(s)  FTL(min)(s)  FTL(max)(s)  throughput(out tok/s)  throughput(total tok/s)
0    128       5000.0   7.328  439.652        2.406        1.746       13.746               1501.208                 3206.576
1    256       5000.0   7.530  451.825       17.863        0.299      402.773               1542.775                 3295.362
   batch  num_prompts     RPS      RPM  FTL(ave)(s)  FTL(min)(s)  FTL(max)(s)  throughput(out tok/s)  throughput(total tok/s)
0    128       5000.0  12.094  725.610        1.512        1.149        6.681               2428.294                 5176.386
1    256       5000.0  11.937  716.228       11.475        0.177       96.754               2396.895                 5109.453
internlm2-chat-20b and Meta-Llama-3-8B-Instruct are consistent with the 0.4.2 version's baseline.
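The FTL and throughput figures above presumably come from lmdeploy's own benchmark scripts. As an informal cross-check against a running api_server, first-token latency can also be sampled through the OpenAI-compatible /v1 endpoint; the sketch below assumes that endpoint and the `openai` Python client, uses placeholder values for the port, prompt and token budget, and treats one streamed chunk as roughly one token.

```python
# Rough first-token-latency probe against a running `lmdeploy serve api_server`.
# Assumes the OpenAI-compatible /v1 endpoint; port, prompt and max_tokens are placeholders.
import time

from openai import OpenAI

client = OpenAI(base_url='http://0.0.0.0:24555/v1', api_key='dummy')
model_name = client.models.list().data[0].id

start = time.perf_counter()
first_token_at = None
n_chunks = 0
stream = client.chat.completions.create(
    model=model_name,
    messages=[{'role': 'user', 'content': 'Write a short poem about GPUs.'}],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content or ''
    if delta:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_chunks += 1

elapsed = time.perf_counter() - start
print(f'first token latency: {first_token_at - start:.3f}s')
print(f'~{n_chunks} tokens in {elapsed:.3f}s, ~{n_chunks / elapsed:.1f} tok/s (single request)')
```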
Compared to the previous torch engine, as shown in https://github.com/zhulinJulia24/lmdeploy/actions/runs/9240064913, the evaluation accuracy doesn't degrade, does it?
After internal discussion: this PR didn't cause accuracy degradation compared to the previous version. We'll check whether something is wrong with the evaluation config.
pytorch/kernels/<device name>
pytorch/engine/devices/<device name>
pytorch/models/module_map.py
XXX_MODULE_MAP
PytorchEngineConfig
requirements
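Taken together, the items above sketch where a device backend plugs in: device-specific kernels, an engine device layer, a per-device module map, engine-config selection, and the matching requirements. As a purely hypothetical illustration (the map names and rewrite-module paths below are invented for this sketch, not lmdeploy's actual symbols), a device-specific module map might extend the default one like this:

```python
# Hypothetical illustration only: CUDA_MODULE_MAP / NEW_DEVICE_MODULE_MAP and the
# rewrite-module paths are placeholders, not lmdeploy's real names.

# pytorch/models/module_map.py: map original transformers modules to rewritten
# implementations; each device keeps its own XXX_MODULE_MAP.
CUDA_MODULE_MAP = {
    'transformers.models.llama.modeling_llama.LlamaAttention':
    'lmdeploy.pytorch.models.llama.PatchedLlamaAttention',
}

# A new device would add kernels under pytorch/kernels/<device name>, a backend
# under pytorch/engine/devices/<device name>, and override only the modules whose
# implementation differs on that device.
NEW_DEVICE_MODULE_MAP = dict(CUDA_MODULE_MAP)
NEW_DEVICE_MODULE_MAP.update({
    'transformers.models.llama.modeling_llama.LlamaAttention':
    'lmdeploy.pytorch.models.llama_new_device.PatchedLlamaAttention',
})
```

On the user side, the target device would presumably be selected through PytorchEngineConfig, with device-specific dependencies tracked in the corresponding requirements file.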