Closed: deku0818 closed this issue 6 months ago
We are going to upgrade lmdeploy.pytorch with the pytorch-poc branch.
@grimoire Please verify whether this is an issue in the pytorch-poc branch.
@deku0818 any update on this with the pytorch-poc branch? I am having the same issue.
Hi @lvhan028, I am trying to run LeoLM/leo-hessianai-13b-chat-bilingual on 2 RTX 3090s. I am running the following command after cloning the repo and checking out the pytorch-poc branch; however, I get the same error as well. Is there anything I can try differently?
I have also tried the pytorch_poc module but get a different error there :(
@grimoire @wangruohui please follow up
Hello @shahrukhx01,
This specific error is due to the missing _turbomind binary module. You need to compile turbomind following this instruction: https://github.com/InternLM/lmdeploy/blob/main/docs/en/build.md
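For reference, the build steps in that document look roughly like the following. This is a minimal sketch, not the authoritative procedure; the exact prerequisites and flags are in the linked build.md, and the generate.sh / editable-install steps here are my reading of it, so verify against the doc for your environment.

git clone https://github.com/InternLM/lmdeploy.git
cd lmdeploy
mkdir build && cd build
sh ../generate.sh                  # generate the CMake build configuration
make -j$(nproc) && make install    # build and install the _turbomind binary module
cd .. && pip install -e .          # let the Python package pick up the built module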
@wangruohui Thanks for your response. Could you please tell me, for tensor parallelism with PyTorch, should I use lmdeploy.pytorch_poc.chat or lmdeploy.pytorch.chat? Thanks!
Hi,
Based on our schedule, pytorch_poc will replace the original pytorch module soon.
Hi @wangruohui, @lvhan028,
I use the openmmlab/lmdeploy:latest docker image (published at https://hub.docker.com/r/openmmlab/lmdeploy/tags) and compiled the lmdeploy pytorch-poc branch, but I get an OOM issue on an A40 48G. The command I run is below:
root@dell:/workdir/lmdeploy/build# CUDA_VISIBLE_DEVICES=3 python3 -m lmdeploy.serve.gradio.app /workdir/lmdeploy_models/internlm-chat-20b/ 10.0.232.24 8018 1 1
WARNING: Can not find tokenizer.json. It may take long time to initialize the tokenizer.
WARNING: Can not find tokenizer.json. It may take long time to initialize the tokenizer.
[TM][INFO] Barrier(1)
[TM][INFO] Barrier(1)
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] NCCL group_id = 0
[TM][INFO] [LlamaCacheManager] max_entry_count = 4
[TM][INFO] [LlamaCacheManager] chunk_size = 1
[TM][INFO] [LlamaCacheManager][allocate]
[TM][INFO] [LlamaCacheManager][allocate] malloc 1
Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.8/dist-packages/lmdeploy/turbomind/turbomind.py", line 198, in _create_model_instance
model_inst = self.tm_model.model.create_model_instance(
RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /opt/tritonserver/lmdeploy/src/turbomind/utils/allocator.h:231
server is gonna mount on: http://10.0.232.24:8018
Running on local URL: http://10.0.232.24:8018
When I use TP=2, it seems internlm-chat-20b needs 51.2G of GPU memory, even with the max_batch_size=1 and instance_num=1 config. Is it normal for a 20B model to cost 51.2G of memory?
Also, when I run benchmark/profile_throughput.py, even concurrency=8 num_prompts=2000 tp=2 causes OOM on the A40 48G. When I decrease it to concurrency=4, the memory utilization is as follows:
Compared to internlm-chat-7b, the increase in memory usage is really sharp. Is this because internlm-chat-20b has a 16K context length? Are there any parameters to cut down memory utilization, such as limiting the context length, since there isn't such a long-context case in benchmark/profile_throughput.py?
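For context, the knobs I am asking about are the ones in the converted turbomind workspace's config file, e.g. triton_models/weights/config.ini. A sketch of what I mean, assuming that layout; the key names and values below are my guess at what would cap memory, not a verified fix:

# triton_models/weights/config.ini in the converted model directory
session_len = 4096            # cap the context length instead of the model's 16K default
cache_max_entry_count = 1     # fewer cached sequences, smaller KV cache
max_batch_size = 1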
May this help fix the OOM? https://github.com/InternLM/lmdeploy/pull/1040
The pytorch engine has been refactored. Please use the latest version.
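For anyone hitting this later, the refactored engine is exposed through the pipeline Python API in recent releases. A minimal sketch, assuming the model path and tp value from this thread; check the current docs for exact defaults:

from lmdeploy import pipeline, PytorchEngineConfig

# Load the model with the refactored PyTorch engine, tensor-parallel across 2 GPUs.
pipe = pipeline('/data/internlm-chat-20b',
                backend_config=PytorchEngineConfig(tp=2))
print(pipe(['Hello, please introduce yourself.']))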
Checklist
Describe the bug
Running internlm-chat-20B with deepspeed failed.
Reproduction
deepspeed --module --include="localhost:0,1" lmdeploy.pytorch.chat /data/internlm-chat-20b --max_new_tokens 64 --temperature 0.8 --top_p 0.95 --seed 0
Error traceback