InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Does internlm-chat-20B also OOM with deepspeed on 2x V100 32G? #519

Closed (deku0818 closed this 6 months ago)

deku0818 commented 10 months ago

Describe the bug

Running internlm-chat-20B with deepspeed failed.

Reproduction

deepspeed --module --include="localhost:0,1" lmdeploy.pytorch.chat /data/internlm-chat-20b --max_new_tokens 64 --temperture 0.8 --top_p 0.95 --seed 0

Error traceback

[2023-09-30 10:02:28,920] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-30 10:02:33,309] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-09-30 10:02:33,309] [INFO] [runner.py:570:main] cmd = /data/langchain-ChatGLM/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --module --enable_each_rank_log=None lmdeploy.pytorch.chat /data/internlm-chat-20b --max_new_tokens 64 --temperture 0.8 --top_p 0.95 --seed 0
[2023-09-30 10:02:36,582] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-30 10:02:40,166] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-09-30 10:02:40,166] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-09-30 10:02:40,166] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-09-30 10:02:40,166] [INFO] [launch.py:163:main] dist_world_size=2
[2023-09-30 10:02:40,166] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2023-09-30 10:02:44,815] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-30 10:02:44,844] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Worker 0 logging to chat.log
Worker 1 logging to chat.log.True
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [02:04<00:00, 24.88s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [02:04<00:00, 24.90s/it]
Traceback (most recent call last):
Traceback (most recent call last):
  File "/data/langchain-ChatGLM/lib/python3.8/runpy.py", line 194, in _run_module_as_main
  File "/data/langchain-ChatGLM/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data/langchain-ChatGLM/lib/python3.8/runpy.py", line 87, in _run_code
    return _run_code(code, main_globals, None,
  File "/data/langchain-ChatGLM/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/langchain-ChatGLM/lib/python3.8/site-packages/lmdeploy/pytorch/chat.py", line 212, in <module>
    exec(code, run_globals)
  File "/data/langchain-ChatGLM/lib/python3.8/site-packages/lmdeploy/pytorch/chat.py", line 212, in <module>
    cli()
  File "/data/langchain-ChatGLM/lib/python3.8/site-packages/lmdeploy/pytorch/chat.py", line 208, in cli
    cli()
  File "/data/langchain-ChatGLM/lib/python3.8/site-packages/lmdeploy/pytorch/chat.py", line 208, in cli
    fire.Fire(main)
  File "/data/langchain-ChatGLM/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    fire.Fire(main)
  File "/data/langchain-ChatGLM/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/data/langchain-ChatGLM/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/data/langchain-ChatGLM/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/data/langchain-ChatGLM/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component, remaining_args = _CallAndUpdateTrace(
  File "/data/langchain-ChatGLM/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/data/langchain-ChatGLM/lib/python3.8/site-packages/lmdeploy/pytorch/chat.py", line 146, in main
    component = fn(*varargs, **kwargs)
  File "/data/langchain-ChatGLM/lib/python3.8/site-packages/lmdeploy/pytorch/chat.py", line 146, in main
    model: PreTrainedModel = accel_model(model,
  File "/data/langchain-ChatGLM/lib/python3.8/site-packages/lmdeploy/pytorch/model.py", line 101, in accel_model
    model: PreTrainedModel = accel_model(model,
  File "/data/langchain-ChatGLM/lib/python3.8/site-packages/lmdeploy/pytorch/model.py", line 101, in accel_model
    model = model.cuda(gpu_id)
  File "/data/langchain-ChatGLM/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2042, in cuda
    model = model.cuda(gpu_id)
  File "/data/langchain-ChatGLM/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2042, in cuda
    return super().cuda(*args, **kwargs)
  File "/data/langchain-ChatGLM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 905, in cuda
    return super().cuda(*args, **kwargs)
  File "/data/langchain-ChatGLM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 905, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/data/langchain-ChatGLM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    return self._apply(lambda t: t.cuda(device))
  File "/data/langchain-ChatGLM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/data/langchain-ChatGLM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/data/langchain-ChatGLM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/data/langchain-ChatGLM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/data/langchain-ChatGLM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/data/langchain-ChatGLM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 820, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/data/langchain-ChatGLM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/data/langchain-ChatGLM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 905, in <lambda>
    param_applied = fn(param)
  File "/data/langchain-ChatGLM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 905, in <lambda>
    return self._apply(lambda t: t.cuda(device))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 1; 31.75 GiB total capacity; 31.01 GiB already allocated; 27.69 MiB free; 31.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    return self._apply(lambda t: t.cuda(device))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 31.75 GiB total capacity; 31.01 GiB already allocated; 27.69 MiB free; 31.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-09-30 10:05:24,508] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 262276
[2023-09-30 10:05:24,509] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 262277
[2023-09-30 10:05:24,511] [ERROR] [launch.py:321:sigkill_handler] ['/data/langchain-ChatGLM/bin/python', '-u', '-m', 'lmdeploy.pytorch.chat', '--local_rank=1', '/data/internlm-chat-20b', '--max_new_tokens', '64', '--temperture', '0.8', '--top_p', '0.95', '--seed', '0'] exits with return code = 1
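
For context on why both ranks fail at the same point: the traceback shows lmdeploy.pytorch.chat loading the full checkpoint in each process and then moving it to a single GPU via model.cuda(gpu_id) in accel_model, so each 32 GiB V100 has to hold the entire model. A rough back-of-the-envelope estimate of the fp16 weight size alone (the ~20B parameter count is an approximation):

# Rough memory estimate for the internlm-chat-20b weights alone (fp16).
# Ignores activations, the KV cache, and CUDA context overhead.
num_params = 20e9          # ~20B parameters (approximate)
bytes_per_param = 2        # fp16 / bf16
weights_gib = num_params * bytes_per_param / 1024**3
print(f"~{weights_gib:.1f} GiB of weights")  # ~37.3 GiB, more than the 31.75 GiB per V100

Since deepspeed here only acts as a launcher for two independent processes, and each process calls model.cuda(gpu_id) on the full model, no tensor or pipeline parallelism takes place and both ranks run out of memory in the same way.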
deku0818 commented 10 months ago

[screenshot]

lvhan028 commented 9 months ago

We are going to upgrade lmdeploy.pytorch with the pytorch-poc branch. @grimoire please verify whether this issue also occurs on the pytorch-poc branch.

shahrukhx01 commented 9 months ago

@deku0818 any update on this with the pytorch-poc branch? I am having the same issue.

shahrukhx01 commented 9 months ago

Hi @lvhan028, I am trying to run LeoLM/leo-hessianai-13b-chat-bilingual on 2 RTX 3090s. I am running the following command after cloning the repo and checking out the pytorch-poc branch; however, I get the same error as well. Is there anything I can try differently?

[screenshot: command and error output]

I have also tried the pytorch_poc module but get a different error there :( [screenshot: error output]
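
As a generic workaround outside lmdeploy (a minimal sketch, not lmdeploy's tensor parallelism; it assumes transformers and accelerate are installed), a model that does not fit on a single card can be sharded across both GPUs with device_map="auto":

# Minimal sketch: shard a large HF model across all visible GPUs with accelerate.
# This is plain transformers, not the lmdeploy pytorch/pytorch_poc code path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LeoLM/leo-hessianai-13b-chat-bilingual"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # halves memory compared to fp32
    device_map="auto",          # let accelerate spread layers over GPU 0 and 1
)
inputs = tokenizer("Hello, who are you?", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))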

lvhan028 commented 9 months ago

@grimoire @wangruohui please follow up

wangruohui commented 9 months ago

Hello @shahrukhx01,

This specific error is due to the missing _turbomind binary module. You need to compile turbomind following these instructions: https://github.com/InternLM/lmdeploy/blob/main/docs/en/build.md
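
If it helps, a quick sanity check that the compiled extension is visible to Python (a minimal sketch; it assumes the native _turbomind library is loaded when lmdeploy.turbomind is imported, which may differ between versions):

# If the C++/CUDA extension was not built or installed, this import is expected
# to fail with an ImportError mentioning _turbomind.
try:
    import lmdeploy.turbomind  # noqa: F401
    print("turbomind extension found")
except ImportError as err:
    print("turbomind extension missing:", err)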

shahrukhx01 commented 9 months ago

This specific error is due to the missing _turbomind binary module.

@wangruohui Thanks for your response. Could you please tell me: for tensor parallelism with PyTorch, should I use lmdeploy.pytorch_poc.chat or lmdeploy.pytorch.chat? Thanks!

wangruohui commented 9 months ago

Hi,

Based on our schedule, pytorch_poc will replace the original pytorch module soon.

gesanqiu commented 9 months ago

Hi @wangruohui, @lvhan028, I use the openmmlab/lmdeploy:latest docker image (published at https://hub.docker.com/r/openmmlab/lmdeploy/tags) and compiled the lmdeploy pytorch-poc branch, but I get an OOM issue on an A40 48G. The command I ran is below:

root@dell:/workdir/lmdeploy/build# CUDA_VISIBLE_DEVICES=3 python3 -m lmdeploy.serve.gradio.app /workdir/lmdeploy_models/internlm-chat-20b/ 10.0.232.24 8018 1 1
WARNING: Can not find tokenizer.json. It may take long time to initialize the tokenizer.
WARNING: Can not find tokenizer.json. It may take long time to initialize the tokenizer.
[TM][INFO] Barrier(1)
[TM][INFO] Barrier(1)
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] NCCL group_id = 0
[TM][INFO] [LlamaCacheManager] max_entry_count = 4
[TM][INFO] [LlamaCacheManager] chunk_size = 1
[TM][INFO] [LlamaCacheManager][allocate]
[TM][INFO] [LlamaCacheManager][allocate] malloc 1
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/lmdeploy/turbomind/turbomind.py", line 198, in _create_model_instance
    model_inst = self.tm_model.model.create_model_instance(
RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /opt/tritonserver/lmdeploy/src/turbomind/utils/allocator.h:231

server is gonna mount on: http://10.0.232.24:8018
Running on local URL:  http://10.0.232.24:8018

When I use TP=2, it seems internlm-chat-20b needs 51.2G of GPU memory, even with the max_batch_size=1 and instance_num=1 config. Is it normal for a 20B model to cost 51.2G of memory? [screenshot]

Also, when I run benchmark/profile_throughput.py, even concurrency=8 num_prompt2=2000 tp=2 causes OOM on the A40 48G. When I decrease concurrency to 4, the memory utilization is as shown in the following screenshot: [screenshot] Compared to internlm-chat-7b, the increase in memory usage is really sharp. Is this because internlm-chat-20b has a 16K context length? Are there any parameters to cut down the memory utilization, like limiting the context length, since there isn't such a long-context case in benchmark/profile_throughput.py?
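
For a rough sense of why the 16K context makes memory grow so sharply, here is a back-of-the-envelope KV-cache estimate (the architecture numbers below are assumed approximations for internlm-20b, not values taken from this thread; the formula is the usual 2 tensors x layers x hidden size x sequence length x bytes per value):

# Rough fp16 KV-cache size for one cached sequence at full context length.
# num_layers and hidden_size are assumed/approximate for internlm-20b.
num_layers = 60        # assumed
hidden_size = 5120     # assumed
seq_len = 16 * 1024    # 16K context
bytes_per_value = 2    # fp16

kv_bytes = 2 * num_layers * hidden_size * seq_len * bytes_per_value  # K and V
print(f"~{kv_bytes / 1024**3:.1f} GiB per sequence")  # roughly 18-19 GiB

Even split across two GPUs with TP=2, a handful of preallocated cache entries at that length (the log above shows [LlamaCacheManager] max_entry_count = 4) quickly dominates the weight memory, which would be consistent with the sharp growth compared with internlm-chat-7b. Lowering the session length and the cache entry count in the engine configuration should reduce this footprint; check the docs for the exact knob names in your version.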

G4mot commented 6 months ago

Maybe this helps fix the OOM? https://github.com/InternLM/lmdeploy/pull/1040

lvhan028 commented 6 months ago

The pytorch engine has been refactored. Please use the latest version.
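
For anyone landing here later, a minimal sketch of running the refactored PyTorch engine with tensor parallelism through the pipeline API (names follow recent lmdeploy documentation; check the docs for your installed version):

# Minimal sketch for recent lmdeploy versions with the refactored PyTorch engine.
from lmdeploy import pipeline, PytorchEngineConfig

pipe = pipeline(
    "internlm/internlm-chat-20b",
    backend_config=PytorchEngineConfig(tp=2),  # tensor parallelism across 2 GPUs
)
print(pipe(["Hello, please introduce yourself."]))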