InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Deployment of Llama3.1-70b getting stuck #2724

Open pulkitmehtaworkmetacube opened 2 weeks ago

pulkitmehtaworkmetacube commented 2 weeks ago


Describe the bug

We are trying to deploy Llama-3.1-70B on GCP with the specs below:

- GPU: 2 x NVIDIA A100 80GB
- Machine type: a2-ultragpu-2g (350 GB RAM)
- SSD: 2 TB

Command we tried for deployment:

lmdeploy serve api_server meta-llama/Llama-3.1-70B-Instruct --tp 2

During deployment, we get stuck at:

Fetching 42 files: 100%|████████████████████████████████████████████████████████████████████| 42/42 [00:00<00:00, 12190.21it/s]
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo

It stays stuck here for hours without any other error. We checked GPU and CPU usage as well. Please suggest.

$ free -g
               total        used        free      shared  buff/cache   available
Mem:             334           0         200           0         133         330
Swap:              0           0           0

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:00:05.0 Off |                    0 |
| N/A   35C    P0             94W /  400W |   76649MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00000000:00:06.0 Off |                    0 |
| N/A   34C    P0             69W /  400W |   80733MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

Reproduction

lmdeploy serve api_server meta-llama/Llama-3.1-70B-Instruct --tp 2

Environment

GPU - 2 x NVIDIA A100 80GB
Machine Type - a2-ultragpu-2g (350 GB RAM)
SSD - 2TB

Error traceback

During deployment, we get stuck at:
Fetching 42 files: 100%|████████████████████████████████████████████████████████████████████| 42/42 [00:00<00:00, 12190.21it/s]
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
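One generic way to see where a hung server process is blocked is to dump its Python stacks with py-spy. This is a diagnostic sketch only, not a step taken in this thread; the PID should be whatever `nvidia-smi` or `ps` reports for the lmdeploy process:

```bash
# Install the sampling profiler into the same environment (assumption: pip-based env)
pip install py-spy

# Dump the current Python stacks of the hung server process
# (replace 3304 with the actual PID reported by nvidia-smi or ps)
py-spy dump --pid 3304
```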
zhyncs commented 2 weeks ago

Use the latest version.

pulkitmehtaworkmetacube commented 2 weeks ago

@zhyncs Currently using:
LMDEPLOY_VERSION=0.6.2
Driver Version: 550.90.07
CUDA Version: 12.4

pulkitmehtaworkmetacube commented 2 weeks ago

We tried the latest version. With --tp 1 we are getting a CUDA out-of-memory error. We observed that when we ran with --tp 2, memory from the 2nd GPU was not getting used. Please suggest; a memory-tuning sketch follows the snapshots below.

nvidia-smi
Thu Nov  7 11:12:24 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:00:05.0 Off |                    0 |
| N/A   39C    P0             98W /  400W |   76649MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00000000:00:06.0 Off |                    0 |
| N/A   34C    P0             72W /  400W |   80733MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      3304      C   /opt/conda/envs/lmdeploy/bin/python3.8      76638MiB |
|    1   N/A  N/A      3304      C   /opt/conda/envs/lmdeploy/bin/python3.8      80722MiB |
+-----------------------------------------------------------------------------------------+
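If the goal is simply to fit weights plus KV cache into 80 GB per GPU, the api_server exposes memory-related options that could be tried; the values below are illustrative guesses, not settings validated anywhere in this thread:

```bash
# Illustrative only: shrink the KV-cache share of free GPU memory and cap
# the session length so allocation does not exhaust 80 GB per card.
lmdeploy serve api_server meta-llama/Llama-3.1-70B-Instruct \
    --tp 2 \
    --cache-max-entry-count 0.5 \
    --session-len 8192
```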

lvhan028 commented 2 weeks ago

Please upgrade to v0.6.2.post1 and append --log-level INFO when starting the service. Let's check the log.
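For reference, the upgrade plus a restart with verbose logging might look like this, assuming a pip-based install and that the post release is published on PyPI:

```bash
# Pin to the patched release (assumption: installed via pip)
pip install --upgrade "lmdeploy==0.6.2.post1"

# Restart the server with verbose logging to see where startup stalls
lmdeploy serve api_server meta-llama/Llama-3.1-70B-Instruct \
    --tp 2 \
    --log-level INFO
```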

jatin-wald commented 2 weeks ago

$ lmdeploy serve api_server meta-llama/Llama-3.1-70B-Instruct --tp 2 --dtype float16 --log-level INFO

Fetching 42 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 42/42 [00:00<00:00, 4813.00it/s]
2024-11-07 12:04:20,426 - lmdeploy - INFO - async_engine.py:142 - input backend=turbomind, backend_config=TurbomindEngineConfig(dtype='float16', model_format=None, tp=2, session_len=None, max_batch_size=256, cache_max_entry_count=0.8, cache_chunk_size=-1, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
2024-11-07 12:04:20,426 - lmdeploy - INFO - async_engine.py:144 - input chat_template_config=None
2024-11-07 12:04:20,439 - lmdeploy - INFO - async_engine.py:154 - updated chat_template_onfig=ChatTemplateConfig(model_name='llama3_1', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability=None, stop_words=None)
2024-11-07 12:04:20,439 - lmdeploy - INFO - turbomind.py:301 - model_source: hf_model
2024-11-07 12:04:21,556 - lmdeploy - INFO - turbomind.py:200 - turbomind model config:

{ "model_config": { "model_name": "", "chat_template": "", "model_arch": "LlamaForCausalLM", "head_num": 64, "kv_head_num": 8, "hidden_units": 8192, "vocab_size": 128256, "num_layer": 80, "inter_size": 28672, "norm_eps": 1e-05, "attn_bias": 0, "start_id": 128000, "end_id": 128009, "size_per_head": 128, "group_size": 128, "weight_type": "float16", "session_len": 131072, "tp": 2, "model_format": "hf", "expert_num": 0, "expert_inter_size": 0, "experts_per_token": 0 }, "attention_config": { "rotary_embedding": 128, "rope_theta": 500000.0, "max_position_embeddings": 131072, "original_max_position_embeddings": 8192, "rope_scaling_type": "llama3", "rope_scaling_factor": 8.0, "use_dynamic_ntk": 0, "low_freq_factor": 1.0, "high_freq_factor": 4.0, "use_logn_attn": 0, "cache_block_seq_len": 64 }, "lora_config": { "lora_policy": "", "lora_r": 0, "lora_scale": 0.0, "lora_max_wo_r": 0, "lora_rank_pattern": "", "lora_scale_pattern": "" }, "engine_config": { "dtype": "float16", "model_format": null, "tp": 2, "session_len": null, "max_batch_size": 256, "cache_max_entry_count": 0.8, "cache_chunk_size": -1, "cache_block_seq_len": 64, "enable_prefix_caching": false, "quant_policy": 0, "rope_scaling_factor": 0.0, "use_logn_attn": false, "download_dir": null, "revision": null, "max_prefill_token_num": 8192, "num_tokens_per_iter": 8192, "max_prefill_iters": 16 } } [TM][WARNING] [LlamaTritonModel] max_context_token_num is not set, default to 131072. [TM][INFO] Model: head_num: 64 kv_head_num: 8 size_per_head: 128 inter_size: 28672 num_layer: 80 vocab_size: 128256 attn_bias: 0 max_batch_size: 256 max_prefill_token_num: 8192 max_context_token_num: 131072 num_tokens_per_iter: 8192 max_prefill_iters: 16 session_len: 131072 cache_max_entry_count: 0.8 cache_block_seq_len: 64 cache_chunk_size: -1 enable_prefix_caching: 0 start_id: 128000 tensor_para_size: 2 pipeline_para_size: 1 enable_custom_all_reduce: 0 model_name: model_dir: quant_policy: 0 group_size: 128 expert_num: 0 expert_per_token: 0 moe_method: 1

[TM][INFO] TM_FUSE_SILU_ACT=1
2024-11-07 12:04:22,680 - lmdeploy - WARNING - turbomind.py:231 - get 965 model params
[TM][INFO] [LlamaWeight::prepare] workspace size: 469762048

[TM][INFO] [LlamaWeight::prepare] workspace size: 469762048

[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] [BlockManager] block_size = 10 MB
[TM][INFO] [BlockManager] max_block_count = 534
[TM][INFO] [BlockManager] chunk_size = 534
[TM][INFO] [BlockManager] block_size = 10 MB
[TM][INFO] [BlockManager] max_block_count = 534
[TM][INFO] [BlockManager] chunk_size = 534
[TM][WARNING] No enough blocks for session_len (131072), session_len truncated to 34176.
[TM][INFO] LlamaBatch::Start()
[TM][INFO] LlamaBatch::Start()
[TM][INFO] [Gemm2] Tuning sequence: 8, 16, 32, 48, 64, 96, 128, 192, 256, 384, 512, 768, 1024, 1536, 2048, 3072, 4096, 6144, 8192
[TM][INFO] [Gemm2] 8
[TM][INFO] [Gemm2] 16
[TM][INFO] [Gemm2] 32
[TM][INFO] [Gemm2] 48
[TM][INFO] [Gemm2] 64
[TM][INFO] [Gemm2] 96
[TM][INFO] [Gemm2] 128
[TM][INFO] [Gemm2] 192
[TM][INFO] [Gemm2] 256
[TM][INFO] [Gemm2] 384
[TM][INFO] [Gemm2] 512
[TM][INFO] [Gemm2] 768
[TM][INFO] [Gemm2] 1024
[TM][INFO] [Gemm2] 1536
[TM][INFO] [Gemm2] 2048
[TM][INFO] [Gemm2] 3072
[TM][INFO] [InternalThreadEntry] stop requested.
[TM][INFO] [InternalThreadEntry] stop requested.
[TM][WARNING] pointermapping does not have information of ptr at 0x2d43d5f200.

GPU USAGE

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     10138      C   /opt/conda/envs/lmdeploy/bin/python3.8      75230MiB |
|    1   N/A  N/A     10138      C   /opt/conda/envs/lmdeploy/bin/python3.8      80716MiB |
+-----------------------------------------------------------------------------------------+
Thu Nov  7 12:20:43 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:00:05.0 Off |                    0 |
| N/A   39C    P0             99W /  400W |   75241MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00000000:00:06.0 Off |                    0 |
| N/A   35C    P0             72W /  400W |   80727MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     10138      C   /opt/conda/envs/lmdeploy/bin/python3.8      75230MiB |
|    1   N/A  N/A     10138      C   /opt/conda/envs/lmdeploy/bin/python3.8      80716MiB |
+-----------------------------------------------------------------------------------------+
Thu Nov  7 12:20:49 2024
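To check whether the server ever finishes initializing, one can poll the OpenAI-compatible endpoint from another shell; this assumes the default port 23333 and is not a step shown in the thread:

```bash
# Poll the server on its default port (23333); a JSON model list in the
# response means startup completed and the API is serving requests.
curl -s http://localhost:23333/v1/models
```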
jatin-wald commented 1 week ago

Any luck anyone?

lzhangzz commented 1 week ago

So you get this log just by starting the server, without sending any requests? This is more likely caused by a bug in v0.6.2 (as opposed to v0.6.2.post1).
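To confirm which build is actually running, checking the installed version is a quick sanity test (a generic check, not quoted from the thread):

```bash
# Verify the installed lmdeploy build; the thread suggests the fix landed in 0.6.2.post1
python -c "import lmdeploy; print(lmdeploy.__version__)"
pip show lmdeploy | grep -i version
```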