InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Deployment of Llama3.1-70b getting stuck #2724

Open pulkitmehtaworkmetacube opened 2 weeks ago

pulkitmehtaworkmetacube commented 2 weeks ago


Describe the bug

We are trying to deploy Llama-3.1-70B on GCP with the specs below:

- GPU: 2 x NVIDIA A100 80GB
- Machine type: a2-ultragpu-2g (350 GB RAM)
- SSD: 2 TB

Command we tried for deployment:

lmdeploy serve api_server meta-llama/Llama-3.1-70B-Instruct --tp 2

During deployment, we get stuck at:

Fetching 42 files: 100%|████████████████████████████████████████████████████████████████████| 42/42 [00:00<00:00, 12190.21it/s]
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo

It stays stuck here for hours without any other error. We checked GPU and CPU usage as well. Please suggest.

$ free -g
               total        used        free      shared  buff/cache   available
Mem:             334           0         200           0         133         330
Swap:              0           0           0

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:00:05.0 Off |                    0 |
| N/A   35C    P0             94W /  400W |   76649MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00000000:00:06.0 Off |                    0 |
| N/A   34C    P0             69W /  400W |   80733MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

Reproduction

lmdeploy serve api_server meta-llama/Llama-3.1-70B-Instruct --tp 2

Environment

GPU - 2 x NVIDIA A100 80GB
Machine Type - a2-ultragpu-2g (350 GB RAM)
SSD - 2TB

Error traceback

During deployment, we get stuck at:
Fetching 42 files: 100%|████████████████████████████████████████████████████████████████████| 42/42 [00:00<00:00, 12190.21it/s]
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
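One generic way to see where a hung server process is blocked is to dump its Python stacks with py-spy. This is a diagnostic sketch only, not a step taken in this thread; the PID should be whatever `nvidia-smi` or `ps` reports for the lmdeploy process:

```bash
# Install the sampling profiler into the same environment (assumption: pip-based env)
pip install py-spy

# Dump the current Python stacks of the hung server process
# (replace 3304 with the actual PID reported by nvidia-smi or ps)
py-spy dump --pid 3304
```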
zhyncs commented 2 weeks ago

Use the latest version.

pulkitmehtaworkmetacube commented 2 weeks ago

@zhyncs Currently using:
LMDEPLOY_VERSION=0.6.2
Driver Version: 550.90.07
CUDA Version: 12.4

pulkitmehtaworkmetacube commented 2 weeks ago

We tried the latest version. With --tp 1 we are getting a CUDA out-of-memory error. We observed that when we ran with --tp 2, memory from the 2nd GPU was not getting used. Please suggest; a memory-tuning sketch follows the snapshots below.

nvidia-smi
Thu Nov  7 11:12:24 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:00:05.0 Off |                    0 |
| N/A   39C    P0             98W /  400W |   76649MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00000000:00:06.0 Off |                    0 |
| N/A   34C    P0             72W /  400W |   80733MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      3304      C   /opt/conda/envs/lmdeploy/bin/python3.8      76638MiB |
|    1   N/A  N/A      3304      C   /opt/conda/envs/lmdeploy/bin/python3.8      80722MiB |
+-----------------------------------------------------------------------------------------+
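If the goal is simply to fit weights plus KV cache into 80 GB per GPU, the api_server exposes memory-related options that could be tried; the values below are illustrative guesses, not settings validated anywhere in this thread:

```bash
# Illustrative only: shrink the KV-cache share of free GPU memory and cap
# the session length so allocation does not exhaust 80 GB per card.
lmdeploy serve api_server meta-llama/Llama-3.1-70B-Instruct \
    --tp 2 \
    --cache-max-entry-count 0.5 \
    --session-len 8192
```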

lvhan028 commented 2 weeks ago

Please upgrade to v0.6.2.post1 and append --log-level INFO when starting the service. Let's check the log.
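For reference, the upgrade plus a restart with verbose logging might look like this, assuming a pip-based install and that the post release is published on PyPI:

```bash
# Pin to the patched release (assumption: installed via pip)
pip install --upgrade "lmdeploy==0.6.2.post1"

# Restart the server with verbose logging to see where startup stalls
lmdeploy serve api_server meta-llama/Llama-3.1-70B-Instruct \
    --tp 2 \
    --log-level INFO
```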

jatin-wald commented 2 weeks ago

$ lmdeploy serve api_server meta-llama/Llama-3.1-70B-Instruct --tp 2 --dtype float16 --log-level INFO

Fetching 42 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 42/42 [00:00<00:00, 4813.00it/s]
2024-11-07 12:04:20,426 - lmdeploy - INFO - async_engine.py:142 - input backend=turbomind, backend_config=TurbomindEngineConfig(dtype='float16', model_format=None, tp=2, session_len=None, max_batch_size=256, cache_max_entry_count=0.8, cache_chunk_size=-1, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
2024-11-07 12:04:20,426 - lmdeploy - INFO - async_engine.py:144 - input chat_template_config=None
2024-11-07 12:04:20,439 - lmdeploy - INFO - async_engine.py:154 - updated chat_template_onfig=ChatTemplateConfig(model_name='llama3_1', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability=None, stop_words=None)
2024-11-07 12:04:20,439 - lmdeploy - INFO - turbomind.py:301 - model_source: hf_model
2024-11-07 12:04:21,556 - lmdeploy - INFO - turbomind.py:200 - turbomind model config:

{ "model_config": { "model_name": "", "chat_template": "", "model_arch": "LlamaForCausalLM", "head_num": 64, "kv_head_num": 8, "hidden_units": 8192, "vocab_size": 128256, "num_layer": 80, "inter_size": 28672, "norm_eps": 1e-05, "attn_bias": 0, "start_id": 128000, "end_id": 128009, "size_per_head": 128, "group_size": 128, "weight_type": "float16", "session_len": 131072, "tp": 2, "model_format": "hf", "expert_num": 0, "expert_inter_size": 0, "experts_per_token": 0 }, "attention_config": { "rotary_embedding": 128, "rope_theta": 500000.0, "max_position_embeddings": 131072, "original_max_position_embeddings": 8192, "rope_scaling_type": "llama3", "rope_scaling_factor": 8.0, "use_dynamic_ntk": 0, "low_freq_factor": 1.0, "high_freq_factor": 4.0, "use_logn_attn": 0, "cache_block_seq_len": 64 }, "lora_config": { "lora_policy": "", "lora_r": 0, "lora_scale": 0.0, "lora_max_wo_r": 0, "lora_rank_pattern": "", "lora_scale_pattern": "" }, "engine_config": { "dtype": "float16", "model_format": null, "tp": 2, "session_len": null, "max_batch_size": 256, "cache_max_entry_count": 0.8, "cache_chunk_size": -1, "cache_block_seq_len": 64, "enable_prefix_caching": false, "quant_policy": 0, "rope_scaling_factor": 0.0, "use_logn_attn": false, "download_dir": null, "revision": null, "max_prefill_token_num": 8192, "num_tokens_per_iter": 8192, "max_prefill_iters": 16 } } [TM][WARNING] [LlamaTritonModel] max_context_token_num is not set, default to 131072. [TM][INFO] Model: head_num: 64 kv_head_num: 8 size_per_head: 128 inter_size: 28672 num_layer: 80 vocab_size: 128256 attn_bias: 0 max_batch_size: 256 max_prefill_token_num: 8192 max_context_token_num: 131072 num_tokens_per_iter: 8192 max_prefill_iters: 16 session_len: 131072 cache_max_entry_count: 0.8 cache_block_seq_len: 64 cache_chunk_size: -1 enable_prefix_caching: 0 start_id: 128000 tensor_para_size: 2 pipeline_para_size: 1 enable_custom_all_reduce: 0 model_name: model_dir: quant_policy: 0 group_size: 128 expert_num: 0 expert_per_token: 0 moe_method: 1

[TM][INFO] TM_FUSE_SILU_ACT=1
2024-11-07 12:04:22,680 - lmdeploy - WARNING - turbomind.py:231 - get 965 model params
[TM][INFO] [LlamaWeight::prepare] workspace size: 469762048

[TM][INFO] [LlamaWeight::prepare] workspace size: 469762048

[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] [BlockManager] block_size = 10 MB
[TM][INFO] [BlockManager] max_block_count = 534
[TM][INFO] [BlockManager] chunk_size = 534
[TM][INFO] [BlockManager] block_size = 10 MB
[TM][INFO] [BlockManager] max_block_count = 534
[TM][INFO] [BlockManager] chunk_size = 534
[TM][WARNING] No enough blocks for session_len (131072), session_len truncated to 34176.
[TM][INFO] LlamaBatch::Start()
[TM][INFO] LlamaBatch::Start()
[TM][INFO] [Gemm2] Tuning sequence: 8, 16, 32, 48, 64, 96, 128, 192, 256, 384, 512, 768, 1024, 1536, 2048, 3072, 4096, 6144, 8192
[TM][INFO] [Gemm2] 8
[TM][INFO] [Gemm2] 16
[TM][INFO] [Gemm2] 32
[TM][INFO] [Gemm2] 48
[TM][INFO] [Gemm2] 64
[TM][INFO] [Gemm2] 96
[TM][INFO] [Gemm2] 128
[TM][INFO] [Gemm2] 192
[TM][INFO] [Gemm2] 256
[TM][INFO] [Gemm2] 384
[TM][INFO] [Gemm2] 512
[TM][INFO] [Gemm2] 768
[TM][INFO] [Gemm2] 1024
[TM][INFO] [Gemm2] 1536
[TM][INFO] [Gemm2] 2048
[TM][INFO] [Gemm2] 3072
[TM][INFO] [InternalThreadEntry] stop requested.
[TM][INFO] [InternalThreadEntry] stop requested.
[TM][WARNING] pointermapping does not have information of ptr at 0x2d43d5f200.

GPU USAGE

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     10138      C   /opt/conda/envs/lmdeploy/bin/python3.8      75230MiB |
|    1   N/A  N/A     10138      C   /opt/conda/envs/lmdeploy/bin/python3.8      80716MiB |
+-----------------------------------------------------------------------------------------+
Thu Nov  7 12:20:43 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:00:05.0 Off |                    0 |
| N/A   39C    P0             99W /  400W |   75241MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00000000:00:06.0 Off |                    0 |
| N/A   35C    P0             72W /  400W |   80727MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     10138      C   /opt/conda/envs/lmdeploy/bin/python3.8      75230MiB |
|    1   N/A  N/A     10138      C   /opt/conda/envs/lmdeploy/bin/python3.8      80716MiB |
+-----------------------------------------------------------------------------------------+
Thu Nov  7 12:20:49 2024
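To check whether the server ever finishes initializing, one can poll the OpenAI-compatible endpoint from another shell; this assumes the default port 23333 and is not a step shown in the thread:

```bash
# Poll the server on its default port (23333); a JSON model list in the
# response means startup completed and the API is serving requests.
curl -s http://localhost:23333/v1/models
```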
jatin-wald commented 1 week ago

Any luck anyone?

lzhangzz commented 1 week ago

So you get this log just by starting the server, without sending any requests? This is more likely caused by a bug in v0.6.2 (as opposed to v0.6.2.post1).
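To confirm which build is actually running, checking the installed version is a quick sanity test (a generic check, not quoted from the thread):

```bash
# Verify the installed lmdeploy build; the thread suggests the fix landed in 0.6.2.post1
python -c "import lmdeploy; print(lmdeploy.__version__)"
pip show lmdeploy | grep -i version
```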