pulkitmehtaworkmetacube opened this issue 2 weeks ago
Use the latest version.
@zhyncs Currently using LMDEPLOY_VERSION=0.6.2, Driver Version: 550.90.07, CUDA Version: 12.4
We tried the latest version. With TP=1 we are getting a CUDA out-of-memory error, and we observed that with TP=2 the memory on the 2nd GPU was not getting used. Please suggest.

nvidia-smi
Thu Nov 7 11:12:24 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:00:05.0 Off |                    0 |
| N/A   39C    P0             98W /  400W |   76649MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00000000:00:06.0 Off |                    0 |
| N/A   34C    P0             72W /  400W |   80733MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      3304      C   /opt/conda/envs/lmdeploy/bin/python3.8      76638MiB |
|    1   N/A  N/A      3304      C   /opt/conda/envs/lmdeploy/bin/python3.8      80722MiB |
+-----------------------------------------------------------------------------------------+
You may upgrade to v0.6.2.post1,
and append --log-level INFO when starting the service. Let's check the log.
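A minimal sketch of that suggestion, assuming v0.6.2.post1 is published on PyPI under the lmdeploy package name:

# upgrade to the patched release, then restart the server with verbose logging
$ pip install -U lmdeploy==0.6.2.post1
$ lmdeploy serve api_server meta-llama/Llama-3.1-70B-Instruct --tp 2 --log-level INFO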
$ lmdeploy serve api_server meta-llama/Llama-3.1-70B-Instruct --tp 2 --dtype float16 --log-level INFO
Fetching 42 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 42/42 [00:00<00:00, 4813.00it/s]
2024-11-07 12:04:20,426 - lmdeploy - INFO - async_engine.py:142 - input backend=turbomind, backend_config=TurbomindEngineConfig(dtype='float16', model_format=None, tp=2, session_len=None, max_batch_size=256, cache_max_entry_count=0.8, cache_chunk_size=-1, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
2024-11-07 12:04:20,426 - lmdeploy - INFO - async_engine.py:144 - input chat_template_config=None
2024-11-07 12:04:20,439 - lmdeploy - INFO - async_engine.py:154 - updated chat_template_config=ChatTemplateConfig(model_name='llama3_1', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability=None, stop_words=None)
2024-11-07 12:04:20,439 - lmdeploy - INFO - turbomind.py:301 - model_source: hf_model
2024-11-07 12:04:21,556 - lmdeploy - INFO - turbomind.py:200 - turbomind model config:
{
"model_config": {
"model_name": "",
"chat_template": "",
"model_arch": "LlamaForCausalLM",
"head_num": 64,
"kv_head_num": 8,
"hidden_units": 8192,
"vocab_size": 128256,
"num_layer": 80,
"inter_size": 28672,
"norm_eps": 1e-05,
"attn_bias": 0,
"start_id": 128000,
"end_id": 128009,
"size_per_head": 128,
"group_size": 128,
"weight_type": "float16",
"session_len": 131072,
"tp": 2,
"model_format": "hf",
"expert_num": 0,
"expert_inter_size": 0,
"experts_per_token": 0
},
"attention_config": {
"rotary_embedding": 128,
"rope_theta": 500000.0,
"max_position_embeddings": 131072,
"original_max_position_embeddings": 8192,
"rope_scaling_type": "llama3",
"rope_scaling_factor": 8.0,
"use_dynamic_ntk": 0,
"low_freq_factor": 1.0,
"high_freq_factor": 4.0,
"use_logn_attn": 0,
"cache_block_seq_len": 64
},
"lora_config": {
"lora_policy": "",
"lora_r": 0,
"lora_scale": 0.0,
"lora_max_wo_r": 0,
"lora_rank_pattern": "",
"lora_scale_pattern": ""
},
"engine_config": {
"dtype": "float16",
"model_format": null,
"tp": 2,
"session_len": null,
"max_batch_size": 256,
"cache_max_entry_count": 0.8,
"cache_chunk_size": -1,
"cache_block_seq_len": 64,
"enable_prefix_caching": false,
"quant_policy": 0,
"rope_scaling_factor": 0.0,
"use_logn_attn": false,
"download_dir": null,
"revision": null,
"max_prefill_token_num": 8192,
"num_tokens_per_iter": 8192,
"max_prefill_iters": 16
}
}
[TM][WARNING] [LlamaTritonModel] max_context_token_num is not set, default to 131072.
[TM][INFO] Model:
head_num: 64
kv_head_num: 8
size_per_head: 128
inter_size: 28672
num_layer: 80
vocab_size: 128256
attn_bias: 0
max_batch_size: 256
max_prefill_token_num: 8192
max_context_token_num: 131072
num_tokens_per_iter: 8192
max_prefill_iters: 16
session_len: 131072
cache_max_entry_count: 0.8
cache_block_seq_len: 64
cache_chunk_size: -1
enable_prefix_caching: 0
start_id: 128000
tensor_para_size: 2
pipeline_para_size: 1
enable_custom_all_reduce: 0
model_name:
model_dir:
quant_policy: 0
group_size: 128
expert_num: 0
expert_per_token: 0
moe_method: 1
[TM][INFO] TM_FUSE_SILU_ACT=1
2024-11-07 12:04:22,680 - lmdeploy - WARNING - turbomind.py:231 - get 965 model params
[TM][INFO] [LlamaWeight
[TM][INFO] [LlamaWeight
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] [BlockManager] block_size = 10 MB
[TM][INFO] [BlockManager] max_block_count = 534
[TM][INFO] [BlockManager] chunk_size = 534
[TM][INFO] [BlockManager] block_size = 10 MB
[TM][INFO] [BlockManager] max_block_count = 534
[TM][INFO] [BlockManager] chunk_size = 534
[TM][WARNING] No enough blocks for session_len (131072), session_len truncated to 34176.
[TM][INFO] LlamaBatch
GPU USAGE
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 10138 C /opt/conda/envs/lmdeploy/bin/python3.8 75230MiB |
| 1 N/A N/A 10138 C /opt/conda/envs/lmdeploy/bin/python3.8 80716MiB |
+-----------------------------------------------------------------------------------------+
Thu Nov 7 12:20:43 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:00:05.0 Off | 0 |
| N/A 39C P0 99W / 400W | 75241MiB / 81920MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:00:06.0 Off | 0 |
| N/A 35C P0 72W / 400W | 80727MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 10138 C /opt/conda/envs/lmdeploy/bin/python3.8 75230MiB |
| 1 N/A N/A 10138 C /opt/conda/envs/lmdeploy/bin/python3.8 80716MiB |
+-----------------------------------------------------------------------------------------+
Thu Nov 7 12:20:49 2024
Any luck anyone?
So you get this log just by starting the server, without sending any requests? This is more likely caused by a bug in v0.6.2 (which is not present in v0.6.2.post1).
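For reference, if memory pressure persists after upgrading, the KV-cache budget and session length can be capped explicitly; a hedged sketch, assuming the api_server flags mirror the TurbomindEngineConfig fields shown in the log above (cache_max_entry_count, session_len):

# reserve a smaller fraction of free memory for the KV cache and cap the context length
$ lmdeploy serve api_server meta-llama/Llama-3.1-70B-Instruct \
    --tp 2 --dtype float16 --log-level INFO \
    --cache-max-entry-count 0.5 \
    --session-len 32768

Lowering --cache-max-entry-count trades maximum concurrent context for headroom on both GPUs, which also avoids the "session_len truncated" warning seen above.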
Checklist
Describe the bug
We are trying to deploy Llama-3.1-70B on GCP with the specs below:
GPU - 2 x NVIDIA A100 80GB
Machine Type - a2-ultragpu-2g (350 GB RAM)
SSD - 2 TB

Command we tried for deployment: lmdeploy serve api_server meta-llama/Llama-3.1-70B-Instruct --tp 2

During deployment, we get stuck at:
Fetching 42 files: 100%|████████████████████████████████████████████████████████████████████| 42/42 [00:00<00:00, 12190.21it/s]
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo

It gets stuck here for hours without any other error. We checked GPU and CPU usage as well. Please suggest.
$ free -g
               total        used        free      shared  buff/cache   available
Mem:             334           0         200           0         133         330
Swap:              0           0           0
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:00:05.0 Off |                    0 |
| N/A   35C    P0             94W /  400W |   76649MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00000000:00:06.0 Off |                    0 |
| N/A   34C    P0             69W /  400W |   80733MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
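Side note on the hang itself: when startup stalls after the GEMM warning with one GPU at 100% and the other idle, more verbose logs can help locate where multi-GPU initialization is stuck. A sketch combining LMDeploy's INFO logging with NCCL's standard debug variable (NCCL_DEBUG is an NCCL environment variable, not an LMDeploy option):

# surface NCCL and engine logs during the tp=2 startup
$ NCCL_DEBUG=INFO lmdeploy serve api_server meta-llama/Llama-3.1-70B-Instruct --tp 2 --log-level INFO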
Reproduction
lmdeploy serve api_server meta-llama/Llama-3.1-70B-Instruct --tp 2
Environment
Error traceback