Closed zxs789 closed 3 months ago
Does this happen under normal llama3 weights (from huggingface)?
Thanks, I will try normal llama weights.
tensorrt_llm/parameter.py
    def set_all_one_dummy(self):
        self.value = np.ones(self._shape, trt_dtype_to_np(self._dtype))

tensorrt_llm/models/modeling_utils.py
    for name, param in model.named_parameters():
        param.set_all_one_dummy()
When I tried setting all weights to one while building the engine, I got the same divide-by-zero error.
To confirm one more thing: does TensorRT-LLM support the H20 device?
Thanks, we're trying to figure out why. Can you give us a specific command that went wrong? This includes converting and building commands.
I just tried the commands on H20, and it works well:
python examples/llama/convert_checkpoint.py --model_dir ./tmp/llama-v2-13b-hf/ --output_dir ./tmp/llama2_13b/ckpt
trtllm-build --checkpoint_dir ./tmp/llama2_13b/ckpt/ --gemm_plugin float16 --output_dir ./tmp/llama2_13b/engine
logs
[06/04/2024-09:36:55] [TRT] [I] Serialized 152374 bytes of compilation cache.
[06/04/2024-09:36:55] [TRT] [I] Serialized 8 timing cache entries
[06/04/2024-09:36:55] [TRT-LLM] [I] Timing cache serialized to model.cache
[06/04/2024-09:36:55] [TRT-LLM] [I] Serializing engine to ./tmp/llama2_13b/engine/rank0.engine...
[06/04/2024-09:38:05] [TRT-LLM] [I] Engine serialized. Total time: 00:01:10
[06/04/2024-09:38:06] [TRT-LLM] [I] Total time of building all engines: 00:04:13
I tried chatglm2_6b_tp1 and llama2_13b_tp2 on H20 GPU, and both encountered a divide-by-zero issue. I used random weights, and when I changed all the weights to 1, this problem still occurred. Here are the steps to reproduce it:
System Info: Device: H20, Driver: 535.161.07, cuda-toolkit: 12.2.0
python env: nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 tensorrt 10.0.1 tensorrt-cu12-bindings 10.0.1 tensorrt-cu12-libs 10.0.1 tensorrt-llm 0.11.0.dev2024052800 torch 2.3.0
1. Create the config file:
convert_config/chatglm2_6b/float16/1-gpu/config.json
{
"architecture": "ChatGLMForCausalLM",
"dtype": "float16",
"logits_dtype": "float32",
"num_hidden_layers": 28,
"num_attention_heads": 32,
"num_key_value_heads": 2,
"hidden_size": 4096,
"intermediate_size": 13696,
"norm_epsilon": 1e-05,
"vocab_size": 65024,
"position_embedding_type": "rope_gptj",
"max_position_embeddings": 32768,
"hidden_act": "swiglu",
"use_parallel_embedding": false,
"embedding_sharding_dim": 0,
"share_embedding_table": false,
"quantization": {
"quant_algo": null,
"kv_cache_quant_algo": null
},
"mapping": {
"world_size": 1,
"tp_size": 1,
"pp_size": 1
},
"chatglm_version": "chatglm2",
"add_bias_linear": false,
"add_qkv_bias": true,
"apply_query_key_layer_scaling": false,
"apply_residual_connection_post_layernorm": false,
"rmsnorm": true,
"rope_ratio": 1.0
}
2. Build the engine with model_config:
build.sh
#!/bin/bash
model=chatglm2_6b
tp=1
dtype=fp16
model_config=./convert_config/$model/float16/${tp}-gpu/config.json
output_dir=./engines/$model/trt_engines/fp16/${tp}-gpu
max_batch_size=1
max_input_len=1024
max_output_len=200
trtllm-build \
--model_config $model_config \
--gpt_attention_plugin float16 \
--remove_input_padding enable \
--context_fmha enable \
--gemm_plugin float16 \
--max_batch_size $max_batch_size \
--output_dir $output_dir \
--context_fmha_fp32_acc enable \
--enable_xqa enable \
--max_input_len $max_input_len \
--max_output_len $max_output_len \
--multi_block_mode enable \
--paged_kv_cache disable \
--remove_input_padding enable \
--strongly_typed \
--use_custom_all_reduce enable \
--use_fused_mlp \
--workers $tp
3. Run the benchmark:
python3 benchmarks/python/benchmark.py --model chatglm2_6b --mode plugin --batch_size 1 --input_output_len "1024,200" --warm_up 1 --num_runs 4 --engine_dir ./engines/chatglm2_6b/trt_engines/fp16/1-gpu/
4. Below is the code where I changed all the weights to 1:
diff --git a/tensorrt_llm/models/modeling_utils.py b/tensorrt_llm/models/modeling_utils.py
index 5ea7e04..d571434 100644
--- a/tensorrt_llm/models/modeling_utils.py
+++ b/tensorrt_llm/models/modeling_utils.py
@@ -1186,4 +1186,8 @@ def load_model(
                    from_pruned=is_checkpoint_pruned)
     model.load(weights, from_pruned=is_checkpoint_pruned)
+    for name, param in model.named_parameters():
+        param.set_all_one_dummy()
+        print(name, param.shape, param.print_value())
+
     return model
diff --git a/tensorrt_llm/parameter.py b/tensorrt_llm/parameter.py
index 42dc42b..2e92e63 100644
--- a/tensorrt_llm/parameter.py
+++ b/tensorrt_llm/parameter.py
@@ -140,6 +140,12 @@ class Parameter:
         self.value = v

+    def print_value(self):
+        return self._value
+
+    def set_all_one_dummy(self):
+        self.value = np.ones(self._shape, trt_dtype_to_np(self._dtype))
+
     def _get_weights(self) -> trt.Weights:
         if isinstance(self._value, Tensor):
             self._value.producer.__class__ = trt.IConstantLayer
Thanks again.
Yes, I can build the engine successfully, but when I run benchmarks/python/benchmark.py I encounter an error.
Have you tried the weights/ckpt from HuggingFace? I remember that I ran benchmark.py fine with Chatglm3 yesterday.
Not yet, I am still downloading the real weights from Hugging Face. Can you reproduce this error using random weights?
No, I can't. My commands:
python examples/llama/convert_checkpoint.py --model_dir ./tmp/llama-v2-13b-hf/ --output_dir ./tmp/llama2_13b
trtllm-build --model_config ./tmp/llama2_13b/config.json --gemm_plugin float16 --output_dir ./tmp/llama2_13b/engine2 --gpt_attention_plugin float16 --remove_input_padding enable --context_fmha enable --gemm_plugin float16 --max_batch_size 1 --context_fmha_fp32_acc enable --enable_xqa enable --max_input_len 1024 --max_output_len 200 --multi_block_mode enable --paged_kv_cache disable --remove_input_padding enable --use_custom_all_reduce enable --use_fused_mlp
python3 benchmarks/python/benchmark.py --model llama_13b --mode plugin --input_output_len "1024,200" --warm_up 1 --num_runs 4 --engine_dir ./tmp/llama2_13b/engine2/ --csv --batch_size 1
logs
Allocated 126.00 MiB for execution context memory.
model_name,world_size,num_heads,num_kv_heads,num_layers,hidden_size,vocab_size,precision,batch_size,gpu_weights_percent,input_length,output_length,gpu_peak_mem(gb),build_time(s),tokens_per_sec,percentile95(ms),percentile99(ms),latency(ms),compute_cap,quantization,generation_time(ms),total_generated_tokens,generation_tokens_per_second
/usr/local/lib/python3.10/dist-packages/torch/nested/__init__.py:166: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
return _nested.nested_tensor(
[06/04/2024-10:05:30] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024060400
llama_13b,1,40,40,40,5120,32000,float16,1,1.0,1024,200,25.991,0,45.73,4403.05,4403.05,4373.095,sm90,QuantMode.0,4162.918,199.0,47.803
Note that my CUDA env is:
NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4
with the trtllm 24.04 docker container.
It is also recommended to update TRT-LLM to the latest version (it will be pushed to the main branch today or tomorrow).
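For example, the latest dev wheel can usually be picked up from NVIDIA's PyPI index (a rough sketch; adjust it for your own CUDA and Python setup):
# upgrade to the newest pre-release TensorRT-LLM wheel from the NVIDIA index
pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com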
These are my commands with random weights. Can you run these commands on H20? If they work, I can update my driver and CUDA version, and the TensorRT-LLM version.
Hi, it also works well with the random weights + chatglm2_6b. It may be an issue with the driver and CUDA versions.
logs
trtllm-build --model_config ./tmp/chatglm2_6b/ckpt/config.json --gemm_plugin float16 --output_dir ./tmp/chatglm2_6b/engine2 --gpt_attention_plugin float16 --remove_input_padding enable --context_fmha enable --gemm_plugin float16 --max_batch_size 1 --context_fmha_fp32_acc enable --enable_xqa enable --max_input_len 1024 --max_output_len 200 --multi_block_mode enable --paged_kv_cache disable --remove_input_padding enable --use_custom_all_reduce enable --use_fused_mlp
python3 benchmarks/python/benchmark.py --model chatglm2_6b --mode plugin --input_output_len "1024,200" --warm_up 1 --num_runs 4 --engine_dir ./tmp/chatglm2_6b/engine2/ --csv --batch_size 1
...
Allocated 117.50 MiB for execution context memory.
model_name,world_size,num_heads,num_kv_heads,num_layers,hidden_size,vocab_size,precision,batch_size,gpu_weights_percent,input_length,output_length,gpu_peak_mem(gb),build_time(s),tokens_per_sec,percentile95(ms),percentile99(ms),latency(ms),compute_cap,quantization,generation_time(ms),total_generated_tokens,generation_tokens_per_second
/usr/local/lib/python3.10/dist-packages/torch/nested/__init__.py:166: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
return _nested.nested_tensor(
[06/04/2024-10:28:33] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024060400
chatglm2_6b,1,32,2,28,4096,65024,float16,1,1.0,1024,200,12.433,0,105.65,1904.281,1908.865,1893.025,sm90,QuantMode.0,1796.244,199.0,110.78
Thanks a lot, I can try updating the driver and CUDA version.
Can I ask for your pip list? Thanks a lot.
I used the clean trtllm-24.04 container
pip list
Package Version
------------------------- --------------------------
absl-py 2.1.0
accelerate 0.30.1
aiohttp 3.9.3
aiosignal 1.3.1
annotated-types 0.6.0
apex 0.1
argon2-cffi 23.1.0
argon2-cffi-bindings 21.2.0
asttokens 2.4.1
astunparse 1.6.3
async-timeout 4.0.3
attrs 23.2.0
audioread 3.0.1
beautifulsoup4 4.12.3
bleach 6.1.0
blis 0.7.11
build 1.2.1
cachetools 5.3.3
catalogue 2.0.10
certifi 2024.2.2
cffi 1.16.0
charset-normalizer 3.3.2
click 8.1.7
cloudpathlib 0.16.0
cloudpickle 3.0.0
cmake 3.29.0.1
colored 2.2.4
coloredlogs 15.0.1
comm 0.2.2
confection 0.1.4
contourpy 1.2.1
cuda-python 12.4.0rc7+3.ge75c8a9.dirty
cudf 24.2.0
cudnn 1.1.2
cugraph 24.2.0
cugraph-dgl 24.2.0
cugraph-service-client 24.2.0
cugraph-service-server 24.2.0
cuml 24.2.0
cupy-cuda12x 13.0.0
cycler 0.12.1
cymem 2.0.8
Cython 3.0.10
dask 2024.1.1
dask-cuda 24.2.0
dask-cudf 24.2.0
datasets 2.19.2
debugpy 1.8.1
decorator 5.1.1
defusedxml 0.7.1
diffusers 0.28.0
dill 0.3.8
distributed 2024.1.1
dm-tree 0.1.8
einops 0.7.0
evaluate 0.4.2
exceptiongroup 1.2.0
execnet 2.0.2
executing 2.0.1
expecttest 0.1.3
fastjsonschema 2.19.1
fastrlock 0.8.2
filelock 3.13.3
flash-attn 2.4.2
fonttools 4.51.0
frozenlist 1.4.1
fsspec 2024.2.0
gast 0.5.4
google-auth 2.29.0
google-auth-oauthlib 0.4.6
graphsurgeon 0.4.6
grpcio 1.62.1
h5py 3.10.0
huggingface-hub 0.23.2
humanfriendly 10.0
hypothesis 5.35.1
idna 3.6
igraph 0.11.4
importlib_metadata 7.0.2
iniconfig 2.0.0
intel-openmp 2021.4.0
ipykernel 6.29.4
ipython 8.21.0
ipython-genutils 0.2.0
janus 1.0.0
jedi 0.19.1
Jinja2 3.1.3
joblib 1.3.2
json5 0.9.24
jsonschema 4.21.1
jsonschema-specifications 2023.12.1
jupyter_client 8.6.1
jupyter_core 5.7.2
jupyter-tensorboard 0.2.0
jupyterlab 2.3.2
jupyterlab_pygments 0.3.0
jupyterlab-server 1.2.0
jupytext 1.16.1
kiwisolver 1.4.5
langcodes 3.3.0
lark 1.1.9
lazy_loader 0.4
librosa 0.10.1
lightning-thunder 0.1.0
lightning-utilities 0.11.2
llvmlite 0.42.0
locket 1.0.0
looseversion 1.3.0
Markdown 3.6
markdown-it-py 3.0.0
MarkupSafe 2.1.5
matplotlib 3.8.4
matplotlib-inline 0.1.6
mdit-py-plugins 0.4.0
mdurl 0.1.2
mistune 3.0.2
mkl 2021.1.1
mkl-devel 2021.1.1
mkl-include 2021.1.1
mock 5.1.0
mpi4py 3.1.5
mpmath 1.3.0
msgpack 1.0.8
multidict 6.0.5
multiprocess 0.70.16
murmurhash 1.0.10
nbclient 0.10.0
nbconvert 7.16.3
nbformat 5.10.4
nest-asyncio 1.6.0
networkx 2.6.3
ninja 1.11.1.1
notebook 6.4.10
numba 0.59.0+1.g20ae2b56c
numpy 1.24.4
nvfuser 0.1.6a0+a684e2a
nvidia-dali-cuda120 1.36.0
nvidia-modelopt 0.11.2
nvidia-nvimgcodec-cu12 0.2.0.7
nvidia-pyindex 1.0.9
nvtx 0.2.5
oauthlib 3.2.2
onnx 1.16.0
opencv 4.7.0
opt-einsum 3.3.0
optimum 1.20.0
optree 0.11.0
packaging 23.2
pandas 1.5.3
pandocfilters 1.5.1
parso 0.8.4
partd 1.4.1
pexpect 4.9.0
pillow 10.2.0
pip 24.0
platformdirs 4.2.0
pluggy 1.4.0
ply 3.11
polygraphy 0.49.9
pooch 1.8.1
preshed 3.0.9
prettytable 3.10.0
prometheus_client 0.20.0
prompt-toolkit 3.0.43
protobuf 4.24.4
psutil 5.9.4
ptyprocess 0.7.0
PuLP 2.8.0
pure-eval 0.2.2
pyarrow 14.0.1
pyarrow-hotfix 0.6
pyasn1 0.6.0
pyasn1_modules 0.4.0
pybind11 2.12.0
pybind11_global 2.12.0
pycocotools 2.0+nv0.8.0
pycparser 2.22
pydantic 2.6.4
pydantic_core 2.16.3
Pygments 2.17.2
pylibcugraph 24.2.0
pylibcugraphops 24.2.0
pylibraft 24.2.0
pynvjitlink 0.1.13
pynvml 11.5.0
pyparsing 3.1.2
pyproject_hooks 1.1.0
pytest 8.1.1
pytest-flakefinder 1.1.0
pytest-rerunfailures 14.0
pytest-shard 0.1.2
pytest-xdist 3.5.0
python-dateutil 2.9.0.post0
python-hostlist 1.23.0
pytorch-quantization 2.1.2
pytorch-triton 3.0.0+a9bc1a364
pytz 2024.1
PyYAML 6.0.1
pyzmq 25.1.2
raft-dask 24.2.0
rapids-dask-dependency 24.2.0a0
referencing 0.34.0
regex 2023.12.25
requests 2.32.3
requests-oauthlib 2.0.0
rich 13.7.1
rmm 24.2.0
rpds-py 0.18.0
rsa 4.9
safetensors 0.4.3
scikit-learn 1.2.0
scipy 1.12.0
Send2Trash 1.8.2
sentencepiece 0.2.0
setuptools 68.2.2
six 1.16.0
smart-open 6.4.0
sortedcontainers 2.4.0
soundfile 0.12.1
soupsieve 2.5
soxr 0.3.7
spacy 3.7.4
spacy-legacy 3.0.12
spacy-loggers 1.0.5
sphinx_glpi_theme 0.6
srsly 2.4.8
stack-data 0.6.3
StrEnum 0.4.15
sympy 1.12
tabulate 0.9.0
tbb 2021.12.0
tblib 3.0.0
tensorboard 2.9.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tensorrt 10.0.1
tensorrt-llm 0.11.0.dev2024060400
terminado 0.18.1
texttable 1.7.0
thinc 8.2.3
threadpoolctl 3.3.0
thriftpy2 0.4.17
tinycss2 1.2.1
tokenizers 0.19.1
toml 0.10.2
tomli 2.0.1
toolz 0.12.1
torch 2.3.0a0+6ddf5cf85e.nv24.4
torch-tensorrt 2.3.0a0
torchdata 0.7.1a0
torchtext 0.17.0a0
torchvision 0.18.0a0
tornado 6.4
tqdm 4.66.2
traitlets 5.9.0
transformer-engine 1.5.0+6a9edc3
transformers 4.40.2
treelite 4.0.0
typer 0.9.4
types-dataclasses 0.6.6
typing_extensions 4.10.0
ucx-py 0.36.0
uff 0.6.9
urllib3 1.26.18
wasabi 1.1.2
wcwidth 0.2.13
weasel 0.3.4
webencodings 0.5.1
Werkzeug 3.0.2
wheel 0.43.0
xdoctest 1.0.2
xgboost 2.0.3
xxhash 3.4.1
yarl 1.9.4
zict 3.0.0
zipp 3.17.0
Do you mean this Docker image: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3?
That should be OK. nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3 is better.
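For example, something along these lines (a sketch; the mount path and extra flags are only illustrative):
# pull the 24.05 TRT-LLM Triton image and open an interactive shell with GPU access
docker pull nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3
docker run --rm -it --gpus all --ipc=host -v $(pwd):/workspace -w /workspace nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3 bash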
OK, I can pull this Docker image and update my driver version, thanks again.
Hi there,
Can I ask how to get the 0604 version: tensorrt-llm 0.11.0.dev2024060400?
Thanks a lot!
BR
Hi, you can download it here: https://pypi.nvidia.com/tensorrt-llm/tensorrt_llm-0.11.0.dev2024060400-cp310-cp310-linux_x86_64.whl
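For example (a sketch; it assumes a Python 3.10, CUDA 12 environment):
# install that specific dev wheel directly from the URL
pip3 install https://pypi.nvidia.com/tensorrt-llm/tensorrt_llm-0.11.0.dev2024060400-cp310-cp310-linux_x86_64.whl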
Thanks for your reply. Using tensorrt-llm 0.11.0.dev2024060400 with Triton 24.04 and 24.05, have you met the "tensorrt not found" issue? I have installed tensorrt and it is in the pip list:
No, I didn't meet this error. I installed TensorRT from the release tarball TensorRT-10.0.1.6.Linux.x86_64-gnu.cuda-12.4.tar.gz.
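Roughly like this (a sketch only; the exact directory layout and wheel name depend on the TensorRT package):
# unpack the TensorRT tarball, expose its libraries, and install the Python bindings
tar -xzf TensorRT-10.0.1.6.Linux.x86_64-gnu.cuda-12.4.tar.gz
export LD_LIBRARY_PATH=$PWD/TensorRT-10.0.1.6/lib:$LD_LIBRARY_PATH
pip3 install TensorRT-10.0.1.6/python/tensorrt-*-cp310-none-linux_x86_64.whl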
@hijkzzz It ran successfully on H20 with driver 550.54.15 + CUDA 12.4.1 + torch 2.4.0a0+07cecf4.nv24.5-cp310 + trtllm 0604.
System Info
Device: H20, Driver: 535.161.07, cuda-toolkit: 12.2.0
python env: nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 tensorrt 10.0.1 tensorrt-cu12-bindings 10.0.1 tensorrt-cu12-libs 10.0.1 tensorrt-llm 0.11.0.dev2024052800
Who can help?
@ncomly-nvidia @kaiyux
Reproduction
1. convert_config/llama_13b/float16/2-gpu/config.json
{ "architecture": "LlamaForCausalLM", "dtype": "float16", "logits_dtype": "float32", "vocab_size": 32000, "max_position_embeddings": 4096, "hidden_size": 5120, "num_hidden_layers": 40, "num_attention_heads": 40, "num_key_value_heads": 40, "head_size": 128, "hidden_act": "silu", "intermediate_size": 13824, "norm_epsilon": 1e-05, "position_embedding_type": "rope_gpt_neox", "use_parallel_embedding": false, "embedding_sharding_dim": 0, "share_embedding_table": false, "mapping": { "world_size": 2, "tp_size": 2, "pp_size": 1 }, "quantization": { "quant_algo": null, "kv_cache_quant_algo": null, "group_size": 128, "smoothquant_val": null, "has_zero_point": false, "pre_quant_scale": false, "exclude_modules": [ "lm_head" ] }, "kv_dtype": "float16", "rotary_scaling": null, "moe_normalization_mode": null, "rotary_base": 10000.0, "moe_num_experts": 0, "moe_top_k": 0, "moe_tp_mode": 2, "attn_bias": false, "disable_weight_only_quant_plugin": false, "mlp_bias": false }
Expected behavior
Expect the benchmark to print performance data normally.
Actual behavior
Additional notes
The same command and script run normally on an A100, but report a divide-by-zero error on the H20. Is it because the NVIDIA CUDA version is too low?
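A quick way to double-check the stack on the H20 is to query the driver and compute capability (a sketch; the H20 should report compute_cap 9.0, i.e. sm_90):
# confirm the GPU name, driver version, and compute capability the system sees
nvidia-smi --query-gpu=name,driver_version,compute_cap --format=csv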