NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

H20 Using random weights to infer llama2-13B results in a divide-by-zero error. #1717

Closed zxs789 closed 3 months ago

zxs789 commented 3 months ago

System Info

Device: H20, Driver: 535.161.07, cuda-toolkit: 12.2.0

python env: nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 tensorrt 10.0.1 tensorrt-cu12-bindings 10.0.1 tensorrt-cu12-libs 10.0.1 tensorrt-llm 0.11.0.dev2024052800

Who can help?

@ncomly-nvidia @kaiyux

Information

Tasks

Reproduction

1. Create the config file:

convert_config/llama_13b/float16/2-gpu/config.json
{
    "architecture": "LlamaForCausalLM",
    "dtype": "float16",
    "logits_dtype": "float32",
    "vocab_size": 32000,
    "max_position_embeddings": 4096,
    "hidden_size": 5120,
    "num_hidden_layers": 40,
    "num_attention_heads": 40,
    "num_key_value_heads": 40,
    "head_size": 128,
    "hidden_act": "silu",
    "intermediate_size": 13824,
    "norm_epsilon": 1e-05,
    "position_embedding_type": "rope_gpt_neox",
    "use_parallel_embedding": false,
    "embedding_sharding_dim": 0,
    "share_embedding_table": false,
    "mapping": {
        "world_size": 2,
        "tp_size": 2,
        "pp_size": 1
    },
    "quantization": {
        "quant_algo": null,
        "kv_cache_quant_algo": null,
        "group_size": 128,
        "smoothquant_val": null,
        "has_zero_point": false,
        "pre_quant_scale": false,
        "exclude_modules": [
            "lm_head"
        ]
    },
    "kv_dtype": "float16",
    "rotary_scaling": null,
    "moe_normalization_mode": null,
    "rotary_base": 10000.0,
    "moe_num_experts": 0,
    "moe_top_k": 0,
    "moe_tp_mode": 2,
    "attn_bias": false,
    "disable_weight_only_quant_plugin": false,
    "mlp_bias": false
}

2. Build the engines and run the benchmark with run_build_llama13b.sh:
#!/bin/bash

### 1. generate config json file

model=llama_13b
tp=2
dtype=fp16

model_config=./convert_config/$model/float16/${tp}-gpu/config.json
output_dir=./engines/$model/trt_engines/fp16/${tp}-gpu

### 2. generate engine with batches and input lens
max_output_len=200
declare -a input_lengths=(1024)

for ((i=0; i<${#input_lengths[@]}; i++)); do
  max_input_len=${input_lengths[$i]}

  case $max_input_len in
    1024) test_batch_sizes=(1 2 4 8 16 32 64 128 256) ;;
    *) echo "Invalid input length"; exit 1 ;;
  esac

  for b in "${test_batch_sizes[@]}"; do
    max_batch_size=$b
    echo "Running trtllm-build with max_input_len=$max_input_len, max_output_len=$max_output_len, max_batch_size=$max_batch_size"

    trtllm-build \
      --model_config $model_config \
      --gemm_plugin auto \
      --max_batch_size $max_batch_size \
      --max_input_len $max_input_len \
      --max_output_len $max_output_len \
      --output_dir $output_dir

    if [ $? -ne 0 ]; then
      echo "trtllm-build failed for max_input_len=$max_input_len, max_batch_size=$max_batch_size"
    fi

    ### 3. begin test cases
    in=$max_input_len
    out=200

    work_dir=`pwd`
    engine_dir=./engines/$model/trt_engines/fp16/${tp}-gpu/

    echo "Running gptSessionBenchmark with input_len=$in, output_len=$out, batch_size=$b"
    mpirun --allow-run-as-root -n ${tp} python3 benchmarks/python/benchmark.py \
                                           --model ${model} \
                                           --mode plugin \
                                           --batch_size  "${b}" \
                                           --input_output_len "${in},${out}" \
                                           --warm_up 1 \
                                           --num_runs 4 \
                                           --engine_dir $engine_dir \
                                           --csv

    if [ $? -ne 0 ]; then
      echo "gptSessionBenchmark failed for input_len=$in, output_len=$out, batch_size=$b"
    fi
    ### 4. delete engine dir
    rm -rf $engine_dir
  done  
done
echo "All runs completed successfully."

Expected behavior

The benchmark should print performance data normally.

Actual behavior

[06/03/2024-16:26:43] [TRT-LLM] [I] Engine serialized. Total time: 00:00:05
[06/03/2024-16:26:43] [TRT-LLM] [I] Total time of building all engines: 00:01:18
Running gptSessionBenchmark with input_len=1024, output_len=200, batch_size=1
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
Allocated 117.50 MiB for execution context memory.
model_name,world_size,num_heads,num_kv_heads,num_layers,hidden_size,vocab_size,precision,batch_size,gpu_weights_percent,input_length,output_length,gpu_peak_mem(gb),build_time(s),tokens_per_sec,percentile95(ms),percentile99(ms),latency(ms),compute_cap,quantization,generation_time(ms),total_generated_tokens,generation_tokens_per_second
/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/nested/__init__.py:166: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at ../aten/src/ATen/NestedTensorImpl.cpp:178.)
  return _nested.nested_tensor(
[dlp2-29-140:159817] *** Process received signal ***
[dlp2-29-140:159817] Signal: Floating point exception (8)
[dlp2-29-140:159817] Signal code: Integer divide-by-zero (1)
[dlp2-29-140:159817] Failing at address: 0x2b228fc4ec59
[dlp2-29-140:159817] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2b220e562630]
[dlp2-29-140:159817] [ 1] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0xa0bc59)[0x2b228fc4ec59]
[dlp2-29-140:159817] [ 2] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0x814383)[0x2b228fa57383]
[dlp2-29-140:159817] [ 3] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0x6ace72)[0x2b228f8efe72]
[dlp2-29-140:159817] [ 4] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0x7aa087)[0x2b228f9ed087]
[dlp2-29-140:159817] [ 5] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0x7ab055)[0x2b228f9ee055]
[dlp2-29-140:159817] [ 6] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0x7ab774)[0x2b228f9ee774]
[dlp2-29-140:159817] [ 7] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(cublasLtMatmul+0x1525)[0x2b228f9f2375]
[dlp2-29-140:159817] [ 8] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm6common15CublasMMWrapper4GemmE17cublasOperation_tS2_iiiPKviS4_iPviffRK20cublasLtMatmulAlgo_tbb+0xfd)[0x2b23b55e7a8d]
[dlp2-29-140:159817] [ 9] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm6common15CublasMMWrapper4GemmE17cublasOperation_tS2_iiiPKviS4_iPviRKSt8optionalI31cublasLtMatmulHeuristicResult_tE+0x60)[0x2b23b55e7f70]

Additional notes

The same command and script run normally on an A100, but they hit a divide-by-zero error on the H20. Could it be because the NVIDIA CUDA version is too low?
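
One way to compare the two environments is to capture the driver, CUDA toolkit, and library versions on both machines, for example (standard version checks only; adjust to your setup):

# GPU name and driver version
nvidia-smi --query-gpu=name,driver_version --format=csv
# CUDA toolkit version
nvcc --version
# TensorRT / TensorRT-LLM versions as seen by Python
python3 -c 'import tensorrt, tensorrt_llm; print(tensorrt.__version__, tensorrt_llm.__version__)'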

hijkzzz commented 3 months ago

Does this happen under normal llama3 weights (from huggingface)?

zxs789 commented 3 months ago

Thanks, I will try normal llama weights.

zxs789 commented 3 months ago

tensorrt_llm/parameter.py

def set_all_one_dummy(self):
    self.value = np.ones(self._shape, trt_dtype_to_np(self._dtype))

tensorrt_llm/models/modeling_utils.py

for name, param in model.named_parameters():
    param.set_all_one_dummy()

When I set all weights to one while building the engine, I got the same divide-by-zero error.

zxs789 commented 3 months ago

To confirm one more thing: does TensorRT-LLM support the H20 device?

hijkzzz commented 3 months ago

Thanks, we're trying to figure out why. Can you give us a specific command that went wrong? This includes converting and building commands.

I just tried the commands on H20, and it works well:

python examples/llama/convert_checkpoint.py --model_dir ./tmp/llama-v2-13b-hf/ --output_dir ./tmp/llama2_13b/ckpt
trtllm-build --checkpoint_dir ./tmp/llama2_13b/ckpt/ --gemm_plugin float16 --output_dir ./tmp/llama2_13b/engine

logs
[06/04/2024-09:36:55] [TRT] [I] Serialized 152374 bytes of compilation cache.
[06/04/2024-09:36:55] [TRT] [I] Serialized 8 timing cache entries
[06/04/2024-09:36:55] [TRT-LLM] [I] Timing cache serialized to model.cache
[06/04/2024-09:36:55] [TRT-LLM] [I] Serializing engine to ./tmp/llama2_13b/engine/rank0.engine...
[06/04/2024-09:38:05] [TRT-LLM] [I] Engine serialized. Total time: 00:01:10
[06/04/2024-09:38:06] [TRT-LLM] [I] Total time of building all engines: 00:04:13
zxs789 commented 3 months ago

I tried chatglm2_6b_tp1 and llama2_13b_tp2 on H20 GPU, and both encountered a divide-by-zero issue. I used random weights, and when I changed all the weights to 1, this problem still occurred. Here are the steps to reproduce it:

System Info: Device: H20, Driver: 535.161.07, cuda-toolkit: 12.2.0

python env: nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 tensorrt 10.0.1 tensorrt-cu12-bindings 10.0.1 tensorrt-cu12-libs 10.0.1 tensorrt-llm 0.11.0.dev2024052800 torch 2.3.0

1. Create the config file:

convert_config/chatglm2_6b/float16/1-gpu/config.json
{
    "architecture": "ChatGLMForCausalLM",
    "dtype": "float16",
    "logits_dtype": "float32",
    "num_hidden_layers": 28,
    "num_attention_heads": 32,
    "num_key_value_heads": 2,
    "hidden_size": 4096,
    "intermediate_size": 13696,
    "norm_epsilon": 1e-05,
    "vocab_size": 65024,
    "position_embedding_type": "rope_gptj",
    "max_position_embeddings": 32768,
    "hidden_act": "swiglu",
    "use_parallel_embedding": false,
    "embedding_sharding_dim": 0,
    "share_embedding_table": false,
    "quantization": {
        "quant_algo": null,
        "kv_cache_quant_algo": null
    },
    "mapping": {
        "world_size": 1,
        "tp_size": 1,
        "pp_size": 1
    },
    "chatglm_version": "chatglm2",
    "add_bias_linear": false,
    "add_qkv_bias": true,
    "apply_query_key_layer_scaling": false,
    "apply_residual_connection_post_layernorm": false,
    "rmsnorm": true,
    "rope_ratio": 1.0
}

2. Build the engine with model_config:

build.sh

#!/bin/bash
model=chatglm2_6b
tp=1
dtype=fp16

model_config=./convert_config/$model/float16/${tp}-gpu/config.json
output_dir=./engines/$model/trt_engines/fp16/${tp}-gpu
max_batch_size=1
max_input_len=1024
max_output_len=200

trtllm-build \
      --model_config $model_config \
      --gpt_attention_plugin float16 \
      --remove_input_padding enable \
      --context_fmha enable \
      --gemm_plugin float16 \
      --max_batch_size $max_batch_size \
      --output_dir $output_dir \
      --context_fmha_fp32_acc enable \
      --enable_xqa enable \
      --max_input_len $max_input_len \
      --max_output_len $max_output_len \
      --multi_block_mode enable \
      --paged_kv_cache disable \
      --remove_input_padding enable \
      --strongly_typed \
      --use_custom_all_reduce enable \
      --use_fused_mlp \
      --workers $tp

3. Run the benchmark:

python3 benchmarks/python/benchmark.py --model chatglm2_6b --mode plugin --batch_size 1 --input_output_len "1024,200" --warm_up 1 --num_runs 4 --engine_dir ./engines/chatglm2_6b/trt_engines/fp16/1-gpu/

4. Below is the code where I changed all the weights to 1:

diff --git a/tensorrt_llm/models/modeling_utils.py b/tensorrt_llm/models/modeling_utils.py
index 5ea7e04..d571434 100644
--- a/tensorrt_llm/models/modeling_utils.py
+++ b/tensorrt_llm/models/modeling_utils.py
@@ -1186,4 +1186,8 @@ def load_model(
                            from_pruned=is_checkpoint_pruned)
         model.load(weights, from_pruned=is_checkpoint_pruned)

+    for name, param in model.named_parameters():
+        param.set_all_one_dummy()
+        print(name, param.shape, param.print_value())
+
     return model
diff --git a/tensorrt_llm/parameter.py b/tensorrt_llm/parameter.py
index 42dc42b..2e92e63 100644
--- a/tensorrt_llm/parameter.py
+++ b/tensorrt_llm/parameter.py
@@ -140,6 +140,12 @@ class Parameter:

         self.value = v

+    def print_value(self):
+        return self._value
+
+    def set_all_one_dummy(self):
+        self.value = np.ones(self._shape, trt_dtype_to_np(self._dtype))
+
     def _get_weights(self) -> trt.Weights:
         if isinstance(self._value, Tensor):
             self._value.producer.__class__ = trt.IConstantLayer
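
(If saved to a file, a diff like the one above can be applied from the repository root with git apply; the patch file name below is just an example.)

git apply set-all-weights-to-one.patch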

Thanks again.

zxs789 commented 3 months ago

Yes, I can build the engine successfully, but I hit the error when I run benchmarks/python/benchmark.py.

hijkzzz commented 3 months ago

Have you tried the weights/checkpoint from Hugging Face? I remember that I ran benchmark.py fine with ChatGLM3 yesterday.

zxs789 commented 3 months ago

Not yet, I am still downloading the real weights from Hugging Face. Can you reproduce this error using random weights?

hijkzzz commented 3 months ago

No, I can't. My commands:

python examples/llama/convert_checkpoint.py --model_dir ./tmp/llama-v2-13b-hf/ --output_dir ./tmp/llama2_13b

trtllm-build --model_config ./tmp/llama2_13b/config.json --gemm_plugin float16 --output_dir ./tmp/llama2_13b/engine2 --gpt_attention_plugin float16 --remove_input_padding enable --context_fmha enable --gemm_plugin float16 --max_batch_size 1 --context_fmha_fp32_acc enable --enable_xqa enable --max_input_len 1024 --max_output_len 200 --multi_block_mode enable --paged_kv_cache disable --remove_input_padding enable --use_custom_all_reduce enable --use_fused_mlp

 python3 benchmarks/python/benchmark.py --model llama_13b --mode plugin --input_output_len "1024,200" --warm_up 1  --num_runs 4 --engine_dir ./tmp/llama2_13b/engine2/ --csv --batch_size 1

logs

Allocated 126.00 MiB for execution context memory.
model_name,world_size,num_heads,num_kv_heads,num_layers,hidden_size,vocab_size,precision,batch_size,gpu_weights_percent,input_length,output_length,gpu_peak_mem(gb),build_time(s),tokens_per_sec,percentile95(ms),percentile99(ms),latency(ms),compute_cap,quantization,generation_time(ms),total_generated_tokens,generation_tokens_per_second
/usr/local/lib/python3.10/dist-packages/torch/nested/__init__.py:166: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
  return _nested.nested_tensor(
[06/04/2024-10:05:30] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024060400

llama_13b,1,40,40,40,5120,32000,float16,1,1.0,1024,200,25.991,0,45.73,4403.05,4403.05,4373.095,sm90,QuantMode.0,4162.918,199.0,47.803

Note that my CUDA env is:

NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4
trtllm 24.04 docker container

It is also recommended to update to the latest TRT-LLM version (the main branch will be updated today or tomorrow).
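
For example, the latest dev wheel can usually be installed from the NVIDIA PyPI index (adjust to your environment):

# pull the most recent pre-release TensorRT-LLM wheel
pip3 install --upgrade --pre tensorrt_llm --extra-index-url https://pypi.nvidia.com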

zxs789 commented 3 months ago

Those are my commands with random weights (see my reproduction steps above). Can you run them on the H20? If they work, I can update my driver and CUDA version, and the TensorRT-LLM version.

hijkzzz commented 3 months ago

Hi, it also works well with random weights + chatglm2_6b. It may be an issue with the driver and CUDA versions.

logs

trtllm-build --model_config ./tmp/chatglm2_6b/ckpt/config.json --gemm_plugin float16 --output_dir ./tmp/chatglm2_6b/engine2 --gpt_attention_plugin float16 --remove_input_padding enable --context_fmha enable --gemm_plugin float16 --max_batch_size 1 --context_fmha_fp32_acc enable --enable_xqa enable --max_input_len 1024 --max_output_len 200 --multi_block_mode enable --paged_kv_cache disable --remove_input_padding enable --use_custom_all_reduce enable --use_fused_mlp

python3 benchmarks/python/benchmark.py --model chatglm2_6b --mode plugin --input_output_len "1024,200" --warm_up 1  --num_runs 4 --engine_dir ./tmp/chatglm2_6b/engine2/ --csv --batch_size 1

...
Allocated 117.50 MiB for execution context memory.
model_name,world_size,num_heads,num_kv_heads,num_layers,hidden_size,vocab_size,precision,batch_size,gpu_weights_percent,input_length,output_length,gpu_peak_mem(gb),build_time(s),tokens_per_sec,percentile95(ms),percentile99(ms),latency(ms),compute_cap,quantization,generation_time(ms),total_generated_tokens,generation_tokens_per_second
/usr/local/lib/python3.10/dist-packages/torch/nested/__init__.py:166: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
  return _nested.nested_tensor(
[06/04/2024-10:28:33] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024060400
chatglm2_6b,1,32,2,28,4096,65024,float16,1,1.0,1024,200,12.433,0,105.65,1904.281,1908.865,1893.025,sm90,QuantMode.0,1796.244,199.0,110.78
zxs789 commented 3 months ago

Thanks a lot, I can try updating the driver and CUDA version.

zxs789 commented 3 months ago

Can I ask for your pip list? Thanks a lot.

hijkzzz commented 3 months ago

I used the clean trtllm-24.04 container

pip list

Package                   Version
------------------------- --------------------------
absl-py                   2.1.0
accelerate                0.30.1
aiohttp                   3.9.3
aiosignal                 1.3.1
annotated-types           0.6.0
apex                      0.1
argon2-cffi               23.1.0
argon2-cffi-bindings      21.2.0
asttokens                 2.4.1
astunparse                1.6.3
async-timeout             4.0.3
attrs                     23.2.0
audioread                 3.0.1
beautifulsoup4            4.12.3
bleach                    6.1.0
blis                      0.7.11
build                     1.2.1
cachetools                5.3.3
catalogue                 2.0.10
certifi                   2024.2.2
cffi                      1.16.0
charset-normalizer        3.3.2
click                     8.1.7
cloudpathlib              0.16.0
cloudpickle               3.0.0
cmake                     3.29.0.1
colored                   2.2.4
coloredlogs               15.0.1
comm                      0.2.2
confection                0.1.4
contourpy                 1.2.1
cuda-python               12.4.0rc7+3.ge75c8a9.dirty
cudf                      24.2.0
cudnn                     1.1.2
cugraph                   24.2.0
cugraph-dgl               24.2.0
cugraph-service-client    24.2.0
cugraph-service-server    24.2.0
cuml                      24.2.0
cupy-cuda12x              13.0.0
cycler                    0.12.1
cymem                     2.0.8
Cython                    3.0.10
dask                      2024.1.1
dask-cuda                 24.2.0
dask-cudf                 24.2.0
datasets                  2.19.2
debugpy                   1.8.1
decorator                 5.1.1
defusedxml                0.7.1
diffusers                 0.28.0
dill                      0.3.8
distributed               2024.1.1
dm-tree                   0.1.8
einops                    0.7.0
evaluate                  0.4.2
exceptiongroup            1.2.0
execnet                   2.0.2
executing                 2.0.1
expecttest                0.1.3
fastjsonschema            2.19.1
fastrlock                 0.8.2
filelock                  3.13.3
flash-attn                2.4.2
fonttools                 4.51.0
frozenlist                1.4.1
fsspec                    2024.2.0
gast                      0.5.4
google-auth               2.29.0
google-auth-oauthlib      0.4.6
graphsurgeon              0.4.6
grpcio                    1.62.1
h5py                      3.10.0
huggingface-hub           0.23.2
humanfriendly             10.0
hypothesis                5.35.1
idna                      3.6
igraph                    0.11.4
importlib_metadata        7.0.2
iniconfig                 2.0.0
intel-openmp              2021.4.0
ipykernel                 6.29.4
ipython                   8.21.0
ipython-genutils          0.2.0
janus                     1.0.0
jedi                      0.19.1
Jinja2                    3.1.3
joblib                    1.3.2
json5                     0.9.24
jsonschema                4.21.1
jsonschema-specifications 2023.12.1
jupyter_client            8.6.1
jupyter_core              5.7.2
jupyter-tensorboard       0.2.0
jupyterlab                2.3.2
jupyterlab_pygments       0.3.0
jupyterlab-server         1.2.0
jupytext                  1.16.1
kiwisolver                1.4.5
langcodes                 3.3.0
lark                      1.1.9
lazy_loader               0.4
librosa                   0.10.1
lightning-thunder         0.1.0
lightning-utilities       0.11.2
llvmlite                  0.42.0
locket                    1.0.0
looseversion              1.3.0
Markdown                  3.6
markdown-it-py            3.0.0
MarkupSafe                2.1.5
matplotlib                3.8.4
matplotlib-inline         0.1.6
mdit-py-plugins           0.4.0
mdurl                     0.1.2
mistune                   3.0.2
mkl                       2021.1.1
mkl-devel                 2021.1.1
mkl-include               2021.1.1
mock                      5.1.0
mpi4py                    3.1.5
mpmath                    1.3.0
msgpack                   1.0.8
multidict                 6.0.5
multiprocess              0.70.16
murmurhash                1.0.10
nbclient                  0.10.0
nbconvert                 7.16.3
nbformat                  5.10.4
nest-asyncio              1.6.0
networkx                  2.6.3
ninja                     1.11.1.1
notebook                  6.4.10
numba                     0.59.0+1.g20ae2b56c
numpy                     1.24.4
nvfuser                   0.1.6a0+a684e2a
nvidia-dali-cuda120       1.36.0
nvidia-modelopt           0.11.2
nvidia-nvimgcodec-cu12    0.2.0.7
nvidia-pyindex            1.0.9
nvtx                      0.2.5
oauthlib                  3.2.2
onnx                      1.16.0
opencv                    4.7.0
opt-einsum                3.3.0
optimum                   1.20.0
optree                    0.11.0
packaging                 23.2
pandas                    1.5.3
pandocfilters             1.5.1
parso                     0.8.4
partd                     1.4.1
pexpect                   4.9.0
pillow                    10.2.0
pip                       24.0
platformdirs              4.2.0
pluggy                    1.4.0
ply                       3.11
polygraphy                0.49.9
pooch                     1.8.1
preshed                   3.0.9
prettytable               3.10.0
prometheus_client         0.20.0
prompt-toolkit            3.0.43
protobuf                  4.24.4
psutil                    5.9.4
ptyprocess                0.7.0
PuLP                      2.8.0
pure-eval                 0.2.2
pyarrow                   14.0.1
pyarrow-hotfix            0.6
pyasn1                    0.6.0
pyasn1_modules            0.4.0
pybind11                  2.12.0
pybind11_global           2.12.0
pycocotools               2.0+nv0.8.0
pycparser                 2.22
pydantic                  2.6.4
pydantic_core             2.16.3
Pygments                  2.17.2
pylibcugraph              24.2.0
pylibcugraphops           24.2.0
pylibraft                 24.2.0
pynvjitlink               0.1.13
pynvml                    11.5.0
pyparsing                 3.1.2
pyproject_hooks           1.1.0
pytest                    8.1.1
pytest-flakefinder        1.1.0
pytest-rerunfailures      14.0
pytest-shard              0.1.2
pytest-xdist              3.5.0
python-dateutil           2.9.0.post0
python-hostlist           1.23.0
pytorch-quantization      2.1.2
pytorch-triton            3.0.0+a9bc1a364
pytz                      2024.1
PyYAML                    6.0.1
pyzmq                     25.1.2
raft-dask                 24.2.0
rapids-dask-dependency    24.2.0a0
referencing               0.34.0
regex                     2023.12.25
requests                  2.32.3
requests-oauthlib         2.0.0
rich                      13.7.1
rmm                       24.2.0
rpds-py                   0.18.0
rsa                       4.9
safetensors               0.4.3
scikit-learn              1.2.0
scipy                     1.12.0
Send2Trash                1.8.2
sentencepiece             0.2.0
setuptools                68.2.2
six                       1.16.0
smart-open                6.4.0
sortedcontainers          2.4.0
soundfile                 0.12.1
soupsieve                 2.5
soxr                      0.3.7
spacy                     3.7.4
spacy-legacy              3.0.12
spacy-loggers             1.0.5
sphinx_glpi_theme         0.6
srsly                     2.4.8
stack-data                0.6.3
StrEnum                   0.4.15
sympy                     1.12
tabulate                  0.9.0
tbb                       2021.12.0
tblib                     3.0.0
tensorboard               2.9.0
tensorboard-data-server   0.6.1
tensorboard-plugin-wit    1.8.1
tensorrt                  10.0.1
tensorrt-llm              0.11.0.dev2024060400
terminado                 0.18.1
texttable                 1.7.0
thinc                     8.2.3
threadpoolctl             3.3.0
thriftpy2                 0.4.17
tinycss2                  1.2.1
tokenizers                0.19.1
toml                      0.10.2
tomli                     2.0.1
toolz                     0.12.1
torch                     2.3.0a0+6ddf5cf85e.nv24.4
torch-tensorrt            2.3.0a0
torchdata                 0.7.1a0
torchtext                 0.17.0a0
torchvision               0.18.0a0
tornado                   6.4
tqdm                      4.66.2
traitlets                 5.9.0
transformer-engine        1.5.0+6a9edc3
transformers              4.40.2
treelite                  4.0.0
typer                     0.9.4
types-dataclasses         0.6.6
typing_extensions         4.10.0
ucx-py                    0.36.0
uff                       0.6.9
urllib3                   1.26.18
wasabi                    1.1.2
wcwidth                   0.2.13
weasel                    0.3.4
webencodings              0.5.1
Werkzeug                  3.0.2
wheel                     0.43.0
xdoctest                  1.0.2
xgboost                   2.0.3
xxhash                    3.4.1
yarl                      1.9.4
zict                      3.0.0
zipp                      3.17.0
zxs789 commented 3 months ago

By "the clean trtllm-24.04 container", do you mean this Docker image: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3?

hijkzzz commented 3 months ago

That should be OK. nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3 is better.
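
A typical way to pull and start that image would be something like the following (the run flags and interactive shell here are generic examples, not from this thread):

docker pull nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3
docker run --rm -it --gpus all --ipc=host nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3 bash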

zxs789 commented 3 months ago

OK, I can pull this Docker image and update my driver version. Thanks again.

RobinJing commented 3 months ago

Hi there, can I ask how to get the 0604 version, tensorrt-llm 0.11.0.dev2024060400? Thanks a lot! BR

zxs789 commented 3 months ago

Hi, you can download it here: https://pypi.nvidia.com/tensorrt-llm/tensorrt_llm-0.11.0.dev2024060400-cp310-cp310-linux_x86_64.whl
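
That wheel can then be installed directly from the URL, for example:

pip3 install https://pypi.nvidia.com/tensorrt-llm/tensorrt_llm-0.11.0.dev2024060400-cp310-cp310-linux_x86_64.whl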

RobinJYM commented 3 months ago

Thanks for your reply. Using tensorrt-llm 0.11.0.dev2024060400 with Triton 24.04 and 24.05, have you hit the "tensorrt not found" issue? I have installed tensorrt and it is in my pip list.

zxs789 commented 3 months ago

No, I didn't hit this error. I installed TensorRT from the release tarball TensorRT-10.0.1.6.Linux.x86_64-gnu.cuda-12.4.tar.gz.
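
For reference, a typical install from that tar package looks roughly like this (a sketch assuming the standard TensorRT tar layout and a Python 3.10 environment; adjust paths and wheel names to your download):

# unpack and expose the shared libraries
tar -xzf TensorRT-10.0.1.6.Linux.x86_64-gnu.cuda-12.4.tar.gz
export LD_LIBRARY_PATH=$PWD/TensorRT-10.0.1.6/lib:$LD_LIBRARY_PATH
# install the Python bindings shipped in the package
pip3 install TensorRT-10.0.1.6/python/tensorrt-*cp310*.whl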

zxs789 commented 3 months ago

@hijkzzz It ran successfully on the H20 with driver 550.54.15 + CUDA 12.4.1 + torch 2.4.0a0+07cecf4.nv24.5-cp310 + trtllm 0604.