NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Failed to build engine with the 70B sq_int model #1478

Open Opdoop opened 4 months ago

Opdoop commented 4 months ago

System Info

4*A800 80G

Who can help?

@Tracin

Reproduction

  1. Build a trt-llm Docker image with TensorRT-LLM v0.9.0:71d8d4d and start the container:
    docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
               --gpus '"device=0,1,2,3"' \
               --volume /mnt:/mnt/ \
               --workdir /app/tensorrt_llm \
               --name tensorrt_llm-release \
               --tmpfs /tmp:exec \
               trt_llm:main_71d8d4d
  2. Quantize the LLaMA2-70B model with the `int8_sq` format:
    python ../quantization/quantize.py --model_dir llama2-70b \
                                       --output_dir quant/llama2-70b_int8-sq \
                                       --dtype float16 \
                                       --qformat int8_sq \
                                       --awq_block_size 128 \
                                       --calib_size 512 \
                                       --batch_size 8 \
                                       --tp_size 4
  3. Build the trt-llm engine:
    trtllm-build --checkpoint_dir quant/llama2-70b_int8-sq \
                 --output_dir /trt_llm/llama2-70b_int8-sq \
                 --gemm_plugin float16 \
                 --max_num_tokens 409600 \
                 --max_batch_size 128 \
                 --max_input_len 8192 \
                 --max_output_len 2048 \
                 --logits_dtype float32 \
                 --gpt_attention_plugin float16 \
                 --remove_input_padding enable

The engine build in step 3 fails with an error.

### Expected behavior

The engine builds successfully.

### actual behavior

The build fails with `AssertionError: Engine building failed, please check error log.` The full output log is below:

<details>

<summary>Logs</summary>

[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024041600
[04/19/2024-14:10:57] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[04/19/2024-14:10:57] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[04/19/2024-14:10:57] [TRT-LLM] [I] Set gemm_plugin to float16.
[04/19/2024-14:10:57] [TRT-LLM] [I] Set nccl_plugin to float16.
[04/19/2024-14:10:57] [TRT-LLM] [I] Set lookup_plugin to None.
[04/19/2024-14:10:57] [TRT-LLM] [I] Set lora_plugin to None.
[04/19/2024-14:10:57] [TRT-LLM] [I] Set moe_plugin to float16.
[04/19/2024-14:10:57] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[04/19/2024-14:10:57] [TRT-LLM] [I] Set context_fmha to True.
[04/19/2024-14:10:57] [TRT-LLM] [I] Set context_fmha_fp32_acc to True.
[04/19/2024-14:10:57] [TRT-LLM] [I] Set paged_kv_cache to True.
[04/19/2024-14:10:57] [TRT-LLM] [I] Set remove_input_padding to True.
[04/19/2024-14:10:57] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/19/2024-14:10:57] [TRT-LLM] [I] Set multi_block_mode to True.
[04/19/2024-14:10:57] [TRT-LLM] [I] Set enable_xqa to True.
[04/19/2024-14:10:57] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[04/19/2024-14:10:57] [TRT-LLM] [I] Set tokens_per_block to 128.
[04/19/2024-14:10:57] [TRT-LLM] [I] Set use_paged_context_fmha to True.
[04/19/2024-14:10:57] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[04/19/2024-14:10:57] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[04/19/2024-14:10:57] [TRT-LLM] [I] Set multiple_profiles to False.
[04/19/2024-14:10:57] [TRT-LLM] [I] Set paged_state to True.
[04/19/2024-14:10:57] [TRT-LLM] [I] Set streamingllm to False.
[04/19/2024-14:10:57] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.

[04/19/2024-14:10:57] [TRT-LLM] [I] Compute capability: (8, 0)
[04/19/2024-14:10:57] [TRT-LLM] [I] SM count: 108
[04/19/2024-14:10:57] [TRT-LLM] [I] SM clock: 1410 MHz
[04/19/2024-14:10:57] [TRT-LLM] [I] int4 TFLOPS: 1247
[04/19/2024-14:10:57] [TRT-LLM] [I] int8 TFLOPS: 623
[04/19/2024-14:10:57] [TRT-LLM] [I] fp8 TFLOPS: 0
[04/19/2024-14:10:57] [TRT-LLM] [I] float16 TFLOPS: 311
[04/19/2024-14:10:57] [TRT-LLM] [I] bfloat16 TFLOPS: 311
[04/19/2024-14:10:57] [TRT-LLM] [I] float32 TFLOPS: 155
[04/19/2024-14:10:57] [TRT-LLM] [I] Total Memory: 80 GiB
[04/19/2024-14:10:57] [TRT-LLM] [I] Memory clock: 1593 MHz
[04/19/2024-14:10:57] [TRT-LLM] [I] Memory bus width: 5120
[04/19/2024-14:10:57] [TRT-LLM] [I] Memory bandwidth: 2039 GB/s
[04/19/2024-14:10:57] [TRT-LLM] [I] NVLink is active: True
[04/19/2024-14:10:57] [TRT-LLM] [I] NVLink version: 5
[04/19/2024-14:10:57] [TRT-LLM] [I] NVLink bandwidth: 300 GB/s
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024041600
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024041600
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024041600
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024041600
[04/19/2024-14:11:06] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 452, GPU 427 (MiB)
[04/19/2024-14:11:08] [TRT] [I] [MemUsageChange] Init CUDA: CPU +1, GPU +0, now: CPU 456, GPU 427 (MiB)
[04/19/2024-14:11:08] [TRT] [I] [MemUsageChange] Init CUDA: CPU +1, GPU +0, now: CPU 456, GPU 427 (MiB)
[04/19/2024-14:11:08] [TRT] [I] [MemUsageChange] Init CUDA: CPU +1, GPU +0, now: CPU 456, GPU 427 (MiB)
[04/19/2024-14:11:28] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1973, GPU +350, now: CPU 2561, GPU 777 (MiB)
[04/19/2024-14:11:29] [TRT-LLM] [I] Set nccl_plugin to float16.
[04/19/2024-14:11:29] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/19/2024-14:11:29] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/vocab_embedding/PLUGIN_V2_AllReduce_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/19/2024-14:11:29] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/19/2024-14:11:29] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1973, GPU +350, now: CPU 2565, GPU 777 (MiB)
[04/19/2024-14:11:29] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1973, GPU +350, now: CPU 2565, GPU 777 (MiB)
[04/19/2024-14:11:29] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/0/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/19/2024-14:11:29] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/19/2024-14:11:29] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/ELEMENTWISE_SUM_1_output_0 and LLaMAForCausalLM/transformer/layers/1/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/19/2024-14:11:29] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/1/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/1/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/19/2024-14:11:29] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/1/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/1/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/19/2024-14:11:29] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/1/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/1/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/19/2024-14:11:29] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/1/ELEMENTWISE_SUM_1_output_0 and LLaMAForCausalLM/transformer/layers/2/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/19/2024-14:11:29] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/2/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/2/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/19/2024-14:11:29] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/2/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/2/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/19/2024-14:11:29] [TRT-LLM] [I] Set nccl_plugin to float16.
[04/19/2024-14:11:29] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/2/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/2/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/19/2024-14:11:29] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/19/2024-14:11:29] [TRT-LLM] [I] Set nccl_plugin to float16.
[04/19/2024-14:11:29] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/19/2024-14:11:29] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/2/ELEMENTWISE_SUM_1_output_0 and LLaMAForCausalLM/transformer/layers/3/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/19/2024-14:11:29] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/3/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/3/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/19/2024-14:11:29] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/vocab_embedding/PLUGIN_V2_AllReduce_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/19/2024-14:11:29] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/19/2024-14:11:29] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/vocab_embedding/PLUGIN_V2_AllReduce_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/19/2024-14:11:29] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/19/2024-14:11:29] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/3/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/3/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/19/2024-14:11:29] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/3/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/3/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/19/2024-14:11:29] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/3/ELEMENTWISE_SUM_1_output_0 and LLaMAForCausalLM/transformer/layers/4/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/19/2024-14:11:29] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/4/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/4/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/19/2024-14:11:29] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/4/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/4/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/19/2024-14:11:29] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/4/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/4/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/19/2024-14:11:29] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/4/ELEMENTWISE_SUM_1_output_0 and LLaMAForCausalLM/transformer/layers/5/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/19/2024-14:11:36] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/ln_f/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/ln_f/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[04/19/2024-14:11:36] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 2655, GPU 813 (MiB)
[04/19/2024-14:11:36] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[04/19/2024-14:11:36] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 2659, GPU 813 (MiB)
[04/19/2024-14:11:36] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[04/19/2024-14:11:36] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[04/19/2024-14:11:36] [TRT] [W] Unused Input: position_ids
[04/19/2024-14:11:36] [TRT] [W] Detected layernorm nodes in FP16.
[04/19/2024-14:11:36] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[04/19/2024-14:11:37] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[04/19/2024-14:11:37] [TRT] [W] Detected layernorm nodes in FP16.
[04/19/2024-14:11:37] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[04/19/2024-14:11:37] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[04/19/2024-14:11:37] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2658, GPU 803 (MiB)
[04/19/2024-14:11:38] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2658, GPU 803 (MiB)
[04/19/2024-14:11:38] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 2659, GPU 813 (MiB)
[04/19/2024-14:11:38] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 2659, GPU 813 (MiB)
[04/19/2024-14:11:38] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[04/19/2024-14:11:38] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[04/19/2024-14:11:40] [TRT] [E] 9: Skipping tactic 0x0000000000000000 due to exception [autotuner.cpp:get_best_tactics:1520] Autotuner: no tactics to implement operation: 360: fc: castmye406-(f32[mye172_sp,2560][]so[], mem_prop=0) | LLaMAForCausalLM/transformer/layers/0/attention/qkv/QUANTIZE_0_output_0'.1-(i8[mye172_sp,8192][]so[], mem_prop=0), mye367_dconst-{0, 0, 0, 0, 0, 0, 0, 0, ...}(i8[8192,2560][2560,1]so[1,0], mem_prop=0), mye370_dconst-{5.54005e-06, 1.75586e-05, 4.26856e-06, 9.56643e-06, 2.4461e-05,
[04/19/2024-14:11:40] [TRT] [E] 9: Skipping tactic 0x0000000000000000 due to exception [autotuner.cpp:get_best_tactics:1520] Autotuner: no tactics to implement operation: 360: fc: castmye406-(f32[mye172_sp,2560][]so[], mem_prop=0) | LLaMAForCausalLM/transformer/layers/0/attention/qkv/QUANTIZE_0_output_0'.1-(i8[mye172_sp,8192][]so[], mem_prop=0), mye367_dconst-{0, 0, 0, 0, 0, 0, 0, 0, ...}(i8[8192,2560][2560,1]so[1,0], mem_prop=0), mye370_dconst-{4.8922e-05, 2.0586e-05, 1.40469e-05, 1.46524e-05, 5.06173e-05, 4
[04/19/2024-14:11:40] [TRT] [E] 10: Could not find any implementation for node {ForeignNode[LLaMAForCausalLM/transformer/layers/0/input_layernorm/CONSTANT_1 + LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0...LLaMAForCausalLM/transformer/layers/0/attention/qkv/CAST_5]}.
[04/19/2024-14:11:41] [TRT] [E] 10: Could not find any implementation for node {ForeignNode[LLaMAForCausalLM/transformer/layers/0/input_layernorm/CONSTANT_1 + LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0...LLaMAForCausalLM/transformer/layers/0/attention/qkv/CAST_5]}.
[04/19/2024-14:11:41] [TRT] [E] 10: [optimizer.cpp::computeCosts::4048] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[LLaMAForCausalLM/transformer/layers/0/input_layernorm/CONSTANT_1 + LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0...LLaMAForCausalLM/transformer/layers/0/attention/qkv/CAST_5]}.)
[04/19/2024-14:11:41] [TRT-LLM] [E] Engine building failed, please check the error log.
[04/19/2024-14:11:41] [TRT] [E] 10: [optimizer.cpp::computeCosts::4048] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[LLaMAForCausalLM/transformer/layers/0/input_layernorm/CONSTANT_1 + LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0...LLaMAForCausalLM/transformer/layers/0/attention/qkv/CAST_5]}.)
[04/19/2024-14:11:41] [TRT-LLM] [E] Engine building failed, please check the error log.
[04/19/2024-14:11:41] [TRT] [I] Serialized 59 bytes of code generator cache.
[04/19/2024-14:11:41] [TRT] [I] Serialized 24982 bytes of compilation cache.
[04/19/2024-14:11:41] [TRT] [I] Serialized 0 timing cache entries
[04/19/2024-14:11:41] [TRT-LLM] [I] Timing cache serialized to model.cache
[04/19/2024-14:11:41] [TRT-LLM] [I] Serializing engine to trt_llm/hoyollama2-70b_int8-sq/rank3.engine...
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 308, in build_and_save
    engine.save(output_dir)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 569, in save
    serialize_engine(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_common.py", line 105, in serialize_engine
    f.write(engine)
TypeError: a bytes-like object is required, not 'NoneType'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 357, in parallel_build
    future.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in get_result
    raise self._exception
TypeError: a bytes-like object is required, not 'NoneType'
[04/19/2024-14:11:41] [TRT-LLM] [I] Serializing engine to trt_llm/hoyollama2-70b_int8-sq/rank0.engine...
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 308, in build_and_save
    engine.save(output_dir)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 569, in save
    serialize_engine(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_common.py", line 105, in serialize_engine
    f.write(engine)
TypeError: a bytes-like object is required, not 'NoneType'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 357, in parallel_build
    future.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in get_result
    raise self._exception
TypeError: a bytes-like object is required, not 'NoneType'
[04/19/2024-14:11:43] [TRT] [E] 9: Skipping tactic 0x0000000000000000 due to exception [autotuner.cpp:get_best_tactics:1520] Autotuner: no tactics to implement operation: 360: fc: castmye406-(f32[mye172_sp,2560][]so[], mem_prop=0) | LLaMAForCausalLM/transformer/layers/0/attention/qkv/QUANTIZE_0_output_0'.1-(i8[mye172_sp,8192][]so[], mem_prop=0), mye367_dconst-{0, 0, 0, 0, 0, 0, 0, 0, ...}(i8[8192,2560][2560,1]so[1,0], mem_prop=0), __mye370_dconst-{5.11017e-05, 3.01524e-05, 8.83986e-06, 1.58633e-05, 1.13223e-05,
[04/19/2024-14:11:43] [TRT] [E] 10: Could not find any implementation for node {ForeignNode[LLaMAForCausalLM/transformer/layers/0/input_layernorm/CONSTANT_1 + LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0...LLaMAForCausalLM/transformer/layers/0/attention/qkv/CAST_5]}.
[04/19/2024-14:11:43] [TRT] [E] 10: [optimizer.cpp::computeCosts::4048] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[LLaMAForCausalLM/transformer/layers/0/input_layernorm/CONSTANT_1 + LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0...LLaMAForCausalLM/transformer/layers/0/attention/qkv/CAST_5]}.)
[04/19/2024-14:11:43] [TRT-LLM] [E] Engine building failed, please check the error log.
[04/19/2024-14:11:43] [TRT-LLM] [I] Serializing engine to trt_llm/hoyollama2-70b_int8-sq/rank2.engine...
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 308, in build_and_save
    engine.save(output_dir)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 569, in save
    serialize_engine(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_common.py", line 105, in serialize_engine
    f.write(engine)
TypeError: a bytes-like object is required, not 'NoneType'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 357, in parallel_build
    future.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in get_result
    raise self._exception
TypeError: a bytes-like object is required, not 'NoneType'
[04/19/2024-14:11:44] [TRT] [E] 9: Skipping tactic 0x0000000000000000 due to exception [autotuner.cpp:get_best_tactics:1520] Autotuner: no tactics to implement operation: 360: fc: castmye406-(f32[mye172_sp,2560][]so[], mem_prop=0) | LLaMAForCausalLM/transformer/layers/0/attention/qkv/QUANTIZE_0_output_0'.1-(i8[mye172_sp,8192][]so[], mem_prop=0), mye367_dconst-{0, 0, 0, 0, 0, 0, 0, 0, ...}(i8[8192,2560][2560,1]so[1,0], mem_prop=0), __mye370_dconst-{3.10001e-05, 4.31095e-05, 5.25548e-05, 1.49551e-05, 3.68126e-05,
[04/19/2024-14:11:45] [TRT] [E] 10: Could not find any implementation for node {ForeignNode[LLaMAForCausalLM/transformer/layers/0/input_layernorm/CONSTANT_1 + LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0...LLaMAForCausalLM/transformer/layers/0/attention/qkv/CAST_5]}.
[04/19/2024-14:11:45] [TRT] [E] 10: [optimizer.cpp::computeCosts::4048] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[LLaMAForCausalLM/transformer/layers/0/input_layernorm/CONSTANT_1 + LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0...LLaMAForCausalLM/transformer/layers/0/attention/qkv/CAST_5]}.)
[04/19/2024-14:11:45] [TRT-LLM] [E] Engine building failed, please check the error log.
[04/19/2024-14:11:45] [TRT-LLM] [I] Serializing engine to trt_llm/hoyollama2-70b_int8-sq/rank1.engine...
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 308, in build_and_save
    engine.save(output_dir)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 569, in save
    serialize_engine(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_common.py", line 105, in serialize_engine
    f.write(engine)
TypeError: a bytes-like object is required, not 'NoneType'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 357, in parallel_build
    future.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in get_result
    raise self._exception
TypeError: a bytes-like object is required, not 'NoneType'
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 454, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 361, in parallel_build
    assert len(exceptions
AssertionError: Engine building failed, please check error log



</details>
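
One way to triage a log like the one above is to list the distinct `[E]` entries in the order they first appear. In this log the earliest errors come from the TensorRT autotuner ("no tactics to implement operation", "Could not find any implementation for node"), and the later `TypeError` appears to be a follow-on from trying to serialize an engine that was never built. The sketch below assumes the log is available as a single string; `split_entries` and `unique_errors` are hypothetical helper names, not part of TensorRT-LLM.

```python
import re
from collections import OrderedDict

# Matches entry prefixes such as "[04/19/2024-14:11:40] [TRT] [E] ".
ENTRY = re.compile(
    r"\[(\d{2}/\d{2}/\d{4}-\d{2}:\d{2}:\d{2})\] \[(TRT|TRT-LLM)\] \[([IWE])\] "
)


def split_entries(log_text):
    """Split a (possibly flattened) log into (timestamp, source, level, message)."""
    matches = list(ENTRY.finditer(log_text))
    entries = []
    for i, m in enumerate(matches):
        # A message runs until the next timestamped prefix or the end of the log.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(log_text)
        message = " ".join(log_text[m.end():end].split())
        entries.append((m.group(1), m.group(2), m.group(3), message))
    return entries


def unique_errors(log_text, max_len=160):
    """Return distinct [E] messages in first-seen order with occurrence counts.

    Messages are compared on their first max_len characters so that entries
    differing only in trailing constants are grouped together.
    """
    counts = OrderedDict()
    for _, source, level, message in split_entries(log_text):
        if level != "E":
            continue
        key = (source, message[:max_len])
        counts[key] = counts.get(key, 0) + 1
    return [(src, msg, n) for (src, msg), n in counts.items()]
```

Calling `unique_errors(open("build.log").read())` (the file name is an assumption) would print each distinct error once, which makes it easier to see which message came first across the four ranks.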

### additional notes

n/a

byshiue commented 4 months ago

Could you try adding `--strongly_typed --builder_opt=4` to the `trtllm-build` command?
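
For reference, the suggestion amounts to re-running step 3 of the reproduction with two extra flags. The sketch below only assembles the argument list; the base arguments are copied from the issue, `build_command` is a hypothetical helper, and whether the flags resolve the failure is not confirmed in this thread.

```python
import shlex

# Arguments copied from step 3 of the reproduction above.
BASE_ARGS = [
    "trtllm-build",
    "--checkpoint_dir", "quant/llama2-70b_int8-sq",
    "--output_dir", "/trt_llm/llama2-70b_int8-sq",
    "--gemm_plugin", "float16",
    "--max_num_tokens", "409600",
    "--max_batch_size", "128",
    "--max_input_len", "8192",
    "--max_output_len", "2048",
    "--logits_dtype", "float32",
    "--gpt_attention_plugin", "float16",
    "--remove_input_padding", "enable",
]

# Flags suggested in the reply; their effect on this failure is unverified here.
SUGGESTED_FLAGS = ["--strongly_typed", "--builder_opt=4"]


def build_command(base=BASE_ARGS, extra=SUGGESTED_FLAGS):
    """Return the argument list with each suggested flag appended at most once."""
    return list(base) + [flag for flag in extra if flag not in base]


if __name__ == "__main__":
    # Print a shell-quoted command line that can be pasted into the container.
    print(shlex.join(build_command()))
```

Keeping the command as a list rather than a single string means it can also be passed directly to `subprocess.run` without shell quoting concerns.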