NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Fail to build w4a8_awq/int4_awq on Llama3-8B #1792

Open Hongbosherlock opened 3 months ago

Hongbosherlock commented 3 months ago

System Info

Ubuntu 20.04
tensorrt 10.0.1
tensorrt-cu12 10.0.1
tensorrt-cu12-bindings 10.0.1
tensorrt-cu12-libs 10.0.1
tensorrt-llm 0.11.0.dev2024052100

GPU: NVIDIA L40S

Who can help?

@Barry-Delaney @Tracin @byshiue

Reproduction

quantize (w4a8_awq):

python examples/quantization/quantize.py --model_dir /target/model/llama3_8B \
                                         --dtype float16 \
                                         --qformat w4a8_awq \
                                         --awq_block_size 128 \
                                         --output_dir /target/model/quantized_w4a8-awq \
                                         --calib_size 32
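
For the int4_awq case described in the additional notes below, presumably only the qformat flag (and, for clarity, the output directory) changes; a sketch, with an illustrative output path:

python examples/quantization/quantize.py --model_dir /target/model/llama3_8B \
                                         --dtype float16 \
                                         --qformat int4_awq \
                                         --awq_block_size 128 \
                                         --output_dir /target/model/quantized_int4-awq \
                                         --calib_size 32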

build:

trtllm-build --checkpoint_dir /target/model/quantized_w4a8-awq \
             --output_dir /target/model/trt_engines/w4a8_AWQ/1-gpu/ \
             --gemm_plugin float16 \
             --gpt_attention_plugin float16 \
             --context_fmha enable \
             --remove_input_padding enable \
             --paged_kv_cache enable \
             --max_batch_size 50 \
             --max_input_len 3000 \
             --max_output_len 3000

run:

python3 run.py --max_output_len=500 \
               --tokenizer_dir=/target/model/llama3_8B \
               --engine_dir=/target/model/trt_engines/w4a8_AWQ/1-gpu/ 

Expected behavior

Get an inference result.

actual behavior

Got the following error message:

[TensorRT-LLM][INFO] Loaded engine size: 5446 MiB
[TensorRT-LLM][INFO] Allocated 15836.34 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5442 (MiB)
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 94
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 171008. Allocating 22414360576 bytes.
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] Assertion failed: No valid weight only groupwise GEMM tactic(It is usually caused by the failure to execute all candidate configurations of the CUTLASS kernel, please pay attention to the warning information when building the engine.) (/target/cpp/tensorrt_llm/plugins/weightOnlyGroupwiseQuantMatmulPlugin/weightOnlyGroupwiseQuantMatmulPlugin.cpp:452)
1       0x7f4070d08073 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x56073) [0x7f4070d08073]
2       0x7f4070d9f9da tensorrt_llm::plugins::WeightOnlyGroupwiseQuantMatmulPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 1786
3       0x7f417278cbec /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1060bec) [0x7f417278cbec]
4       0x7f4172742217 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1016217) [0x7f4172742217]
5       0x7f4172743b79 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1017b79) [0x7f4172743b79]
6       0x7f40a0fd7f34 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int) + 52
7       0x7f40a0fd8496 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) + 502
8       0x7f40a0fe5094 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 2164
9       0x7f40a1009214 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 100
10      0x7f40a100b49c tensorrt_llm::executor::Executor::Impl::executionLoop() + 380
11      0x7f42590b0253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f42590b0253]
12      0x7f43035ceb43 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f43035ceb43]
13      0x7f430365fbb4 clone + 68

When building the engine earlier, I got warnings like:

[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=1, n=1536, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=2, n=1536, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=4, n=1536, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=8, n=1536, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=16, n=1536, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=32, n=1536, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=64, n=1536, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=128, n=1536, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=256, n=1536, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=512, n=1536, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=1024, n=1536, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=2048, n=1536, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=4096, n=1536, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=8192, n=1536, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=1, n=1024, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=2, n=1024, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=4, n=1024, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=8, n=1024, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=16, n=1024, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=32, n=1024, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=64, n=1024, k=4096). Will try to use default or fail at runtime

additional notes

When I try the same approach with int4_awq, I get errors during trtllm-build:

void cutlass::gemm::kernel::GemmFpAIntB<Mma_, Epilogue_, ThreadblockSwizzle_, KernelArch, SplitKSerial>::run_kernel(const cutlass::gemm::kernel::GemmFpAIntB<Mma_, Epilogue_, ThreadblockSwizzle_, KernelArch, SplitKSerial>::Params &, cutlass::gemm::kernel::GemmFpAIntB<Mma_, Epilogue_, ThreadblockSwizzle_, KernelArch, SplitKSerial>::SharedStorage &) [with CompilationArch = cutlass::arch::Sm80; Mma_ = cutlass::gemm::threadblock::DqMmaMultistage<cutlass::gemm::GemmShape<16, 128, 64>, 
cutlass::transform::threadblock::PredicatedTileAccessIterator<cutlass::MatrixShape<16, 64>, cutlass::half_t, cutlass::layout::RowMajor, 1, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<64, 16>, 128, cutlass::PitchLinearShape<8, 4>, 8>, cutlass::Array<cutlass::half_t, 8, false>, false, cutlass::layout::NoPermute>, cutlass::transform::threadblock::RegularTileAccessIterator<cutlass::MatrixShape<16, 64>, cutlass::half_t, cutlass::layout::RowMajorTensorOpMultiplicandCrosswise<16, 64>, 0, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<64, 16>, 128, cutlass::PitchLinearShape<8, 4>, 8>, 16>, cutlass::arch::CacheOperation::Global, cutlass::transform::threadblock::PredicatedTileAccessIterator<cutlass::MatrixShape<256, 32>, cutlass::integer_subbyte<4, false>, cutlass::layout::ColumnMajor, 0, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<256, 32>, 128, cutlass::PitchLinearShape<8, 4>, 32>, cutlass::Array<cutlass::integer_subbyte<4, false>, 32, false>, false, cutlass::layout::NoPermute>, cutlass::transform::threadblock::RegularTileAccessIterator<cutlass::MatrixShape<64, 128>, cutlass::integer_subbyte<4, false>, cutlass::layout::ColumnMajorTensorOpMultiplicandCrosswise<4, 64>, 1, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<64, 128>, 128, cutlass::PitchLinearShape<2, 16>, 32>, 16>, cutlass::arch::CacheOperation::Global, cutlass::transform::threadblock::FineGrainedScaleZeroIterator<cutlass::MatrixShape<1, 128>, cutlass::half_t, cutlass::layout::RowMajor, 0, 8>, cutlass::transform::threadblock::FineGrainedScaleZeroIterator<cutlass::MatrixShape<1, 128>, cutlass::half_t, cutlass::layout::RowMajor, 0, 8>, float, cutlass::layout::RowMajor, 
cutlass::gemm::threadblock::MmaPolicy<cutlass::gemm::warp::MmaTensorOpComputeBWithF16<cutlass::gemm::GemmShape<16, 32, 64>, cutlass::half_t, cutlass::layout::RowMajorTensorOpMultiplicandCrosswise<16, 64>, cutlass::integer_subbyte<4, false>, cutlass::layout::ColumnMajorTensorOpMultiplicandCrosswise<4, 64>, float, cutlass::layout::RowMajor, cutlass::gemm::warp::MmaTensorOpPolicy<cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 16>, 32, cutlass::half_t, cutlass::layout::RowMajor, cutlass::half_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, cutlass::arch::OpMultiplyAdd>, cutlass::MatrixShape<1, 1>>, cutlass::gemm::GemmShape<16, 8, 32>, 1, false, __nv_bool>, cutlass::MatrixShape<0, 0>, cutlass::MatrixShape<0, 0>, 1>, 3, cutlass::FastInterleavedAndBiasedNumericArrayConverter<cutlass::half_t, cutlass::integer_subbyte<4, false>, 32>, cutlass::WeightOnlyQuantOp::FINEGRAINED_SCALE_ONLY, cutlass::gemm::SharedMemoryClearOption::kNone, void>; Epilogue_ = cutlass::epilogue::threadblock::Epilogue<cutlass::gemm::GemmShape<16, 128, 64>, cutlass::gemm::warp::MmaTensorOpComputeBWithF16<cutlass::gemm::GemmShape<16, 32, 64>, cutlass::half_t, cutlass::layout::RowMajorTensorOpMultiplicandCrosswise<16, 64>, cutlass::integer_subbyte<4, false>, cutlass::layout::ColumnMajorTensorOpMultiplicandCrosswise<4, 64>, float, cutlass::layout::RowMajor, cutlass::gemm::warp::MmaTensorOpPolicy<cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 16>, 32, cutlass::half_t, cutlass::layout::RowMajor, cutlass::half_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, cutlass::arch::OpMultiplyAdd>, cutlass::MatrixShape<1, 1>>, cutlass::gemm::GemmShape<16, 8, 32>, 1, false, __nv_bool>, 1, cutlass::epilogue::threadblock::PredicatedTileIterator<cutlass::epilogue::threadblock::OutputTileOptimalThreadMap<cutlass::epilogue::threadblock::OutputTileShape<128, 8, 1, 1, 1>, cutlass::epilogue::threadblock::OutputTileShape<1, 2, 1, 1, 2>, 128, 8, 16>, cutlass::half_t, false, cutlass::layout::NoPermute, false>, cutlass::epilogue::warp::FragmentIteratorTensorOp<cutlass::gemm::GemmShape<16, 32, 64>, cutlass::gemm::GemmShape<16, 8, 16>, float, cutlass::Array<float, 4, true>, cutlass::layout::RowMajor>, cutlass::epilogue::warp::TileIteratorTensorOpMixed<cutlass::gemm::GemmShape<16, 32, 64>, cutlass::gemm::GemmShape<16, 8, 16>, float, 32, 16, 8, 8, false>, cutlass::epilogue::threadblock::SharedLoadIteratorMixed<cutlass::epilogue::threadblock::OutputTileOptimalThreadMap<cutlass::epilogue::threadblock::OutputTileShape<128, 8, 1, 1, 1>, cutlass::epilogue::threadblock::OutputTileShape<1, 2, 1, 1, 2>, 128, 8, 16>::CompactedThreadMap, float, 32, 16, 8, 8, false>, cutlass::epilogue::thread::LinearCombination<cutlass::half_t, 8, float, float, cutlass::epilogue::thread::ScaleType::NoBetaScaling, cutlass::FloatRoundStyle::round_to_nearest, cutlass::half_t>, cutlass::MatrixShape<0, 8>, 2, 1>; ThreadblockSwizzle_ = cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<1>; KernelArch = cutlass::arch::Sm89; __nv_bool SplitKSerial = true] not implemented

and also:

[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=1, n=1536, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=2, n=1536, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=4, n=1536, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=8, n=1536, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=16, n=1536, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=32, n=1536, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=64, n=1536, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=128, n=1536, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=256, n=1536, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=512, n=1536, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=1024, n=1536, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=2048, n=1536, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=4096, n=1536, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=8192, n=1536, k=4096). Will try to use default or fail at runtime
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] Assertion failed: Can't free tmp workspace for GEMM tactics profiling. (/target/cpp/tensorrt_llm/plugins/common/gemmPluginProfiler.cpp:190)
1       0x7f72fcd08073 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x56073) [0x7f72fcd08073]
2       0x7f72fcdc38ea tensorrt_llm::plugins::GemmPluginProfiler<tensorrt_llm::cutlass_extensions::CutlassGemmConfig, std::shared_ptr<tensorrt_llm::kernels::cutlass_kernels::CutlassFpAIntBGemmRunnerInterface>, tensorrt_llm::plugins::GemmIdCore, tensorrt_llm::plugins::GemmIdCoreHash>::freeTmpData() + 106
3       0x7f72fcdcd6b9 tensorrt_llm::plugins::GemmPluginProfiler<tensorrt_llm::cutlass_extensions::CutlassGemmConfig, std::shared_ptr<tensorrt_llm::kernels::cutlass_kernels::CutlassFpAIntBGemmRunnerInterface>, tensorrt_llm::plugins::GemmIdCore, tensorrt_llm::plugins::GemmIdCoreHash>::profileTactics(std::shared_ptr<tensorrt_llm::kernels::cutlass_kernels::CutlassFpAIntBGemmRunnerInterface> const&, nvinfer1::DataType const&, tensorrt_llm::plugins::GemmDims const&, tensorrt_llm::plugins::GemmIdCore const&) + 1113
4       0x7f72fcd9ed9d tensorrt_llm::plugins::WeightOnlyGroupwiseQuantMatmulPlugin::initialize() + 13
hijkzzz commented 3 months ago

Could you try pip install tensorrt_llm==0.11.0.dev2024061100 first? Or you can try pip install tensorrt_llm==0.11.0.dev2024061800 tomorrow. Thanks
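
These dev wheels are hosted on NVIDIA's pip index rather than PyPI proper, so the install typically needs the extra index URL; a minimal sketch:

pip install tensorrt_llm==0.11.0.dev2024061800 --extra-index-url https://pypi.nvidia.com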

hijkzzz commented 3 months ago

It works well using the latest version:

| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:01:00.0 Off |                    0 |
| N/A   31C    P8             32W /  350W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

python examples/quantization/quantize.py --model_dir /home/scratch.trt_llm_data/llm-models/llama-models-v3/llama-v3-8b-instruct-hf/ --dtype float16 --qformat int4_awq --awq_block_size 128 --output_dir ./tmp/llama3-8b-awq --calib_size 32

trtllm-build --checkpoint_dir ./tmp/llama3-8b-awq --output_dir ./tmp/llama3-8b-awq-engine --gemm_plugin float16 --gpt_attention_plugin float16 --context_fmha enable --remove_input_padding enable --paged_kv_cache enable --max_input_len 3000 --max_output_len 3000

python examples/run.py --max_output_len=500 --tokenizer_dir=/home/scratch.trt_llm_data/llm-models/llama-models-v3/llama-v3-8b-instruct-hf/ --engine_dir=./tmp/llama3-8b-awq-engine

[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061800
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[06/18/2024-10:05:23] [TRT-LLM] [I] Load engine takes: 34.90785765647888 sec
Input [Text 0]: "<|begin_of_text|>Born in north-east France, Soyer trained as a"
Output [Text 0 Beam 0]: " painter and sculptor before turning to photography. He began his career in the 1920s, working for various magazines and newspapers, and quickly gained a reputation for his innovative and expressive style. Soyer's photographs often featured everyday life, landscapes, and still-life compositions, and were characterized by their use of light, texture, and composition. He was also known for his portraits of famous people, including artists, writers, and musicians. Soyer's work was widely exhibited and published, and he is considered one of the most important French photographers of the 20th century. (Source: Getty Museum) [more]
...
Hongbosherlock commented 3 months ago

Successfully installed the new version, but got errors when running:

Traceback (most recent call last):
  File "/TensorRT-LLM/examples/run.py", line 23, in <module>
    from utils import (DEFAULT_HF_MODEL_DIRS, DEFAULT_PROMPT_TEMPLATES,
  File "/TensorRT-LLM/examples/utils.py", line 23, in <module>
    from tensorrt_llm.builder import get_engine_version
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/__init__.py", line 32, in <module>
    import tensorrt_llm.functional as functional
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 25, in <module>
    import tensorrt as trt
  File "/usr/local/lib/python3.10/dist-packages/tensorrt/__init__.py", line 18, in <module>
    from tensorrt_bindings import *
ModuleNotFoundError: No module named 'tensorrt_bindings'
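
The traceback suggests a stale tensorrt package whose tensorrt_bindings companion is missing. One recovery path (a sketch, assuming the pip-installed packages listed in the system info above and NVIDIA's pip index) is to purge and reinstall:

pip uninstall -y tensorrt tensorrt-cu12 tensorrt-cu12-bindings tensorrt-cu12-libs tensorrt_llm
pip install tensorrt_llm==0.11.0.dev2024061800 --extra-index-url https://pypi.nvidia.com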
hijkzzz commented 3 months ago


Please try the container nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3, or build the container using make -C docker release_build.
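
A minimal way to launch that container (a sketch, assuming the NVIDIA Container Toolkit is installed so --gpus works, and mounting the model directory used above) is:

docker run --rm -it --gpus all \
    -v /target/model:/target/model \
    nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3 bash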

Hongbosherlock commented 3 months ago

Thanks, it works for me now.

Hongbosherlock commented 3 months ago

Hi @hijkzzz, can you run the benchmark successfully on the L40S? When I run:

./benchmarks/gptSessionBenchmark \
    --engine_dir "/target/model/trt_engines/w4a8_AWQ/1-gpu/" \
    --batch_size "1" \
    --input_output_len "60,20"

I got errors:

[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: key_size <= remaining_buffer_size (/target/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/cubinObjRegistry.h:49)
1       0x5584e5f4b373 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7f0e64a684b7 tensorrt_llm::kernels::jit::CubinObjRegistryTemplate<tensorrt_llm::kernels::XQAKernelFullHashKey, tensorrt_llm::kernels::XQAKernelFullHasher>::CubinObjRegistryTemplate(void const*, unsigned long) + 1639
3       0x7f0e64a674f2 tensorrt_llm::kernels::DecoderXQARunner::Resource::Resource(void const*, unsigned long) + 50
4       0x7f0e9ad778f9 tensorrt_llm::plugins::GPTAttentionPluginCommon::GPTAttentionPluginCommon(void const*, unsigned long) + 873
5       0x7f0e9ad960d3 tensorrt_llm::plugins::GPTAttentionPlugin::GPTAttentionPlugin(void const*, unsigned long) + 19
6       0x7f0e9ad96152 tensorrt_llm::plugins::GPTAttentionPluginCreator::deserializePlugin(char const*, void const*, unsigned long) + 50
7       0x7f0e1493f102 /usr/local/tensorrt/lib/libnvinfer.so.10(+0x1066102) [0x7f0e1493f102]
8       0x7f0e1493a1de /usr/local/tensorrt/lib/libnvinfer.so.10(+0x10611de) [0x7f0e1493a1de]
9       0x7f0e148b5177 /usr/local/tensorrt/lib/libnvinfer.so.10(+0xfdc177) [0x7f0e148b5177]
10      0x7f0e148b33fe /usr/local/tensorrt/lib/libnvinfer.so.10(+0xfda3fe) [0x7f0e148b33fe]
11      0x7f0e148cbf27 /usr/local/tensorrt/lib/libnvinfer.so.10(+0xff2f27) [0x7f0e148cbf27]
12      0x7f0e148cee7d /usr/local/tensorrt/lib/libnvinfer.so.10(+0xff5e7d) [0x7f0e148cee7d]
13      0x7f0e148cf3b4 /usr/local/tensorrt/lib/libnvinfer.so.10(+0xff63b4) [0x7f0e148cf3b4]
14      0x7f0e148fe64f /usr/local/tensorrt/lib/libnvinfer.so.10(+0x102564f) [0x7f0e148fe64f]
15      0x7f0e148ff3f5 /usr/local/tensorrt/lib/libnvinfer.so.10(+0x10263f5) [0x7f0e148ff3f5]
16      0x7f0e148ff489 /usr/local/tensorrt/lib/libnvinfer.so.10(+0x1026489) [0x7f0e148ff489]
17      0x7f0e666a8a68 tensorrt_llm::runtime::TllmRuntime::TllmRuntime(void const*, unsigned long, float, nvinfer1::ILogger&) + 504
18      0x7f0e66653db6 tensorrt_llm::runtime::GptSession::GptSession(tensorrt_llm::runtime::GptSession::Config const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, void const*, unsigned long, std::shared_ptr<nvinfer1::ILogger>) + 1126
19      0x5584e5f4fce0 ./benchmarks/gptSessionBenchmark(+0x1dce0) [0x5584e5f4fce0]
20      0x7f0e5f5edd90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f0e5f5edd90]
21      0x7f0e5f5ede40 __libc_start_main + 128
22      0x5584e5f533c5 ./benchmarks/gptSessionBenchmark(+0x213c5) [0x5584e5f533c5]
nv-guomingz commented 3 months ago

@Hongbosherlock Based on your crash log, I guess you were trying to run the benchmark in w4a8_awq mode, right? If so, I managed to run the benchmark on L40:

./cpp/build/benchmarks/gptSessionBenchmark  --engine_dir examples/quantization/engine_outputs --batch_size "1" --input_output_len "60,20"

Benchmarking done. Iteration: 10, duration: 1.56 sec.
Latencies: [155.62, 155.81, 155.41, 155.40, 156.54, 155.34, 155.42, 155.83, 156.30, 155.64]
[BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 155.73 tokensPerSec 128.43 generation_time(ms) 145.62 generationTokensPerSec 137.35 gpu_peak_mem(gb) 43.57

Even for the original int4_awq, I could also run the benchmark, with the output below:

./cpp/build/benchmarks/gptSessionBenchmark  --engine_dir examples/quantization/engine_outputs --batch_size "1" --input_output_len "60,20"

Benchmarking done. Iteration: 10, duration: 1.57 sec.
Latencies: [156.85, 156.74, 156.75, 156.88, 156.70, 156.73, 156.82, 156.93, 156.74, 156.89]
[BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 156.80 tokensPerSec 127.55 generation_time(ms) 144.26 generationTokensPerSec 138.64 gpu_peak_mem(gb) 43.61
geraldstanje commented 3 months ago

Hi, which NVIDIA Docker image ships the latest TensorRT-LLM version (0.11.0.dev)?