NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
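As a rough illustration of that Python API (a minimal sketch assuming the high-level LLM API available in recent releases, which may differ from the commit discussed in this issue; the model path is hypothetical):

from tensorrt_llm import LLM, SamplingParams

# Hypothetical model path; the LLM class builds a TensorRT engine under the hood.
llm = LLM(model="lmsys/vicuna-13b-v1.5")

# Generate a short completion with the built engine.
params = SamplingParams(max_tokens=32)
for output in llm.generate(["Hello, my name is"], params):
    print(output.outputs[0].text)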
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[BUG] fp8 run failed on L40s #959

Open sleepwalker2017 opened 9 months ago

sleepwalker2017 commented 9 months ago

Building the FP8 engine appears to succeed, but running inference with gptSessionBenchmark on the resulting engines fails with the errors below. Why is that?

commit id: b57221b764bc579cbb2490154916a871f620e2c4

The build command:

python build.py --model_dir /data/weilong.yu/vicuna-13b-v1.5/ \
                --quantized_fp8_model_path ./quantized_fp8/llama_tp1_rank0.npz \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --output_dir ./tmp/llama/70B/trt_engines/fp8/2-gpu/ \
                --remove_input_padding \
                --enable_context_fmha \
                --enable_fp8 \
                --fp8_kv_cache \
                --strongly_typed \
                --world_size 2 \
                --tp_size 2 \
                --parallel_build \
                --use_inflight_batching --paged_kv_cache \
                --max_batch_size $1
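As a sanity check on the quantization artifact, one can load the .npz file referenced by --quantized_fp8_model_path and list the stored tensors (a minimal sketch; the path is taken from the command above, and the key names depend on the quantization script that produced the file):

import numpy as np

# Path taken from --quantized_fp8_model_path in the build command above.
weights = np.load("./quantized_fp8/llama_tp1_rank0.npz")

# Print each stored array name with its shape and dtype; FP8 scaling factors
# should appear here if the quantization/export step completed correctly.
for name in weights.files:
    arr = weights[name]
    print(f"{name}: shape={arr.shape}, dtype={arr.dtype}")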

The error log when running inference:

mpirun -n 2 --allow-run-as-root benchmarks/gptSessionBenchmark --input_output_len "128;26" --batch_size 32 --model llama --engine_dir ../../examples/llama/tmp/llama/70B/trt_engines/fp8/2-gpu/

[TensorRT-LLM][ERROR] 3: [executionContext.cpp::setInputShape::2309] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2309, condition: satisfyProfile Runtime dimension does not satisfy any optimization profile.)
[TensorRT-LLM][ERROR] 3: [executionContext.cpp::setInputShape::2309] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2309, condition: satisfyProfile Runtime dimension does not satisfy any optimization profile.)
[TensorRT-LLM][ERROR] [TensorRT-LLM][ERROR] Assertion failed: Tensor 'kv_cache_block_pointers_0' has invalid shape (32, 2, 257), expected (-1, 2, -1) (/data/TRT-LLM/TensorRT-LLM/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:149)
1       0x55753ad94005 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100
2       0x7fe6a1ca769f tensorrt_llm::runtime::TllmRuntime::setInputTensors(int, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<tensorrt_llm::runtime::ITensor>, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::shared_ptr<tensorrt_llm::runtime::ITensor> > > > const&) + 1823
3       0x7fe6a1c6325a tensorrt_llm::runtime::GptSession::executeContextStep(std::vector<tensorrt_llm::runtime::GenerationInput, std::allocator<tensorrt_llm::runtime::GenerationInput> > const&, std::vector<tensorrt_llm::runtime::GenerationOutput, std::allocator<tensorrt_llm::runtime::GenerationOutput> >&, std::vector<int, std::allocator<int> > const&, tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager const*) + 874
4       0x7fe6a1c64582 tensorrt_llm::runtime::GptSession::generateBatched(std::vector<tensorrt_llm::runtime::GenerationOutput, std::allocator<tensorrt_llm::runtime::GenerationOutput> >&, std::vector<tensorrt_llm::runtime::GenerationInput, std::allocator<tensorrt_llm::runtime::GenerationInput> > const&, tensorrt_llm::runtime::SamplingConfig const&, std::function<void (int, bool)> const&) + 3106
5       0x7fe6a1c65fb3 tensorrt_llm::runtime::GptSession::generate(tensorrt_llm::runtime::GenerationOutput&, tensorrt_llm::runtime::GenerationInput const&, tensorrt_llm::runtime::SamplingConfig const&) + 3107
6       0x55753ad99a26 benchmarks/gptSessionBenchmark(+0x1aa26) [0x55753ad99a26]
7       0x7fe6883fbd90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fe6883fbd90]
8       0x7fe6883fbe40 __libc_start_main + 128
9       0x55753ad9b975 benchmarks/gptSessionBenchmark(+0x1c975) [0x55753ad9b975]
[TensorRT-LLM][ERROR] [TensorRT-LLM][ERROR] Assertion failed: Tensor 'kv_cache_block_pointers_0' has invalid shape (32, 2, 256), expected (-1, 2, -1) (/data/TRT-LLM/TensorRT-LLM/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:149)
1       0x55f9fb709005 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100
2       0x7f400487869f tensorrt_llm::runtime::TllmRuntime::setInputTensors(int, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<tensorrt_llm::runtime::ITensor>, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::shared_ptr<tensorrt_llm::runtime::ITensor> > > > const&) + 1823
3       0x7f400483425a tensorrt_llm::runtime::GptSession::executeContextStep(std::vector<tensorrt_llm::runtime::GenerationInput, std::allocator<tensorrt_llm::runtime::GenerationInput> > const&, std::vector<tensorrt_llm::runtime::GenerationOutput, std::allocator<tensorrt_llm::runtime::GenerationOutput> >&, std::vector<int, std::allocator<int> > const&, tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager const*) + 874
4       0x7f4004835582 tensorrt_llm::runtime::GptSession::generateBatched(std::vector<tensorrt_llm::runtime::GenerationOutput, std::allocator<tensorrt_llm::runtime::GenerationOutput> >&, std::vector<tensorrt_llm::runtime::GenerationInput, std::allocator<tensorrt_llm::runtime::GenerationInput> > const&, tensorrt_llm::runtime::SamplingConfig const&, std::function<void (int, bool)> const&) + 3106
5       0x7f4004836fb3 tensorrt_llm::runtime::GptSession::generate(tensorrt_llm::runtime::GenerationOutput&, tensorrt_llm::runtime::GenerationInput const&, tensorrt_llm::runtime::SamplingConfig const&) + 3107
6       0x55f9fb70ea26 benchmarks/gptSessionBenchmark(+0x1aa26) [0x55f9fb70ea26]
7       0x7f3feafccd90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f3feafccd90]
8       0x7f3feafcce40 __libc_start_main + 128
9       0x55f9fb710975 benchmarks/gptSessionBenchmark(+0x1c975) [0x55f9fb710975]
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[57264,1],0]
  Exit code:    1
--------------------------------------------------------------------------
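The shape assertion suggests the runtime batch size and KV-cache layout do not match the optimization profile the engine was built with. A quick check (a minimal sketch; it assumes build.py wrote a config.json into the engine directory, which is the usual layout, and the exact field names may differ between TensorRT-LLM versions) is to print the builder settings and compare them with the benchmark's --batch_size 32:

import json
from pathlib import Path

# Engine directory passed to --engine_dir in the mpirun command above.
engine_dir = Path("examples/llama/tmp/llama/70B/trt_engines/fp8/2-gpu")

# build.py normally drops a config.json next to the engine files; treat the
# field names below as assumptions, since they vary between versions.
config = json.loads((engine_dir / "config.json").read_text())
builder_cfg = config.get("builder_config", config)
plugin_cfg = config.get("plugin_config", {})

print("max_batch_size :", builder_cfg.get("max_batch_size"))
print("max_input_len  :", builder_cfg.get("max_input_len"))
print("paged_kv_cache :", plugin_cfg.get("paged_kv_cache"))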
byshiue commented 9 months ago

Please follow the issue template when filing an issue. Thank you for your cooperation.