NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
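A minimal sketch of that Python API, assuming a release that ships the high-level `tensorrt_llm.LLM` entry point (older versions used per-model build scripts such as examples/llama/build.py instead; model name and sampling values below are illustrative):

    # Minimal sketch of the high-level TensorRT-LLM Python API.
    from tensorrt_llm import LLM, SamplingParams

    llm = LLM(model="lmsys/vicuna-13b-v1.5")   # builds or loads a TensorRT engine
    sampling = SamplingParams(temperature=0.8, top_p=0.95)

    for output in llm.generate(["Hello, my name is"], sampling):
        print(output.outputs[0].text)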
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

FP8 run failed on L40s #961

Open sleepwalker2017 opened 10 months ago

sleepwalker2017 commented 10 months ago

System Info

CPU: x86_64

GPU: L40s

TensorRT-LLM branch: main
commit id: b57221b764bc579cbb2490154916a871f620e2c4

CUDA: NVIDIA-SMI 535.154.05, Driver Version 535.154.05, CUDA Version 12.3

Who can help?

No response

Information

Tasks

Reproduction

  1. Build the engine from Vicuna 13B v1.5:

    python ../quantization/quantize.py --model_dir /data/vicuna-13b-v1.5/ \
                                       --dtype float16 \
                                       --qformat fp8 \
                                       --export_path ./quantized_fp8 \
                                       --calib_size 512

    python build.py --model_dir /data/weilong.yu/vicuna-13b-v1.5/ \
                    --dtype float16 \
                    --use_gpt_attention_plugin float16 \
                    --use_gemm_plugin float16 \
                    --output_dir ./tmp/llama/13B/trt_engines/fp16/$2-gpu/ \
                    --max_batch_size $1 \
                    --tp_size $2 \
                    --world_size $2 \
                    --parallel_build \
                    --use_inflight_batching \
                    --remove_input_padding \
                    --paged_kv_cache \
                    --enable_context_fmha


2. Run gptSessionBenchmark:

mpirun -n 2 --allow-run-as-root benchmarks/gptSessionBenchmark --input_output_len "128;26" --batch_size 32 --model llama --engine_dir ../../examples/llama/tmp/llama/70B/trt_engines/fp8/2-gpu/
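Note that step 1 writes an FP16 engine to ./tmp/llama/13B/trt_engines/fp16/$2-gpu/ and never references the ./quantized_fp8 checkpoint, while step 2 points --engine_dir at .../70B/trt_engines/fp8/2-gpu/. Before benchmarking, it is worth confirming what the engine in that directory was actually built with; a minimal sketch that reads the engine's config.json (field names follow the builder_config layout written by builds of this era and may differ across versions):

    # Sanity check: compare the engine's build-time limits with the
    # benchmark arguments (--batch_size 32, input length 128).
    import json
    import pathlib

    # Adjust to the --engine_dir passed to gptSessionBenchmark.
    engine_dir = pathlib.Path("examples/llama/tmp/llama/70B/trt_engines/fp8/2-gpu")
    builder = json.loads((engine_dir / "config.json").read_text())["builder_config"]

    print("max_batch_size:", builder["max_batch_size"])  # must be >= 32
    print("max_input_len:", builder["max_input_len"])    # must be >= 128
    print("precision:", builder["precision"])            # fp16 build vs. intended fp8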


### Expected behavior

It should start the inference benchmark.

### Actual behavior

mpirun -n 2 --allow-run-as-root benchmarks/gptSessionBenchmark --input_output_len "128;26" --batch_size 32 --model llama --engine_dir ../../examples/llama/tmp/llama/70B/trt_engines/fp8/2-gpu/

[TensorRT-LLM][ERROR] 3: [executionContext.cpp::setInputShape::2309] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2309, condition: satisfyProfile Runtime dimension does not satisfy any optimization profile.)
[TensorRT-LLM][ERROR] 3: [executionContext.cpp::setInputShape::2309] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2309, condition: satisfyProfile Runtime dimension does not satisfy any optimization profile.)
[TensorRT-LLM][ERROR] Assertion failed: Tensor 'kv_cache_block_pointers_0' has invalid shape (32, 2, 257), expected (-1, 2, -1) (/data/TRT-LLM/TensorRT-LLM/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:149)
1  0x55753ad94005 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<...> const&) + 100
2  0x7fe6a1ca769f tensorrt_llm::runtime::TllmRuntime::setInputTensors(int, std::unordered_map<std::__cxx11::basic_string<...>, std::shared_ptr<...>, ...> const&) + 1823
3  0x7fe6a1c6325a tensorrt_llm::runtime::GptSession::executeContextStep(std::vector<tensorrt_llm::runtime::GenerationInput> const&, std::vector<tensorrt_llm::runtime::GenerationOutput>&, std::vector<int> const&, tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager const*) + 874
4  0x7fe6a1c64582 tensorrt_llm::runtime::GptSession::generateBatched(std::vector<tensorrt_llm::runtime::GenerationOutput>&, std::vector<tensorrt_llm::runtime::GenerationInput> const&, tensorrt_llm::runtime::SamplingConfig const&, std::function<void (int, bool)> const&) + 3106
5  0x7fe6a1c65fb3 tensorrt_llm::runtime::GptSession::generate(tensorrt_llm::runtime::GenerationOutput&, tensorrt_llm::runtime::GenerationInput const&, tensorrt_llm::runtime::SamplingConfig const&) + 3107
6  0x55753ad99a26 benchmarks/gptSessionBenchmark(+0x1aa26)
7  0x7fe6883fbd90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)
8  0x7fe6883fbe40 __libc_start_main + 128
9  0x55753ad9b975 benchmarks/gptSessionBenchmark(+0x1c975)

[TensorRT-LLM][ERROR] Assertion failed: Tensor 'kv_cache_block_pointers_0' has invalid shape (32, 2, 256), expected (-1, 2, -1) (/data/TRT-LLM/TensorRT-LLM/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:149)
(the second rank aborts with the same stack trace at different addresses)

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[57264,1],0] Exit code: 1
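The first error ("Runtime dimension does not satisfy any optimization profile") usually indicates that the runtime shapes implied by --batch_size 32 and --input_output_len "128;26" fall outside the ranges the engine was built for. The profile ranges baked into a serialized engine can be dumped with TensorRT's Python bindings; a minimal sketch, assuming the tensor-based API of TensorRT 8.5+ (the engine file name is illustrative):

    # Dump the (min, opt, max) optimization-profile shapes of every input
    # tensor in a serialized TensorRT engine.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    with open("llama_rank0.engine", "rb") as f:  # illustrative file name
        engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

    for profile in range(engine.num_optimization_profiles):
        for i in range(engine.num_io_tensors):
            name = engine.get_tensor_name(i)
            if engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
                # Returns (min, opt, max) shapes for this input under this profile.
                print(profile, name, engine.get_tensor_profile_shape(name, profile))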



### Additional notes

None
sleepwalker2017 commented 10 months ago

Hello @byshiue, I have updated the issue, please take a look. Thank you.

nv-guomingz commented 1 week ago

Hi, we tried the latest code base and could not reproduce the issue. Could you please try again?

Do you still have any further issues or questions? If not, we'll close this issue soon.