NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

All of the activation values are zero in benchmark #844

Open · leizhao1234 opened this issue 10 months ago

leizhao1234 commented 10 months ago

When I was running the benchmark for Llama 70B, I found that all of the activation values are zero.

```
python build.py --model_dir /code/tensorrt_llm/models/Llama-2-70b-chat-hf/ --dtype float16 --remove_input_padding --use_gpt_attention_plugin float16 --enable_context_fmha --use_gemm_plugin float16 --use_weight_only --weight_only_precision int4 --paged_kv_cache --use_inflight_batching --int8_kv_cache --output_dir ./tmp/llama/70B/trt_engines/fp16/1-gpu/

./gptSessionBenchmark --model llama --engine_dir /code/tensorrt_llm/models/tmp/llama/70B/trt_engines/fp16/1-gpu/ --batch_size "8" --input_output_len "1024,1"
```

I don't know what is happening, and I suspect that multiplying all-zero matrices could significantly skew the measured performance.

byshiue commented 10 months ago

Could you share how you print the activation values?

leizhao1234 commented 10 months ago
(screenshot: printf output of the activation values, 2024-01-10 16:51)
byshiue commented 10 months ago

I am afraid you are printing a half-precision number with a float format specifier in printf. Please cast the values to float before printing.
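A minimal sketch of the pitfall byshiue describes (the kernel and buffer names here are hypothetical, not from the issue): in CUDA, passing a `__half` directly to `printf("%f", ...)` goes through varargs, which expects a double, so the 2-byte value is misread and can come out as 0.000000. Converting with `__half2float` first gives the real value.

```cpp
#include <cuda_fp16.h>
#include <cstdio>

// Hypothetical debug kernel: dump the first n activation values.
__global__ void dumpActivations(const __half* act, int n) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        for (int i = 0; i < n; ++i) {
            // Wrong: printf("%f\n", act[i]); -- %f expects a double,
            // so the 2-byte __half is misread and can print 0.000000.
            // Right: convert to float, which printf promotes to double.
            printf("act[%d] = %f\n", i, __half2float(act[i]));
        }
    }
}
```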

nv-guomingz commented 1 day ago

Hi @leizhao1234, do you still have any further issues or questions? If not, we'll close this issue soon.