NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

--use_fp8_context_fmha enable option broken for tensorrtllm versions 0.13.0, 0.14.0 #2447

Closed avianion closed 1 week ago

avianion commented 1 week ago

Hello, I am trying to use FP8 context FMHA to benchmark the performance of the attention mechanism in FP8, but I am unable to get it to work.

First I quantized Llama 3.1 8B to FP8 using the PTQ script in the repo, as follows, and then I ran the build command.

The commands below work fine if you REMOVE this line: --use_fp8_context_fmha enable \ (for reference, the working variant is shown right after the failing build command).

I have tried multiple combinations of build options, but --use_fp8_context_fmha enable \ does not seem to work.

huggingface-cli download meta-llama/Llama-3.1-8B-Instruct --local-dir ./llama8b --exclude "*original*"

python3 /root/TensorRT-LLM/examples/quantization/quantize.py \
    --model_dir /root/llama8b \
    --dtype float16 \
    --qformat fp8 \
    --output_dir ./llama-8b-ckpt-fp8 \
    --calib_size 512

trtllm-build --checkpoint_dir ./llama-8b-ckpt-fp8 \
             --output_dir ./llama8b-engine \
             --gemm_plugin float16 \
             --gpt_attention_plugin float16 \
             --use_paged_context_fmha enable \
             --context_fmha enable \
             --use_fp8_context_fmha enable \
             --max_num_tokens 8192 \
             --max_seq_len 8192 \
             --max_batch_size 4
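
For reference, the working variant of the same build command, identical except that the --use_fp8_context_fmha line is removed:

trtllm-build --checkpoint_dir ./llama-8b-ckpt-fp8 \
             --output_dir ./llama8b-engine \
             --gemm_plugin float16 \
             --gpt_attention_plugin float16 \
             --use_paged_context_fmha enable \
             --context_fmha enable \
             --max_num_tokens 8192 \
             --max_seq_len 8192 \
             --max_batch_size 4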

I am testing a basic example like this to see if use_fp8_context_fmha works.

So far I have tried versions 0.13.0 and 0.14.0, and I receive the error below when running the summarization test (and also with run.py; a sketch of that invocation follows the log).

python3 /home/TensorRT-LLM/examples/summarize.py --engine_dir ./llama8b-engine \
                       --hf_model_dir  /home/llama8b \
                       --tokenizer_dir  /home/llama8b \
                       --test_trt_llm \
                       --data_type fp16 \
                       --batch_size 1 \
                       --output_len 400 \
                       --max_tokens_in_paged_kv_cache 100000 \
                       --kv_cache_free_gpu_memory_fraction 0.5
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.76 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 5.58 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 93.12 GiB, available: 82.98 GiB
[TensorRT-LLM][WARNING] Both freeGpuMemoryFraction (aka kv_cache_free_gpu_mem_fraction) and maxTokens (aka max_tokens_in_paged_kv_cache) are set (to 0.500000 and 100000, respectively). The smaller value will be used.
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 1563
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 128
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 12.21 GiB for max tokens in paged KV cache (100032).
[TensorRT-LLM][INFO] Enable MPI KV cache transport.
[11/14/2024-14:20:40] [TRT-LLM] [I] Load engine takes: 2.463017225265503 sec
[h100:15728] *** Process received signal ***
[h100:15728] Signal: Segmentation fault (11)
[h100:15728] Signal code: Address not mapped (1)
[h100:15728] Failing at address: (nil)
[h100:15728] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x76f0b0642520]
[h100:15728] *** End of error message ***
Segmentation fault (core dumped)
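
The same crash happens with run.py. The invocation is roughly the following (a sketch; the prompt and output length are arbitrary, paths are from my setup):

python3 /home/TensorRT-LLM/examples/run.py --engine_dir ./llama8b-engine \
                       --tokenizer_dir /home/llama8b \
                       --input_text "What is the capital of France?" \
                       --max_output_len 100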

Does this mean --use_fp8_context_fmha enable is broken and doesn't work at all?

Can someone help me fix this?
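
For anyone trying to reproduce: one way to check whether the flag actually made it into the built engine is to inspect config.json in the engine directory, e.g. (path assumed from the build above):

python3 -m json.tool ./llama8b-engine/config.json | grep -i context_fmha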

Thanks

avianion commented 1 week ago

@hello-11 @AdamzNV @ncomly-nvidia @pcastonguay any idea?