TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.76 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 5.58 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 93.12 GiB, available: 82.98 GiB
[TensorRT-LLM][WARNING] Both freeGpuMemoryFraction (aka kv_cache_free_gpu_mem_fraction) and maxTokens (aka max_tokens_in_paged_kv_cache) are set (to 0.500000 and 100000, respectively). The smaller value will be used.
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 1563
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 128
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 12.21 GiB for max tokens in paged KV cache (100032).
[TensorRT-LLM][INFO] Enable MPI KV cache transport.
[11/14/2024-14:20:40] [TRT-LLM] [I] Load engine takes: 2.463017225265503 sec
[h100:15728] *** Process received signal ***
[h100:15728] Signal: Segmentation fault (11)
[h100:15728] Signal code: Address not mapped (1)
[h100:15728] Failing at address: (nil)
[h100:15728] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x76f0b0642520]
[h100:15728] *** End of error message ***
Segmentation fault (core dumped)
Hello, I am trying to use FP8 context FMHA to benchmark the performance of the attention mechanism in FP8, but I am unable to get it to work.
First I quantized Llama 3.1 8B to FP8 using the PTQ script in the repo, and then ran the build script; a representative command sequence is sketched below.
The build works fine if I remove the line --use_fp8_context_fmha enable \.
I have tried multiple combinations of build flags, but --use_fp8_context_fmha enable \ never works.
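For reference, the flow I am using looks roughly like this, assuming the standard examples/quantization/quantize.py PTQ path; the model/output paths, calibration size, and the batch/length flags are illustrative placeholders rather than my exact values:

# FP8 post-training quantization (paths and --calib_size are placeholders)
python examples/quantization/quantize.py \
    --model_dir ./Meta-Llama-3.1-8B \
    --dtype bfloat16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --calib_size 512 \
    --output_dir ./llama-3.1-8b-fp8-ckpt

# Engine build; dropping the --use_fp8_context_fmha line makes this succeed
trtllm-build \
    --checkpoint_dir ./llama-3.1-8b-fp8-ckpt \
    --output_dir ./llama-3.1-8b-fp8-engine \
    --gemm_plugin auto \
    --use_fp8_context_fmha enable \
    --max_batch_size 8 \
    --max_input_len 2048 \
    --max_seq_len 4096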
I am testing a basic example to see whether use_fp8_context_fmha works, along the lines of the commands below.
So far I have tried versions 0.13.0 and 0.14.0, and I receive this error when running the summarization test (and also run.py).
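The tests themselves are just the stock example scripts, invoked roughly like this (engine, tokenizer, and HF model paths are placeholders):

# Simple generation test
python3 examples/run.py \
    --engine_dir ./llama-3.1-8b-fp8-engine \
    --tokenizer_dir ./Meta-Llama-3.1-8B \
    --max_output_len 64 \
    --input_text "What is the capital of France?"

# Summarization test
python3 examples/summarize.py \
    --test_trt_llm \
    --engine_dir ./llama-3.1-8b-fp8-engine \
    --hf_model_dir ./Meta-Llama-3.1-8B \
    --data_type fp16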
Does this mean --use_fp8_context_fmha enable is broken and doesn't work at all?
Can someone help me fix this?
Thanks