NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
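For orientation, a minimal sketch of that Python API using the high-level `LLM` entry point from the project docs (the model name and option set here are illustrative, not taken from this issue):

```python
from tensorrt_llm import LLM, SamplingParams

# Build a TensorRT engine for the model, then run inference with it.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
outputs = llm.generate(["The capital of France is"],
                       SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```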
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Sparsity fp8 Llama-3-8b on RTX4090 has no speed improvement against dense one #1913

Closed: lishicheng1996 closed 2 months ago

lishicheng1996 commented 3 months ago

Hi! I tried sparse FP8 Llama-3-8B on an RTX 4090, but I don't see a performance improvement. I checked the trt-llm build log, which shows that although some layers are eligible to use sparse tactics, those tactics are not chosen. (Build log screenshot attached.)

I see a sparsity example for the H100 in the benchmarks, so I'm wondering why it doesn't work on the 4090. May I also ask whether 4090 support is on the roadmap?
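For reference, a build along these lines produces a sparse FP8 engine. This is a sketch, not the exact commands from this issue: paths are placeholders, and the flags follow the FP8 quantization and weight-sparsity examples shipped with the repo, so they may differ across TensorRT-LLM versions:

```bash
# Quantize the Hugging Face checkpoint to FP8 (examples/quantization),
# then build the engine with weight sparsity enabled.
python examples/quantization/quantize.py \
    --model_dir ./Meta-Llama-3-8B \
    --qformat fp8 \
    --output_dir ./llama3-8b-fp8-ckpt

trtllm-build \
    --checkpoint_dir ./llama3-8b-fp8-ckpt \
    --weight_sparsity \
    --output_dir ./llama3-8b-fp8-sparse-engine
```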

Thanks!

QiJune commented 3 months ago

@Tracin Could you please have a look? Thanks

Tracin commented 2 months ago

@lishicheng1996 Basically, the sparse kernels are less efficient than the dense ones on this GPU, so they are not chosen. Can you verify it on an H100? I do not see any plan for FP8 sparsity on the 4090.
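For readers unfamiliar with tactic selection: during the engine build, TensorRT times every eligible kernel ("tactic") for a layer and keeps the fastest one. A minimal sketch of that selection logic follows; this is not the actual builder code, and all names are illustrative:

```python
import time

def pick_tactic(run_args, candidate_kernels, repeats=10):
    """Time each candidate kernel and return the fastest one."""
    best_kernel, best_time = None, float("inf")
    for kernel in candidate_kernels:  # e.g. [dense_fp8_gemm, sparse_fp8_gemm]
        start = time.perf_counter()
        for _ in range(repeats):
            kernel(*run_args)
        elapsed = (time.perf_counter() - start) / repeats
        if elapsed < best_time:
            best_kernel, best_time = kernel, elapsed
    # On Ada (RTX 4090) the sparse FP8 GEMM can lose this race to the dense
    # GEMM, which is why the build log reports eligible-but-unchosen sparse
    # tactics.
    return best_kernel
```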

lishicheng1996 commented 2 months ago

> @lishicheng1996 Basically, the sparse kernels are less efficient than the dense ones on this GPU, so they are not chosen. Can you verify it on an H100? I do not see any plan for FP8 sparsity on the 4090.

Thank you!