NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

How to rewrite this kernel without referencing the implementation of cutlass #2396

Closed zhink closed 2 weeks ago

zhink commented 3 weeks ago

How can I rewrite the kernel in cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/cudaCoreGemm.cu without referencing the CUTLASS implementation?

hello-11 commented 3 weeks ago

@zhink Thanks for your interest in rewriting the kernel. Could you provide more details on why you want to rewrite it and what function you want to implement?

zhink commented 3 weeks ago

Because it is not easy for my project to reference CUTLASS 3. The input is FP8 (e4m3 or e5m2) and the output is BF16 or FP16. If CUTLASS is not necessary, please provide guidance on how to rewrite it.

hello-11 commented 3 weeks ago

@zhink I think this kernel does not use cutlass.

zhink commented 3 weeks ago

But it does use cutlass::NumericArrayConverter and Converter::convert.

chuangz0 commented 3 weeks ago

You can implement NumericArrayConverter yourself with reference to the CUTLASS implementation; it is not complicated.
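To illustrate what such a standalone replacement involves, here is a minimal sketch of decoding one FP8 E4M3 value to float without any CUTLASS dependency. This is not the TensorRT-LLM code: `fp8_e4m3_to_float` is a hypothetical helper, and it assumes the NVIDIA E4M3 encoding (exponent bias 7, no infinities, NaN encoded as all-ones exponent and mantissa). A device-side array converter would apply the same bit logic element-wise, ideally via hardware intrinsics such as `__nv_cvt_fp8_to_halfraw` rather than scalar arithmetic.

```cpp
#include <cstdint>
#include <cmath>

// Hypothetical scalar decode of FP8 E4M3 (NVIDIA variant):
// 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
// E4M3 has no infinities; only S.1111.111 is NaN, so the
// largest finite value is 0x7E = 448.
float fp8_e4m3_to_float(uint8_t v) {
    uint32_t sign = v >> 7;
    uint32_t exp  = (v >> 3) & 0xF;
    uint32_t man  = v & 0x7;
    float result;
    if (exp == 0xF && man == 0x7) {
        result = NAN;                          // reserved NaN pattern
    } else if (exp == 0) {
        result = std::ldexp((float)man, -9);   // subnormal: man/8 * 2^(1-7)
    } else {
        result = std::ldexp(1.0f + man / 8.0f, (int)exp - 7); // normal
    }
    return sign ? -result : result;
}
```

An array converter in the style of cutlass::NumericArrayConverter would simply loop this over N packed bytes and write the results into an output fragment; the per-element bit manipulation above is the only CUTLASS-specific piece being replaced.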