NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

How to rewrite this kernel without referencing the implementation of cutlass #2396

Closed zhink closed 2 weeks ago

zhink commented 3 weeks ago

How can I rewrite the kernel in cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/cudaCoreGemm.cu without referencing the CUTLASS implementation?

hello-11 commented 3 weeks ago

@zhink Thanks for your interest in rewriting the kernel. Could you provide more details on why you want to rewrite it and what function you want to implement?

zhink commented 3 weeks ago

Because it is not easy for my project to reference CUTLASS 3. The input is FP8 (e4m3 or e5m2) and the output is BF16 or FP16. If CUTLASS is not necessary, please provide guidance on how to rewrite it.

hello-11 commented 3 weeks ago

@zhink I think this kernel does not use cutlass.

zhink commented 3 weeks ago

But it does use cutlass::NumericArrayConverter and Converter::convert.

chuangz0 commented 3 weeks ago

You can implement NumericArrayConverter yourself with reference to the CUTLASS implementation; it is not complicated.
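To illustrate what such a standalone replacement involves, here is a minimal sketch of decoding one FP8 E4M3 value to float without any CUTLASS dependency. This is not the TensorRT-LLM code: `fp8_e4m3_to_float` is a hypothetical helper, and it assumes the NVIDIA E4M3 encoding (exponent bias 7, no infinities, NaN encoded as all-ones exponent and mantissa). A device-side array converter would apply the same bit logic element-wise, ideally via hardware intrinsics such as `__nv_cvt_fp8_to_halfraw` rather than scalar arithmetic.

```cpp
#include <cstdint>
#include <cmath>

// Hypothetical scalar decode of FP8 E4M3 (NVIDIA variant):
// 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
// E4M3 has no infinities; only S.1111.111 is NaN, so the
// largest finite value is 0x7E = 448.
float fp8_e4m3_to_float(uint8_t v) {
    uint32_t sign = v >> 7;
    uint32_t exp  = (v >> 3) & 0xF;
    uint32_t man  = v & 0x7;
    float result;
    if (exp == 0xF && man == 0x7) {
        result = NAN;                          // reserved NaN pattern
    } else if (exp == 0) {
        result = std::ldexp((float)man, -9);   // subnormal: man/8 * 2^(1-7)
    } else {
        result = std::ldexp(1.0f + man / 8.0f, (int)exp - 7); // normal
    }
    return sign ? -result : result;
}
```

An array converter in the style of cutlass::NumericArrayConverter would simply loop this over N packed bytes and write the results into an output fragment; the per-element bit manipulation above is the only CUTLASS-specific piece being replaced.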