NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Run inference / do a forward pass on only certain segments of an LLM during inference #1866

avianion commented 4 days ago

Many research papers add an additional lm_head or decoder_layer to an LLM.

What is the process, in the C++ or PyTorch runtime, for selectively running a forward pass during inference on only a single layer or head of the model, as is done, for example, in Medusa decoding?
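
To illustrate the pattern I mean, here is a minimal PyTorch sketch of a Medusa-style setup: the base model runs once to produce hidden states, and then only the extra heads are forwarded on those cached states. The names (`MedusaHead`, `draft_tokens`, `num_draft_heads`) are illustrative, and `base_model` is assumed to be a Hugging Face-style causal LM that can return hidden states; none of this is TensorRT-LLM API.

```python
import torch
import torch.nn as nn

class MedusaHead(nn.Module):
    """One extra lm_head: a small residual block followed by a vocab projection."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Forward pass through this head only, not through the full model.
        return self.lm_head(hidden_states + torch.relu(self.proj(hidden_states)))

@torch.no_grad()
def draft_tokens(base_model, medusa_heads, input_ids):
    # One forward pass through the frozen base model to obtain hidden states
    # (assumes a Hugging Face-style model that accepts output_hidden_states).
    outputs = base_model(input_ids, output_hidden_states=True)
    last_hidden = outputs.hidden_states[-1][:, -1, :]  # (batch, hidden_size)
    # Selectively run only the extra heads on the cached hidden state;
    # each head proposes one draft token.
    return [head(last_hidden).argmax(dim=-1) for head in medusa_heads]
```

The question is how to express this selective execution (running only `medusa_heads`, not the whole engine) in the TensorRT-LLM C++ or PyTorch runtime.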

QiJune commented 2 days ago

Hi @avianion, could you please provide some reference code (in PyTorch) showing the forward pass? Thanks