NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Why skip top_p when top_k > 0? #1820

Closed. ZhiqiJiang closed this issue 3 months ago.

ZhiqiJiang commented 3 months ago

https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/layers/topPSamplingLayer.cu:60

skipDecode[batchSlot] = k > 0;

As this line shows, top_p sampling is skipped whenever top_k > 0. Why doesn't top_p sampling run after top_k sampling?

ZhiqiJiang commented 3 months ago

I have found the answer in https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/layers/topKSamplingLayer.cu:55.

if (k > 0 && p == 0.0f)
{
    // This case corresponds to the old topk sampling, which is equivalent to
    // the old topk_topp sampling with topp=1.0f. TopKSamplingLayer and
    // TopKTopPSamplingLayer are now merged by TopKSamplingLayer. Thus, we
    // replace the case topk>0 and topp=0.0f by topk>0 and topp=1.0f for the
    // compatibility.
    p = 1.0f;
}
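In other words, when k > 0 the merged top-k layer handles both filters itself: first keep the k most probable tokens, then apply the top-p (nucleus) cutoff within that set, so the separate top-p layer can be skipped. The old "top-k only" case (k > 0, p == 0.0f) is mapped to p = 1.0f because a nucleus threshold of 1.0 keeps the whole top-k set. A minimal host-side sketch of that combined filter (an illustration, not TensorRT-LLM's kernel code; the function name and shape are hypothetical):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical illustration of merged top-k/top-p candidate selection.
// Returns the token ids that survive both filters, in descending
// probability order.
std::vector<int> topKTopPCandidates(const std::vector<float>& probs, int k, float p)
{
    // Sort token ids by descending probability.
    std::vector<int> ids(probs.size());
    for (size_t i = 0; i < ids.size(); ++i)
        ids[i] = static_cast<int>(i);
    std::sort(ids.begin(), ids.end(),
              [&](int a, int b) { return probs[a] > probs[b]; });

    // Step 1: restrict to the top-k tokens.
    if (k > 0 && k < static_cast<int>(ids.size()))
        ids.resize(k);

    // Step 2: within the top-k set, keep the smallest prefix whose
    // cumulative probability reaches p (nucleus cutoff). With p = 1.0f
    // this keeps the entire top-k set, matching the old top-k-only path.
    float cum = 0.0f;
    size_t keep = 0;
    for (; keep < ids.size(); ++keep)
    {
        cum += probs[ids[keep]];
        if (cum >= p)
        {
            ++keep;
            break;
        }
    }
    ids.resize(keep);
    return ids;
}
```

For example, with probabilities {0.5, 0.3, 0.15, 0.05}, k = 3 and p = 0.7 keeps tokens 0 and 1, while k = 3 and p = 1.0 keeps all three top-k tokens.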