NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
8.57k stars 972 forks

[Feature Request] support YaRN request #792

Open kkr37 opened 10 months ago

kkr37 commented 10 months ago

Feature request Nous Research and EleutherAI have released YaRN-extended models in two versions, with context sizes of 64k and 128k tokens. These models use rotary (RoFormer-style) position embeddings, distinguishing them from GPT-NeoX- and GPT-J-style embeddings. They are built on LLaMA 2, so they are largely compatible with existing LLaMA support, with some minor adjustments required for full support.

Motivation The YaRN models' longer context length (up to 128k tokens) is highly valuable for tasks involving extensive context, compared to the 4096-token context of the LLaMA 2 base model.

Other YaRN paper: "YaRN: Efficient Context Window Extension of Large Language Models". YaRN code: YaRN GitHub repository.
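For context on what supporting YaRN would involve: the method rescales the rotary (RoPE) inverse frequencies using an "NTK-by-parts" blend (high-frequency dimensions are kept as-is, low-frequency ones are divided by the context-extension scale factor, with a linear ramp in between) plus a temperature on the attention logits. Below is a minimal NumPy sketch of that computation, assuming the defaults from the paper (`beta_fast=32`, `beta_slow=1`, original context 4096); the function names are illustrative, not TensorRT-LLM API.

```python
import math
import numpy as np

def yarn_scaled_inv_freq(dim, base=10000.0, scale=16.0,
                         orig_ctx=4096, beta_fast=32.0, beta_slow=1.0):
    """Sketch of YaRN 'NTK-by-parts' RoPE frequency blending."""
    # Standard RoPE inverse frequencies (one per pair of head dims).
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    # Number of full rotations each frequency completes over the
    # original training context window.
    rotations = orig_ctx * inv_freq / (2.0 * math.pi)
    # Linear ramp: 1 where rotations >= beta_fast (keep the original
    # frequency), 0 where rotations <= beta_slow (interpolate by 1/scale).
    ramp = np.clip((rotations - beta_slow) / (beta_fast - beta_slow), 0.0, 1.0)
    # Blend interpolated and original frequencies.
    return inv_freq / scale * (1.0 - ramp) + inv_freq * ramp

def yarn_attention_factor(scale):
    """Temperature applied to attention logits for a given context scale."""
    return 0.1 * math.log(scale) + 1.0
```

In practice the scaled `inv_freq` replaces the standard RoPE table when building the engine, and the attention factor is typically folded into the query/key scaling; the sketch above only shows the math, not how it would be wired into TRT-LLM's attention plugin.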

jesonxiang commented 3 weeks ago

Is this not supported yet?

AdamzNV commented 5 days ago

As more and more new models enter the market, we have prepared comprehensive instructions for TRT-LLM developers on adding support for new models of interest. We encourage our community developers to expand the range of supported models, fostering an open ecosystem with rapid iteration.

Please try following these instructions and let us know if you encounter any issues during the adaptation process. We greatly appreciate your dedication.