NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
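The "easy-to-use Python API" above refers to the high-level entry point in recent releases. A minimal sketch, assuming a version that ships `tensorrt_llm.LLM` (the model name and sampling settings here are placeholders):

```python
# Minimal LLM API sketch: the engine build happens inside the constructor,
# which converts and compiles the checkpoint for the local GPU.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # placeholder model

params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
for output in llm.generate(["What is TensorRT-LLM?"], params):
    print(output.outputs[0].text)
```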

TensorRT-LLM Requests #632

ncomly-nvidia commented 6 months ago

Hi all, this issue will track the feature requests you've made for TensorRT-LLM and provide a place to see what TRT-LLM is currently working on.

Last update: Jan 14th, 2024

🚀 = in development

Models

- Decoder Only
- Encoder / Encoder-Decoder
- Multi-Modal
- Other

Features & Optimizations

- KV Cache
- Quantization
- Sampling

Workflow

Front-ends

Integrations

Usage / Installation

Platform Support

teis-e commented 2 months ago

Please add CohereAI!!

CohereForAI/c4ai-command-r-plus

EwoutH commented 2 months ago

Llama 3 would be great (both 8B and 70B): https://github.com/NVIDIA/TensorRT-LLM/issues/1470

Maybe quantized to 8 or even 4 bit.
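Weight-only quantization should cover that. A hedged sketch, assuming a release that exposes `QuantConfig`/`QuantAlgo` through the LLM API (import paths and algorithm names vary by version; the model name is a placeholder):

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig          # import path varies by release
from tensorrt_llm.quantization import QuantAlgo

# W8A16 / W4A16 store weights in int8 / int4 and compute in fp16,
# roughly halving or quartering the weight footprint vs. fp16.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",     # placeholder model
    quant_config=QuantConfig(quant_algo=QuantAlgo.W4A16),
)
```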

StephennFernandes commented 2 months ago

Currently, Llama 3 throws a bunch of errors when converting to TensorRT-LLM.

Any idea about the status of Llama 3 support?

EwoutH commented 2 months ago

Phi-3-mini should be amazing! Such a small 3.8B model could run quantized on many GPUs with as little as 4 GB of VRAM.
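A quick back-of-the-envelope check supports that estimate (weights only; the KV cache and activations need extra headroom):

```python
# Weight memory for a 3.8B-parameter model at different precisions.
params = 3.8e9
for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: {params * bits / 8 / 2**30:.1f} GiB")
# fp16: 7.1 GiB, int8: 3.5 GiB, int4: 1.8 GiB -> a 4-bit (or even 8-bit)
# Phi-3-mini plausibly fits in ~4 GB of VRAM.
```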

oscarbg commented 1 month ago

+1 for Phi-3

user-0a commented 1 month ago

+1 for Command R Plus!

CohereForAI/c4ai-command-r-plus

khan-yin commented 1 week ago

hello @ncomly-nvidia, I am a student interested in the project! I want to ask: are there any good-first-issue feature requests under Features & Optimizations recently? 🤣

chenpinganan commented 19 hours ago

+1 for OpenBMB/MiniCPM-V-2