-
**Describe the solution you'd like**
With the recent [TensorRT-LLM support for Whisper](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/whisper), and now that PyTriton supports TensorRT-LLM…
-
### 🚀 The feature, motivation and pitch
There is a new DP sharding strategy that is more flexible and general; see https://arxiv.org/abs/2311.00257, AMSP: Reducing Communication Overhead o…
-
### Summary of the Enhanced LLM Inference System
**Objective**: To create a robust, transparent, and efficient system for large language model (LLM) inference using CUDA, ensuring reproducibility, qu…
-
Hello team,
We typically use `gather_all_token_logits` to collect the logit tensors for post-processing. Especially for large vocabulary sizes (128,000), this can require a lot of GPU memory. For ex…
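As a back-of-the-envelope illustration of why gathering all token logits is memory-hungry (a sketch, not tied to any specific TensorRT-LLM API; the batch size and sequence length below are hypothetical), the footprint scales as batch × sequence length × vocabulary size × bytes per element:

```python
def logits_memory_gib(batch_size: int, seq_len: int,
                      vocab_size: int, bytes_per_elem: int = 2) -> float:
    """Estimate memory (GiB) of a [batch, seq_len, vocab] logits tensor.

    bytes_per_elem: 2 for fp16/bf16, 4 for fp32.
    """
    return batch_size * seq_len * vocab_size * bytes_per_elem / (1024 ** 3)

# e.g. batch 8, 4096 tokens, a 128,000-entry vocabulary, fp16 logits:
print(logits_memory_gib(8, 4096, 128_000))  # → 7.8125 (GiB) for this one tensor
```

Even at fp16, a single gathered logits tensor at these (assumed) sizes is several GiB, which is why returning only top-k logits or post-processing on the fly is often preferable.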
-
**Is your feature request related to a problem? Please describe.**
I am often frustrated by the limitation of being able to use only a single QueryTransformer at a time. This constraint makes it ch…
-
Consistency Large Language Models: A Family of Efficient Parallel Decoders
https://hao-ai-lab.github.io/blogs/cllm/
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
https://arxiv.or…
-
-
I just wrote a layout engine library with the help of an LLM; for simple HTML it is much more efficient than wkhtml: https://github.com/html2any/layout. You can try it.
It supports flex layout, CSS, page sp…
-
**Why**
To streamline user interactions with the large language model (LLM) in the chat application, users will be able to quickly select from a variety of predefined prompt templates. This featur…
-
**What would you like to be added/modified:**
1. Build a collaborative code-intelligence agent alignment dataset for LLMs:
- The dataset should include behavioral trajectories, feedback, and i…