-
### 🚀 The feature, motivation and pitch
It apparently outperforms Mixtral at a smaller size, with a longer context length and multilingual support.
https://github.com/mistralai/mistral-inference/#deployment for Docke…
-
### 🚀 The feature, motivation and pitch
**TLDR:** We would like an option we can enable to continuously stream the `UsageInfo` when using the streaming completions API. This solves a number of "acc…
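A minimal sketch of what this could look like from the client side, assuming an OpenAI-compatible vLLM server and a per-chunk `usage` field; the `stream_options` flag and endpoint below are illustrative assumptions, not a confirmed vLLM interface:

```python
import json
import requests

# Hypothetical sketch: if the server attached UsageInfo to every streamed
# chunk (not only the final one), a client could track token counts live.
resp = requests.post(
    "http://localhost:8000/v1/completions",  # assumed OpenAI-compatible server
    json={
        "model": "my-model",  # placeholder model name
        "prompt": "Hello",
        "stream": True,
        # hypothetical option enabling per-chunk usage reporting
        "stream_options": {"include_usage": True},
    },
    stream=True,
)

for line in resp.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":
        break
    chunk = json.loads(payload)
    # Under this proposal, every chunk would carry a running UsageInfo.
    usage = chunk.get("usage")
    if usage is not None:
        print("tokens so far:", usage["completion_tokens"])
```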
-
### 🚀 The feature, motivation and pitch
Speculative decoding allows emitting multiple tokens per sequence by speculating future tokens, scoring their likelihood using the LLM, and then accepting each…
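For reference, a minimal sketch of the standard acceptance rule from the speculative-sampling papers (Leviathan et al. / Chen et al.); the tensor shapes and function name are illustrative, not vLLM's internals:

```python
import torch

def accept_speculative(draft_tokens, draft_probs, target_probs):
    """Accept each drafted token with probability min(1, p_target / p_draft),
    stopping at the first rejection.

    draft_tokens: (k,) proposed token ids
    draft_probs:  (k, vocab) proposal distribution per position
    target_probs: (k, vocab) target-model distribution per position
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p = target_probs[i, tok]
        q = draft_probs[i, tok]
        if torch.rand(()) < torch.clamp(p / q, max=1.0):
            accepted.append(int(tok))
        else:
            # On rejection, resample from the normalized residual
            # max(0, p_target - p_draft), which keeps the output
            # distribution identical to the target model's.
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0)
            accepted.append(int(torch.multinomial(residual / residual.sum(), 1)))
            break
    return accepted
```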
-
### Your current environment
I don't know how to run it inside Docker.
### 🐛 Describe the bug
Simply run the following command:
`docker run --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.ca…
-
## Overview
Speculative decoding allows a speedup for memory-bound LLMs by using a fast proposal method to propose tokens that are verified in a single forward pass by the larger LLM. Papers report 2…
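As a back-of-envelope check on where such speedups come from, the expected number of tokens emitted per target-model forward pass follows the closed form from Leviathan et al. (2023), given a per-token acceptance rate `alpha` and draft length `k`:

```python
# Expected tokens produced per target-model verification pass:
#   (1 - alpha**(k + 1)) / (1 - alpha)     (Leviathan et al., 2023)
def expected_tokens_per_step(alpha: float, k: int) -> float:
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# e.g. alpha = 0.8 with k = 4 drafted tokens gives ~3.36 tokens per pass,
# which is roughly where 2-3x speedups for memory-bound decoding come from.
print(expected_tokens_per_step(0.8, 4))  # 3.3616
```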
-
### 🚀 The feature, motivation and pitch
Please consider adding support for GPTQ- and AWQ-quantized Mixtral models.
I guess that after #4012 it's technically possible.
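A sketch of what usage could look like, assuming vLLM's existing `quantization` flag would extend to MoE models once support lands; the model repo below is a placeholder, not a tested checkpoint:

```python
from vllm import LLM, SamplingParams

# Sketch only: assumes Mixtral MoE layers gain AWQ support.
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",  # placeholder AWQ repo
    quantization="awq",
    tensor_parallel_size=2,
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```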
### Alternatives
_No r…
-
The code is as follows:
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 131072, 1
model_name = "/models/codegeex4-all-9b"
tokenizer = AutoTokenizer.from_pr…
```
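A guess at how the truncated snippet typically continues, using standard vLLM and transformers APIs; the prompt and sampling settings are assumptions, not from the original report:

```python
# Sketch of the usual continuation; settings below are assumptions.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "write a quicksort"}],
    tokenize=False,
    add_generation_prompt=True,
)
outputs = llm.generate([prompt], SamplingParams(temperature=0.2, max_tokens=256))
print(outputs[0].outputs[0].text)
```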
-
## 🚀 Feature
Please add Medusa decoding to mlc-llm in C++; we urgently need it to speed up LLM decoding on mobile devices.
Reference: https://github.com/FasterDecoding/Medusa/tree/main
Medusa adds …
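To illustrate the idea (sketched in Python for brevity, though the request is for C++): Medusa attaches a few small extra heads to the base model's last hidden state, where head *i* drafts the token *i*+1 positions ahead, and the drafts are then verified by the base model in a single forward pass, as in speculative decoding. The module below is a simplified sketch, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class MedusaHeads(nn.Module):
    """Sketch: N lightweight heads over the final hidden state,
    one per lookahead position (the paper uses residual blocks;
    simplified here)."""

    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Linear(hidden_size, hidden_size),
                    nn.SiLU(),
                    nn.Linear(hidden_size, vocab_size),
                )
                for _ in range(num_heads)
            ]
        )

    def forward(self, last_hidden: torch.Tensor) -> list[torch.Tensor]:
        # last_hidden: (batch, hidden) -> one logits tensor per lookahead step
        return [head(last_hidden) for head in self.heads]
```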
-
### Your current environment
```text
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
…
-
## Function Calling
- Frontend
- Add `tools` argument in `sgl.gen`. See also guidance [tools](https://github.com/guidance-ai/guidance/blob/d1bbe1c698cbb201f89556d71193993e78c0686b/README.md?plai…
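A hypothetical sketch of what the proposed `tools` argument could look like; the exact API shape here is an assumption modeled on guidance's `tools=`, not SGLang's actual interface:

```python
import sgl

def get_weather(city: str) -> str:
    """Stub tool for illustration only."""
    return f"Sunny in {city}"

@sgl.function
def assistant(s, question):
    s += sgl.user(question)
    # Proposed: pass callable tools so the runtime can emit and execute
    # tool calls during generation, as guidance does.
    s += sgl.assistant(sgl.gen("answer", tools=[get_weather]))
```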