### System Info
Hi everyone, when trying to update from Llama 3 8B Instruct to Llama 3.1 8B Instruct, I noticed a crash:
```bash
Args {
model_id: "meta-llama/Meta-Llama-3.1-8B-Instruct",
…
```
-
https://github.com/huggingface/text-generation-inference
TGI's main features are quite awesome. It would be nice to add it as an additional inference implementation.
-
### Feature request
Would it be possible to build/publish an arm64 container image for the text-generation-inference? I would like to be able to run it on a NVIDIA GH200 which is an arm64-based syst…
-
### System Info
```bash
gpu=0
num_gpus=1
model=meta-llama/Meta-Llama-3.1-8B-Instruct
docker run -d \
--gpus "\"device=$gpu\"" \
--shm-size 16g \
-e HUGGING_FACE_HUB_TOKEN=$token \
-p 8082:80 …
```
-
### Your current environment
0.6.3.post1
### 4 🐛 generation scenarios
There are at least 4 generation use cases in vLLM:
1. offline generate
2. offline chat
3. online completion (similar …
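For the online cases, vLLM exposes an OpenAI-compatible HTTP server. A minimal sketch of hitting the completion endpoint, where the model name and port are assumptions:

```shell
# Hypothetical sketch: launch the OpenAI-compatible server, then query
# the /v1/completions endpoint. Model id and port are assumptions.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct --port 8000

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "prompt": "Hello", "max_tokens": 16}'
```

The chat case works analogously against `/v1/chat/completions`, which applies the model's chat template server-side.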
-
### System Info
I tried the following systems, both with the same exception:
- ghcr.io/huggingface/text-generation-inference:sha-6aebf44 locally with docker on nvidia rtx 3600
- ghcr.io/huggingface…
-
## Version
deepspeed: `0.13.4`
transformers: `4.38.1`
Python: `3.10`
Pytorch: `2.1.2+cu121`
CUDA: 12.1
## Error in Example (To reproduce)
Simply run this script:
https://github.com/micr…
-
The command is as follows:
```
docker run --gpus 'all' --shm-size 1g -p 9090:80 -v $HOME/codeshell/CodeShell-7B-Chat:/data \
--env LOG_LEVEL="info,text_generation_router=debug" \
ghcr.nju.edu.cn/hugging…
```
-
I attempted to serve the original base model of **Llama 3.1** in 4-bit, both with and without setting `load_in_4bit`. Below are my observations.
When `load_in_4bit = True`:
The model throws the f…
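For reference, the bare `load_in_4bit` flag has largely been superseded in transformers by passing a `BitsAndBytesConfig`. A minimal sketch, where the model id, compute dtype, and quant type are assumptions:

```python
# Hypothetical sketch: load a Llama 3.1 base model in 4-bit via
# BitsAndBytesConfig instead of the bare load_in_4bit flag.
# Model id, dtype, and quant type below are assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # enable 4-bit weight quantization
    bnb_4bit_compute_dtype=torch.bfloat16,   # dtype used for matmuls
    bnb_4bit_quant_type="nf4",               # NormalFloat4 quantization
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
```

Setting `bnb_4bit_compute_dtype` explicitly matters here, since the default fp32 compute path is noticeably slower on recent GPUs.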
-
### System Info
```
text-generation-launcher 2.1.0
```
### Information
- [X] Docker
- [X] The CLI directly
### Tasks
- [ ] An officially supported command
- [ ] My own modifications
### Reprod…