-
```
(text-generation-inference) root@C.10294313:~/tgi_test/text-generation-inference$ text-generation-launcher
2024-04-29T11:11:11.331114Z INFO text_generation_launcher: Args { model_id: "bigscie…
```
-
I'm guessing the prefix cache is stored in GPU VRAM. I'm wondering whether it's possible to allocate a percentage of system RAM to store the prefix cache, or would that generally be too slow? I.e. faster …
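As a rough sanity check on "too slow", the transfer-time arithmetic is easy to sketch in Python; the bandwidth figures below are assumptions (ballpark PCIe 4.0 x16 vs A100-class HBM), not measurements.
```python
# Back-of-envelope: moving a cached prefix's KV tensors from system RAM
# to the GPU vs reading them from HBM. All numbers are assumed values
# for illustration only.

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_param: int = 2) -> int:
    # 2x for the K and V tensors per layer; fp16 by default.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_param

# Assumed 7B-class model shape with a 2k-token cached prefix.
cache = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=2048)

PCIE_BPS = 25e9    # assumed effective PCIe 4.0 x16 host-to-device bandwidth
HBM_BPS = 1500e9   # assumed effective HBM read bandwidth

print(f"prefix KV cache:         {cache / 1e9:.2f} GB")
print(f"copy from RAM over PCIe: {cache / PCIE_BPS * 1e3:.1f} ms")
print(f"read from HBM:           {cache / HBM_BPS * 1e3:.2f} ms")
```
Under these assumptions the PCIe copy is roughly 60x slower than an HBM read, but tens of milliseconds can still be cheaper than recomputing a long prefix, so the trade-off depends on prompt length and batch size.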
-
- [x] I have read and agree to the [contributing guidelines](https://github.com/griptape-ai/griptape#contributing).
Hello, I'm trying to connect a locally hosted LLM to a prompt engine; we are …
-
I am seeing the error below about max_tokens when I run benchmark_serving.py against HuggingFace TGI. Is there anything else I should be doing?
I started the server with: `./launch_tgi_server.sh facebo…
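One thing worth checking (an assumption about the cause, since the full error is cut off above): TGI's native /generate API expects max_new_tokens inside a parameters object rather than an OpenAI-style top-level max_tokens. A minimal request against a locally launched server looks like this; the address and port are assumed from the launch script.
```python
import requests

# Minimal TGI /generate request; localhost:8000 is an assumed address
# for the server started by launch_tgi_server.sh.
resp = requests.post(
    "http://localhost:8000/generate",
    json={
        "inputs": "What is Deep Learning?",
        # TGI takes max_new_tokens here, not a top-level max_tokens.
        "parameters": {"max_new_tokens": 128},
    },
)
print(resp.json()["generated_text"])
```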
-
### System Info
I was trying to run CohereForAI/c4ai-command-r-v01 with these commands:
model=CohereForAI/c4ai-command-r-v01
volume=$PWD/data # share a volume with the Docker container to avoid …
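The docker run command itself is cut off above; for reference, a docker-SDK rendering of the standard TGI launch from the README would look roughly like this sketch (the :latest image tag and host port 8080 are assumptions).
```python
import os
import docker

# Python (docker SDK) equivalent of the TGI README's docker run command;
# adjust the image tag and port mapping to your setup.
client = docker.from_env()
container = client.containers.run(
    "ghcr.io/huggingface/text-generation-inference:latest",
    command=["--model-id", "CohereForAI/c4ai-command-r-v01"],
    device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],
    shm_size="1g",
    ports={"80/tcp": 8080},
    volumes={os.path.join(os.getcwd(), "data"): {"bind": "/data", "mode": "rw"}},
    detach=True,
)
print(container.logs(tail=10).decode())
```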
-
### Bug Report
PR: Adding yarn support
https://github.com/huggingface/text-generation-inference/pull/1099
The PR handles 'yarn' as a rope_scaling type:
`elif rope_scaling["type"] == "yarn":`
But, t…
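For context, a yarn-style rope_scaling block in a model's config.json typically has the shape sketched below (field names follow existing yarn checkpoints; the values are illustrative placeholders).
```python
# Shape of a "yarn" rope_scaling entry as it appears in config.json;
# values are placeholders, not from any specific model.
rope_scaling = {
    "type": "yarn",
    "factor": 16.0,                            # context-length multiplier
    "original_max_position_embeddings": 4096,  # pre-scaling context size
}

assert rope_scaling["type"] == "yarn"  # the branch quoted above would fire
```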
-
Now that we can load GPTQ files that haven't been quantized by TGI's quantization script, I thought I'd do a set of tests to see which formats work and which don't. I'm using https://huggingface.co/Th…
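For each repo, the knobs that usually decide loader compatibility live in quantize_config.json (bits, group_size, desc_act). A quick way to pull and compare them, with a hypothetical repo id standing in for the truncated link above:
```python
import json
from huggingface_hub import hf_hub_download

# Hypothetical repo id; the actual one is truncated in the post above.
repo_id = "TheBloke/some-model-GPTQ"

# GPTQ repos ship their quantization settings in quantize_config.json.
path = hf_hub_download(repo_id=repo_id, filename="quantize_config.json")
with open(path) as f:
    cfg = json.load(f)

# bits / group_size / desc_act are the fields that most often determine
# whether a given loader can handle the checkpoint.
print({k: cfg.get(k) for k in ("bits", "group_size", "desc_act")})
```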
-
### System Info
The 'details' field is missing from the /v1/chat/completions endpoint.
This works:
```
stream_url = "http://localhost:8000/generate_stream"
payload = {
    "inputs": prompt,
    "parameters": {
…
```
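For comparison, on the native endpoints details is requested inside parameters; a minimal non-streaming example against the same server (the address mirrors the snippet above):
```python
import requests

# Requesting token-level details from TGI's native /generate endpoint.
resp = requests.post(
    "http://localhost:8000/generate",
    json={
        "inputs": "What is Deep Learning?",
        "parameters": {"max_new_tokens": 32, "details": True},
    },
)
print(resp.json()["details"])
```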
-
I've been trying to deploy the new LLaVA-NeXT with SGLang on Modal, but I'm not sure why I'm getting "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tun…
-
### Describe the bug
I'm trying to run one of TheBloke's quantized models on an A100 40GB. It is not one of the most recent models.
### To reproduce
```
openllm start llama --model-id TheBloke…
```
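The command is cut off above; as a way to rule out the checkpoint itself, a minimal transformers load of a TheBloke GPTQ repo works independently of OpenLLM (hypothetical repo id; assumes optimum and auto-gptq are installed).
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id standing in for the truncated one above.
model_id = "TheBloke/Llama-2-7B-GPTQ"

# transformers (with optimum + auto-gptq installed) can load GPTQ
# checkpoints directly; device_map="auto" places weights on the A100.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```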