-
Does RWKV 5 support vLLM, LMDeploy, TGI, Fastllm, or FasterTransformer?
What should I do to measure its inference performance, i.e. throughput, per-token latency, and end-to-end latency?
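Not RWKV-specific, but as a starting point, those numbers can be collected against any OpenAI-compatible HTTP endpoint with plain shell tools. In the sketch below, the URL, port, and model id are placeholders, not something the frameworks above are confirmed to expose for RWKV 5:

```bash
# Sketch only: assumes an OpenAI-compatible server at localhost:8000
# and a hypothetical model id; adjust both for your deployment.
URL=http://localhost:8000/v1/completions
MODEL=rwkv-5-world   # placeholder model id
BODY="{\"model\": \"$MODEL\", \"prompt\": \"Hello\", \"max_tokens\": 128}"

# End-to-end latency of a single request:
time curl -s "$URL" -H 'Content-Type: application/json' -d "$BODY" > /dev/null

# Crude throughput estimate: N concurrent requests over total wall time.
# Assumes each request generates the full 128 tokens.
N=8
start=$(date +%s)
for _ in $(seq $N); do
  curl -s "$URL" -H 'Content-Type: application/json' -d "$BODY" > /dev/null &
done
wait
elapsed=$(( $(date +%s) - start )); [ "$elapsed" -eq 0 ] && elapsed=1
echo "~$(( N * 128 / elapsed )) generated tokens/s across $N concurrent requests"
```

Per-token latency can then be approximated as generation time divided by `max_tokens`, or measured directly from inter-chunk gaps if the server supports streaming.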
-
The README mentions the ability to serve at scale with continuous batching.
Even if it isn't vLLM or TGI, is there some work on this that someone could point me to?
Is there any functioning packaging…
-
I experienced a failure using the TGI-NeuronX DLC on ml.trn1.32xlarge for Llama-2 70B.
* I am able to compile successfully on inf2.48xlarge with a context length of 2K, batch size of 4, and TP of 24, and furt…
-
### What behavior of the library made you think about the improvement?
As of now, Medusa generates hallucinations because the speculative multi-head does not support the Outlines decoding grammar.
…
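For context, a grammar-constrained request of the kind affected here looks roughly like the sketch below; the host, port, and JSON schema are illustrative assumptions, not taken from the report:

```bash
# Illustrative only: a TGI /generate call with a JSON grammar constraint.
# Endpoint and schema are assumptions made for the sake of the example.
curl -s http://localhost:8080/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "Give me the name and age of a person.",
    "parameters": {
      "max_new_tokens": 64,
      "grammar": {
        "type": "json",
        "value": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"}
          },
          "required": ["name", "age"]
        }
      }
    }
  }'
```

If I read the report correctly, once a Medusa speculative model is loaded, the draft heads propose tokens without consulting this grammar, which is where the hallucinations come from.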
-
**Is your feature request related to a problem? Please describe.**
HuggingFace TGI is a standard way to…
-
### System Info
image: text-generation-inference:sha-bf3c813-rocm
GPU: AMD MI250
TGI args: --dtype float16 --model-id tiiuae/falcon-11B
P.S. Tested on meta-llama/Llama-2-7b-hf; no issues there.
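The exact launch command isn't in the report; a typical ROCm invocation for this image, following the TGI AMD docs, would look roughly like the sketch below, so treat every flag and path as an assumption:

```bash
# Sketch of a typical TGI ROCm launch; the actual command used in this
# report is unknown.
docker run --rm -it \
  --device=/dev/kfd --device=/dev/dri --group-add video \
  --ipc=host --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:sha-bf3c813-rocm \
  --dtype float16 --model-id tiiuae/falcon-11B
```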
###…
-
Hi, it would be awesome to have Mermaid support. I'm not sure if this would be helpful to others, but I can look into adding support in the future (unless someone else is already working on it).
-
Hi everyone,
I have the following setup (containers are on the same device):
- Container 1: Nvidia NIM (openai-compatible) with Llama3 8B Instruct, port 8000;
- Container 2: chat-ui, port 3000.
…
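Assuming the goal is to point chat-ui at the NIM endpoint, a minimal `.env.local` sketch would look like the following; the container hostname `nim` and the model id are assumptions about this setup:

```bash
# Sketch only: register the NIM server as an OpenAI-compatible endpoint
# in chat-ui's .env.local. "nim" assumes both containers share a Docker
# network on which container 1 resolves under that name.
cat >> .env.local <<'EOF'
MODELS=`[
  {
    "name": "meta/llama3-8b-instruct",
    "endpoints": [
      {
        "type": "openai",
        "baseURL": "http://nim:8000/v1"
      }
    ]
  }
]`
EOF
```

If the containers don't share a network, the host's IP (or `host.docker.internal`, which on Linux may require `--add-host=host.docker.internal:host-gateway`) can stand in for the service name.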
-
### System Info
Running the TGI 2.0.3 Docker image on a VM with 8 NVIDIA L4 GPUs.
Command:
```bash
MODEL=codellama/CodeLlama-70b-Python-hf
docker run \
-m 320G \
--shm-size=40G \
-e NVIDIA_VISIBLE_DEVIC…
-
### System Info
latest TGI docker image
### Information
- [X] Docker
- [ ] The CLI directly
### Tasks
- [X] An officially supported command
- [ ] My own modifications
### Reproduction
1. Use …