-
Hi everyone,
I have the following setup (containers are on the same device):
- Container 1: NVIDIA NIM (OpenAI-compatible) with Llama 3 8B Instruct, port 8000;
- Container 2: chat-ui, port 3000.
…
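In case it helps to isolate the problem, here is a minimal sketch for sanity-checking the NIM endpoint from the chat-ui host. It assumes NIM exposes the OpenAI-compatible API on port 8000 and registers the model as `meta/llama3-8b-instruct`; adjust both for your deployment:

```python
# Minimal check that chat-ui's target endpoint answers. The base URL and
# model id are assumptions about this particular NIM deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```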
-
### Model description
This model was released by Mistral [here](https://mistral.ai/news/mistral-nemo/), and is available on HuggingFace [here](https://huggingface.co/mistralai/Mistral-Nemo-Base-2407)…
-
### Proposal to improve performance
Test the new Medusa speculative sampling feature with [vLLM v0.5.2](vllm-openai:v0.5.2).
After enabling Medusa speculative sampling, the performance dropped significantl…
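For context, a minimal sketch of enabling Medusa-style speculative decoding in vLLM v0.5.x via the `speculative_model` / `num_speculative_tokens` engine arguments; both model paths below are placeholders, not the exact setup tested here:

```python
# Sketch only: Medusa speculative decoding in vLLM 0.5.x.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder base model
    speculative_model="/path/to/medusa-heads",    # placeholder Medusa head checkpoint
    num_speculative_tokens=4,                     # tokens proposed per step
    use_v2_block_manager=True,                    # required for spec decode in v0.5.x
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```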
-
I tried to run inside the latest image, but after the model warmup it just died with no error. I was trying to run this:
```
aviary run --model ~/models/continuous_batching/mosaicml--mpt-7b-chat.yaml
```
the…
-
I was wondering if Flash Attention supports doing prefill in chunks,
and if so, whether there is a high-level function that can be used for that.
E.g., TGI uses `varlen_fwd`, but from what I understand this…
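For what it's worth, one pattern I've seen for chunked prefill with flash-attn is `flash_attn_with_kvcache`, which appends each chunk's K/V into a preallocated cache and attends over everything cached so far plus the new chunk. A rough, self-contained sketch (shapes and dtypes are illustrative):

```python
import torch
from flash_attn import flash_attn_with_kvcache

B, H, D = 1, 8, 64          # batch, heads, head dim (illustrative)
max_len, chunk = 2048, 512  # total prompt length, prefill chunk size
k_cache = torch.zeros(B, max_len, H, D, device="cuda", dtype=torch.float16)
v_cache = torch.zeros_like(k_cache)
cache_seqlens = torch.zeros(B, dtype=torch.int32, device="cuda")

for _ in range(max_len // chunk):
    q = torch.randn(B, chunk, H, D, device="cuda", dtype=torch.float16)
    k, v = torch.randn_like(q), torch.randn_like(q)
    # Appends k/v into the cache at offset cache_seqlens, then computes
    # causal attention of q over the cached prefix plus this chunk.
    out = flash_attn_with_kvcache(
        q, k_cache, v_cache, k=k, v=v,
        cache_seqlens=cache_seqlens, causal=True,
    )
    cache_seqlens += chunk
```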
-
Based on practical tests, deploying omost-llama-3-8b on an A100 using torch==2.3.0+cu118, vllm==0.5.0.post1+cu118, and xformers==0.0.26.post1+cu118 works well. If you want to speed up the process, you can ref…
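A minimal sketch of that deployment under the pinned versions above; the Hugging Face repo id `lllyasviel/omost-llama-3-8b` is an assumption about where the weights live:

```python
# Sketch: serving omost-llama-3-8b offline with vLLM 0.5.0.post1.
# The repo id below is an assumption; point it at local weights if needed.
from vllm import LLM, SamplingParams

llm = LLM(model="lllyasviel/omost-llama-3-8b", dtype="float16")
out = llm.generate(["a cat sitting on a table"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```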
-
DJL does not support (or has not documented support for) FP8 quantization ([docs](https://demodocs.djl.ai/docs/serving/serving/docs/lmi/user_guides/trt_llm_user_guide.html#quantization-support)).
…
-
# Weekly GitHub Trending! (2024/10/28 ~ 2024/11/04)
## Python trending: 6 repos
### [Skyvern-AI](https://github.com/Skyvern-AI) / [skyvern](https://github.com/Skyvern-AI/skyvern)
Uses LLMs and computer vision to…
-
Hello Guys,
Could you guide me in the right direction to get the configuration of the Code Llama Instruct model right?
I have this config so far:
```
{
"name": "Code Llama",
"e…
-
I tried Mistral and Llama-7B from ctransformers and I'm getting this issue; is there any way to add support for this?
How can we implement it with a websocket?
```
streaming_llm = CTransformers(model='T…
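In case it's useful, a rough sketch of streaming tokens over a websocket by using the `ctransformers` library directly (calling the model with `stream=True`) inside a FastAPI endpoint; the repo and file names below are placeholders:

```python
# Sketch: stream ctransformers tokens over a FastAPI websocket.
# Model repo/file are placeholders; swap in your own GGUF weights.
from ctransformers import AutoModelForCausalLM
from fastapi import FastAPI, WebSocket

app = FastAPI()
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",           # placeholder repo
    model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",  # placeholder file
    model_type="mistral",
)

@app.websocket("/generate")
async def generate(ws: WebSocket):
    await ws.accept()
    prompt = await ws.receive_text()
    # ctransformers yields tokens one at a time when called with stream=True.
    for token in llm(prompt, stream=True, max_new_tokens=256):
        await ws.send_text(token)
    await ws.close()
```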