-
Hi everyone,
I have the following setup (containers are on the same device):
- Container 1: NVIDIA NIM (OpenAI-compatible) with Llama 3 8B Instruct, port 8000;
- Container 2: chat-ui, port 3000.
…
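In case it helps to isolate the problem, here is a minimal sketch for sanity-checking the NIM endpoint from the chat-ui host. It assumes NIM exposes the OpenAI-compatible API on port 8000 and registers the model as `meta/llama3-8b-instruct`; adjust both for your deployment:

```python
# Minimal check that chat-ui's target endpoint answers. The base URL and
# model id are assumptions about this particular NIM deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```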
-
### Model description
This model was released by Mistral [here](https://mistral.ai/news/mistral-nemo/), and is available on HuggingFace [here](https://huggingface.co/mistralai/Mistral-Nemo-Base-2407)…
-
### Proposal to improve performance
Test the new Medusa speculative sampling feature with [vLLM v0.5.2](vllm-openai:v0.5.2).
After enabling Medusa speculative sampling, the performance dropped significantl…
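For context, a minimal sketch of enabling Medusa-style speculative decoding in vLLM v0.5.x via the `speculative_model` / `num_speculative_tokens` engine arguments; both model paths below are placeholders, not the exact setup tested here:

```python
# Sketch only: Medusa speculative decoding in vLLM 0.5.x.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder base model
    speculative_model="/path/to/medusa-heads",    # placeholder Medusa head checkpoint
    num_speculative_tokens=4,                     # tokens proposed per step
    use_v2_block_manager=True,                    # required for spec decode in v0.5.x
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```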
-
I tried to run inside the latest image, but after the model warmup it just died with no error. I was trying to run this:
```
aviary run --model ~/models/continuous_batching/mosaicml--mpt-7b-chat.yaml
```
the…
-
I was wondering if Flash Attention supports doing prefill in chunks,
and if so, whether there is a high-level function that can be used for that.
E.g., TGI uses `varlen_fwd`, but from what I understand this…
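For what it's worth, one pattern I've seen for chunked prefill with flash-attn is `flash_attn_with_kvcache`, which appends each chunk's K/V into a preallocated cache and attends over everything cached so far plus the new chunk. A rough, self-contained sketch (shapes and dtypes are illustrative):

```python
import torch
from flash_attn import flash_attn_with_kvcache

B, H, D = 1, 8, 64          # batch, heads, head dim (illustrative)
max_len, chunk = 2048, 512  # total prompt length, prefill chunk size
k_cache = torch.zeros(B, max_len, H, D, device="cuda", dtype=torch.float16)
v_cache = torch.zeros_like(k_cache)
cache_seqlens = torch.zeros(B, dtype=torch.int32, device="cuda")

for _ in range(max_len // chunk):
    q = torch.randn(B, chunk, H, D, device="cuda", dtype=torch.float16)
    k, v = torch.randn_like(q), torch.randn_like(q)
    # Appends k/v into the cache at offset cache_seqlens, then computes
    # causal attention of q over the cached prefix plus this chunk.
    out = flash_attn_with_kvcache(
        q, k_cache, v_cache, k=k, v=v,
        cache_seqlens=cache_seqlens, causal=True,
    )
    cache_seqlens += chunk
```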
-
Based on practical tests, deploying omost-llama-3-8b on an A100 using torch==2.3.0+cu118, vllm==0.5.0.post1+cu118, and xformers==0.0.26.post1+cu118 works well. If you want to speed up the process, you can ref…
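A minimal sketch of that deployment under the pinned versions above; the Hugging Face repo id `lllyasviel/omost-llama-3-8b` is an assumption about where the weights live:

```python
# Sketch: serving omost-llama-3-8b offline with vLLM 0.5.0.post1.
# The repo id below is an assumption; point it at local weights if needed.
from vllm import LLM, SamplingParams

llm = LLM(model="lllyasviel/omost-llama-3-8b", dtype="float16")
out = llm.generate(["a cat sitting on a table"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```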
-
DJL does not support (or has not documented support for) FP8 quantization ([docs](https://demodocs.djl.ai/docs/serving/serving/docs/lmi/user_guides/trt_llm_user_guide.html#quantization-support)).
…
-
# Weekly GitHub Trending! (2024/10/28 ~ 2024/11/04)
## Python trending: 6 repos
### [Skyvern-AI](https://github.com/Skyvern-AI) / [skyvern](https://github.com/Skyvern-AI/skyvern)
Uses LLMs and computer vision to…
-
Hello Guys,
Could you guide me in the right direction to get the configuration of the Code Llama Instruct model right?
I have this config so far:
```
{
"name": "Code Llama",
"e…
-
I tried Mistral and Llama-7B from ctransformers and I'm getting this issue; is there any way to add support for this?
How can we implement it with a websocket?
```
streaming_llm = CTransformers(model='T…
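In case it's useful, a rough sketch of streaming tokens over a websocket by using the `ctransformers` library directly (calling the model with `stream=True`) inside a FastAPI endpoint; the repo and file names below are placeholders:

```python
# Sketch: stream ctransformers tokens over a FastAPI websocket.
# Model repo/file are placeholders; swap in your own GGUF weights.
from ctransformers import AutoModelForCausalLM
from fastapi import FastAPI, WebSocket

app = FastAPI()
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",           # placeholder repo
    model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",  # placeholder file
    model_type="mistral",
)

@app.websocket("/generate")
async def generate(ws: WebSocket):
    await ws.accept()
    prompt = await ws.receive_text()
    # ctransformers yields tokens one at a time when called with stream=True.
    for token in llm(prompt, stream=True, max_new_tokens=256):
        await ws.send_text(token)
    await ws.close()
```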