-
We want to deploy https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-bnb-4bit, which is a 4-bit quantized version of the Llama-3.2-1B model, quantized using bitsandbytes. Can we deploy this using ten…
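Regardless of the serving stack chosen, a quick sanity check is to load the checkpoint directly with transformers and bitsandbytes; a minimal sketch, assuming a CUDA GPU and the transformers, accelerate, and bitsandbytes packages are installed:

```python
# Minimal sketch: load the pre-quantized bnb-4bit checkpoint with transformers
# and run one generation, as a sanity check before wiring it into a server.
# Assumes transformers, accelerate, and bitsandbytes are installed and a CUDA
# GPU is available (the bitsandbytes 4-bit kernels require CUDA).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "unsloth/Llama-3.2-1B-Instruct-bnb-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The 4-bit quantization config ships inside the checkpoint, so no extra
# BitsAndBytesConfig is needed here.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Say hello in one sentence."}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```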
-
**Describe the bug**
The OpenAI API endpoint is "/v1/chat/completions", but the OVMS endpoint is "/v3/chat/completions".
Most existing applications don't allow the user to modify the prefix "**V1**" to "**…
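For reference, clients that do expose a configurable base URL can already be pointed at the "/v3" prefix, since the OpenAI Python SDK appends "/chat/completions" to whatever base URL it is given; a minimal sketch (host, port, and model name are placeholders):

```python
# Sketch of the workaround when the client exposes a configurable base URL:
# point the OpenAI SDK at OVMS's "/v3" prefix instead of the default "/v1".
# Host, port, and model name below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```

The request in this issue concerns the far more common case where the application hard-codes the "/v1" prefix and only lets the user change the host and port.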
-
For functional testing of features that rely on LLM calls or LLM tasks, the main challenge is testing against a stubbed LLM environment so that the output of those calls or tasks can be controlled.
…
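One common approach is to patch the LLM client at test time; a minimal pytest sketch, where `myapp.llm.complete` and `summarize_ticket` are hypothetical names standing in for the real call site and the feature under test:

```python
# Minimal sketch of stubbing an LLM call in a functional test. The module
# `myapp.llm` and the function `summarize_ticket` are hypothetical; the point
# is that the test controls the LLM output and makes no network call.
import pytest

import myapp.llm  # hypothetical module wrapping the real LLM API
from myapp.summarize import summarize_ticket  # hypothetical feature under test


def test_summarize_ticket_uses_llm_output(monkeypatch):
    canned = "Customer cannot log in after password reset."
    # Replace the real LLM call with a stub that returns a fixed answer.
    monkeypatch.setattr(myapp.llm, "complete", lambda prompt, **kwargs: canned)

    result = summarize_ticket("...long ticket text...")
    assert canned in result
```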
-
**Is your feature request related to a problem? Please describe.**
There is overhead from creating new threads when using the streaming response feature.
This Drogon example demonstrates it very well: …
-
### 🚀 The feature, motivation and pitch
vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving
Looks so cool.
### Alternatives
_No response_
### Additional context
_No response_
-
# OPEA Inference Microservices Integration for LangChain
This RFC proposes the integration of OPEA inference microservices (from GenAIComps) into LangChain [extensible to other frameworks], enabli…
-
Hi @zeitlings,
I love the workflow; I tried it on an M1 MacBook Air with 8 GB of RAM. As you can imagine, it completely sucks. I also have a server on which I sometimes play around with larger LLMs.…
-
intelanalytics/ipex-llm-serving-cpu:latest
-
How should we host the wiki + API backend?
Choices I see right now:
- Coolify self-hosting on a Hetzner remote machine
- Render
- Vercel
- Fly.io
- Others?
We initially thought Supabase cou…