-
This is a feature request to deploy Small Language Models (SLMs), e.g. 3B or 1B. SLMs are improving quickly and are becoming a good choice for narrow-scope use cases.
Examples include TinyLlama, Minichat…
-
The idea is perhaps future-looking, but I'd like to bring it up for discussion.
## Motivations
* Reduce the GPU/NPU memory required for completing a use case (e.g. text2image).
* Reduce the mem…
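For a sense of what deploying such a model looks like, here is a minimal sketch using Hugging Face transformers; the model ID and prompt are illustrative examples, not part of this request:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch: a ~1.1B-parameter SLM runs in a few GB of memory.
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # example model, not prescriptive
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Write a short prompt for a text2image model:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```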
-
Hello,
Could we please have 13B and 7B models with the updated architecture that includes grouped query attention? A lot of people are running these models on machines with low memory, and this woul…
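For context on why grouped query attention helps on low-memory machines, here is a back-of-the-envelope sketch of KV-cache size; the layer/head/dimension numbers are illustrative for a 7B-class model, not actual configs:
```python
# Back-of-the-envelope: KV-cache size with and without grouped query attention.
# All numbers below are illustrative for a 7B-class model, not official specs.
n_layers, head_dim, seq_len, bytes_fp16 = 32, 128, 4096, 2

def kv_cache_gib(n_kv_heads: int) -> float:
    # Factor of 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_fp16 / 2**30

print(kv_cache_gib(32))  # full multi-head attention (32 KV heads): 2.0 GiB
print(kv_cache_gib(8))   # grouped query attention (8 KV heads):    0.5 GiB
```
Sharing 8 KV heads across the query heads cuts the cache 4x at the same context length, which is exactly what helps on low-memory machines.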
-
Hi All,
Thank you for your amazing work!
Where can we find a list of models that support Structured JSON Generation? Do all the models support that?
We were able to find a list of models in the [HF…
-
### The Feature
Currently, you only support a small number of `json`-format models:
https://docs.litellm.ai/docs/completion/json_mode
### Motivation, pitch
I would need to be able to do the same …
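For reference, requesting JSON mode through LiteLLM looks roughly like the sketch below; the model name is just an example of one that supports `response_format`:
```python
import litellm

# Sketch: ask for JSON-mode output through LiteLLM.
# "gpt-4o-mini" is only an example of a model that supports response_format.
response = litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Return a JSON object with keys 'name' and 'size'."}],
    response_format={"type": "json_object"},
)
print(response.choices[0].message.content)
```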
-
The inference time is way too high; we should try a much smaller model from Ollama:
* dolphin-phi (2.7B uncensored Dolphin model)
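As a rough sketch, switching to the smaller model could look like this, using Ollama's local REST API; the prompt is illustrative:
```python
import requests

# Sketch: call the smaller model through Ollama's local REST API.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "dolphin-phi", "prompt": "Summarize this issue in one line.", "stream": False},
)
print(resp.json()["response"])
```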
-
Hello,
In the supplemental information of the bioRxiv preprint (https://www.biorxiv.org/content/10.1101/2023.02.06.527280v2.supplementary-material), I read that over-prediction in small contigs (small…
-
I'm trying to quantize some Flux models to lower the VRAM requirements, and I get this error:
```
(venv) C:\AI\llama.cpp\build>bin\Debug\llama-quantize.exe "C:\AI\ComfyUI_windows_portable\ComfyUI\models\chec…
```
-
With local deployment, the PRELOAD_MODELS config variable works perfectly:
```
PRELOAD_MODELS='["Systran/faster-whisper-medium.en", "Systran/faster-whisper-small.en"]' MAX_MODELS=2 uvicorn main:a…
```
-
We aim to implement a system that leverages distillation and quantization to create a "child" neural network by combining parameters from two "parent" neural networks. The child network should inherit…
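As a rough sketch of one way to combine two parents (simple parameter interpolation; the scheme actually intended here may differ), assuming PyTorch:
```python
import torch.nn as nn

# Sketch: build a "child" by linearly interpolating two parents' parameters.
def make_net() -> nn.Module:
    return nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

parent_a, parent_b = make_net(), make_net()
child = make_net()

alpha = 0.5  # hypothetical mixing weight between the two parents
state_b = parent_b.state_dict()
child.load_state_dict({
    name: alpha * p + (1 - alpha) * state_b[name]
    for name, p in parent_a.state_dict().items()
})
# The child could then be distilled against the parents' outputs and
# quantized (e.g. with torch.ao.quantization) to shrink it further.
```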