-
### Name and Version
bitnami/vllm 0.1.0
### What is the problem this feature will solve?
Add a Helm chart for vLLM, a high-throughput and memory-efficient inference and serving engine for …
-
### Search before asking
- [X] I have searched the Ultralytics YOLO [issues](https://github.com/ultralytics/ultralytics/issues) and [discussions](https://github.com/ultralytics/ultralytics/discussion…
-
### 🚀 The feature, motivation and pitch
DeepSeek-V2 introduces **MLA (Multi-head Latent Attention)**, which uses low-rank key-value joint compression to eliminate the bottleneck of inference-time key…
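The idea behind MLA can be illustrated with a toy sketch: the per-token hidden state is projected down to a small latent vector, and only that latent is cached; keys and values are re-expanded from it at attention time. The dimensions, matrix names, and projection layout below are illustrative assumptions, not DeepSeek-V2's actual implementation.

```python
# Toy sketch of MLA-style low-rank KV compression (pure Python, illustrative only).
import random

d_model, d_latent = 8, 2  # latent dim << model dim

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    # plain matrix-vector product over Python lists
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

W_down = rand_matrix(d_latent, d_model)  # compress hidden state -> latent
W_up_k = rand_matrix(d_model, d_latent)  # reconstruct keys from latent
W_up_v = rand_matrix(d_model, d_latent)  # reconstruct values from latent

h = [random.gauss(0, 1) for _ in range(d_model)]  # one token's hidden state
c = matvec(W_down, h)  # only this d_latent vector is cached per token
k = matvec(W_up_k, c)  # keys/values re-expanded from the shared latent
v = matvec(W_up_v, c)
```

The KV cache then stores `d_latent` floats per token instead of two full `d_model` vectors, which is where the inference-time memory saving comes from.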
-
RuntimeError: invalid vector subscript
2024-10-22 15:13:07,066- root:393- ERROR- Traceback (most recent call last):
File "H:\sd\comfy-torch-2.1.2+cu118\execution.py", line 323…
-
See the preprint [here](https://openreview.net/forum?id=G1hjFDre0NF).
It will be useful for few-shot in-context learning scenarios, models finetuned with prefix-tuning, and generally those LLM applicati…
-
I have just read your paper "LINA-SPEECH: GATED LINEAR ATTENTION IS A FAST AND PARAMETER-EFFICIENT LEARNER FOR TEXT-TO-SPEECH SYNTHESIS" and I must say, I am truly amazed by the effectiveness of your …
-
**Original article: Optimizing Deep Learning Inference on Embedded Systems Through Adaptive Model Selection**
**PDF URL: https://github.com/HitkoDev/energy-efficient-ml/blob/master/article.pdf**
*…
-
### **Adaptation for macOS and Mobile Devices**
Given the model's relatively small parameter size and efficient performance, I was wondering if there are any plans to adapt it for macOS devices with …
-
Functional discussion for this project.
[notebooks/llm-chatbot](https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-chatbot)
Intel's official documentation: https://www…
-
I want to perform inference on quantized LLAMA (W8A16) on ARM-v9 (with SVE) using oneDNN. The LLAMA weights are per-group quantized.
Based on my understanding, I need to prepack the weights to redu…
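For context on what per-group W8A16 quantization means here, a minimal sketch: each group of consecutive int8 weights shares one floating-point scale, and activations stay in 16-bit float. The group size, function names, and rounding scheme below are assumptions for illustration, not oneDNN's API or LLAMA's actual quantization recipe.

```python
# Minimal sketch of per-group int8 weight quantization (W8A16-style).
GROUP = 4  # assumed group size; real models often use 64 or 128

def quantize_per_group(w, group=GROUP):
    """Quantize floats to int8, one absmax scale per group of `group` weights."""
    q, scales = [], []
    for i in range(0, len(w), group):
        g = w[i:i + group]
        scale = max(abs(x) for x in g) / 127 or 1e-8  # avoid zero scale
        scales.append(scale)
        q.extend(max(-128, min(127, round(x / scale))) for x in g)
    return q, scales

def dequantize_per_group(q, scales, group=GROUP):
    # each weight is rescaled by its group's shared scale
    return [q[i] * scales[i // group] for i in range(len(q))]

w = [0.5, -1.0, 0.25, 0.75, 2.0, -2.0, 1.0, 0.0]
q, s = quantize_per_group(w)
w_hat = dequantize_per_group(q, s)
```

Prepacking in oneDNN would additionally reorder these quantized weights into the library's preferred blocked layout once, ahead of time, so the matmul kernel does not pay that cost on every inference call.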