-
### Describe the issue
Can I run `python -m vllm.entrypoints.openai.api_server` to load MInference capabilities in vLLM?
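For reference, the offline usage I have in mind looks roughly like the sketch below. The `MInference("vllm", ...)` patching call is my assumption from memory of the MInference README, and I am not sure whether the `api_server` entrypoint applies such a patch on its own.
```python
# Rough sketch (assumptions noted): patch a vLLM LLM object with MInference
# before generation. The MInference(...) call is assumed from its README;
# the api_server CLI path may not perform this patching automatically.
from vllm import LLM, SamplingParams
from minference import MInference  # assumed import path

model_name = "Qwen/Qwen2-7B-Instruct"  # hypothetical model choice
llm = LLM(model_name, max_num_seqs=1, enforce_eager=True)

minference_patch = MInference("vllm", model_name)  # assumed patching interface
llm = minference_patch(llm)

outputs = llm.generate(
    ["Summarize this very long document ..."],
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```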
-
Has anyone done TensorRT inference acceleration for ASF-YOLO?
-
## 🚀 Feature
Please add Lookahead Decoding to mlc-llm in C++; we need it to speed up LLM decoding on **mobile devices**.
Reference: https://github.com/hao-ai-lab/LookaheadDecoding
## Motivation
…
-
Thanks for the FOSS!
Suggestion for possible future backend runtimes: Vulkan, OpenCL, SYCL/OpenVINO/Intel GPU, AMD GPU/ROCm/HIP.
Vulkan and OpenCL both have the possibility of being very port…
-
### Describe the issue
FP16 model inference is slower than FP32. Does FP16 inference require additional configuration, or is converting the model to FP16 enough?
### To reproduce
convert onnx …
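The conversion path I have in mind is roughly the sketch below; the file names and the CUDA provider are placeholders, and it assumes the `onnxconverter-common` package for the FP16 cast.
```python
# Sketch of a common FP32 -> FP16 conversion path (file names are hypothetical).
import onnx
import onnxruntime as ort
from onnxconverter_common import float16

model_fp32 = onnx.load("model_fp32.onnx")
# keep_io_types=True leaves graph inputs/outputs in FP32 so the calling code is unchanged.
model_fp16 = float16.convert_float_to_float16(model_fp32, keep_io_types=True)
onnx.save(model_fp16, "model_fp16.onnx")

# FP16 usually only pays off on an execution provider with native half-precision
# support (e.g. CUDA); on the CPU provider FP16 ops are often emulated, which can
# make FP16 slower than FP32.
sess = ort.InferenceSession("model_fp16.onnx", providers=["CUDAExecutionProvider"])
```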
-
**LocalAI version:**
Using Docker image:
`localai/localai:latest-aio-gpu-hipblas`
**Environment, CPU architecture, OS, and Version:**
- Ubuntu 22.04
- Xeon X5570 [Specs](https://ark.intel.c…
-
Hi, I was wondering if there is any support for CPU inference. The sample script from hubconf.py doesn't run even after all the code instructing tensors and models to move to CUDA was removed per…
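To illustrate what I'm trying to achieve, here is a generic CPU-only PyTorch pattern; `TinyNet`, the checkpoint path, and the input shape are hypothetical stand-ins for whatever hubconf.py actually builds.
```python
# Generic CPU-only inference pattern in PyTorch. TinyNet is a placeholder model;
# the commented lines show how a checkpoint would be loaded without touching CUDA.
import torch
import torch.nn as nn

device = torch.device("cpu")

class TinyNet(nn.Module):  # hypothetical stand-in for the real model
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 8, 3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
        )

    def forward(self, x):
        return self.net(x)

model = TinyNet().to(device).eval()
# state = torch.load("weights.pth", map_location=device)  # map_location avoids CUDA deserialization
# model.load_state_dict(state)

with torch.no_grad():
    x = torch.randn(1, 3, 224, 224, device=device)  # example input
    print(model(x).shape)
```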
-
```
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor, AutoConfig
from qwen_vl_utils import process_vision_info
import torch
model_name = "Qwen/Qwen2-VL-7B-I…
```
-
Good morning (or afternoon/evening)!
There is a methodology called **self-speculative decoding** among the techniques for speeding up LLM inference. Would it be possible to implement this …
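To illustrate the idea rather than any particular repo's API: in self-speculative decoding the draft tokens come from the same model with some layers skipped (early exit), and a single full forward pass then verifies them, so no separate draft model is needed. A toy greedy sketch, where `draft_next_token` and `full_model_logits` are hypothetical callables standing in for the cheap and full paths:
```python
# Toy greedy accept/verify loop behind (self-)speculative decoding.
# draft_next_token: cheap prediction (e.g. the same model with layers skipped).
# full_model_logits: full forward pass returning one row of scores per position.
from typing import Callable, List

def speculative_decode_greedy(
    prompt: List[int],
    draft_next_token: Callable[[List[int]], int],
    full_model_logits: Callable[[List[int]], List[List[float]]],
    max_new_tokens: int = 64,
    k: int = 4,  # tokens drafted per verification step
) -> List[int]:
    tokens = list(prompt)
    generated = 0
    while generated < max_new_tokens:
        # 1) Draft k tokens autoregressively with the cheap path.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next_token(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) One full forward pass scores every drafted position at once.
        logits = full_model_logits(tokens + draft)
        # 3) Accept drafted tokens while they match the full model's argmax;
        #    on the first mismatch, take the full model's token and stop.
        accepted = 0
        for i, t in enumerate(draft):
            pos = len(tokens) + i - 1  # logits at pos predict the token at pos + 1
            best = max(range(len(logits[pos])), key=lambda v: logits[pos][v])
            if best == t:
                accepted += 1
            else:
                draft[i] = best
                accepted += 1
                break
        tokens.extend(draft[:accepted])
        generated += accepted
    return tokens[: len(prompt) + max_new_tokens]
```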
-
## SHARK Studio Roadmap
This project establishes and tracks a plan for phased releases of the SHARK Studio WebUI.
There are three objectives of this roadmap:
- Define product features, support…