-
I want to perform inference on quantized LLAMA (W8A16) on ARM-v9 (with SVE) using oneDNN. The LLAMA weights are per-group quantized.
Based on my understanding, I need to prepack the weights to redu…
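For reference, a minimal plaintext sketch of what a per-group (W8A16-style) layout stores — int8 weights plus one floating-point scale per group. The group size and function names here are illustrative assumptions; this is not the oneDNN API:

```python
# Hypothetical sketch of per-group weight quantization: each group of
# `group_size` weights shares one fp scale; weights are stored as int8.
# Group size 4 is a toy value for illustration (real models often use 64/128).

def quantize_per_group(weights, group_size=4):
    """Quantize a flat list of float weights to int8 with one scale per group."""
    q, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / 127 or 1.0   # avoid div-by-zero
        scales.append(scale)
        q.extend(max(-127, min(127, round(w / scale))) for w in group)
    return q, scales

def dequantize_per_group(q, scales, group_size=4):
    """Reconstruct approximate float weights from int8 values and group scales."""
    return [q[i] * scales[i // group_size] for i in range(len(q))]

w = [0.5, -1.0, 0.25, 0.75, 2.0, -2.0, 1.0, 0.0]
q, s = quantize_per_group(w)
w_hat = dequantize_per_group(q, s)
```

Prepacking would then reorder `q` and `s` into the blocked layout the SVE matmul kernel expects, so the scales for each group are adjacent to their int8 block.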
-
I'd like to explore the best approach for managing multi-client connections in both single and multi-GPU environments.
Often, GPUs are underutilized by a single client, especially when smaller mode…
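One common remedy is dynamic batching: multiplex many client requests onto one GPU worker by draining whatever is queued into a single batch. A minimal sketch, assuming a simple in-process queue (names are illustrative):

```python
# Hypothetical dynamic-batching sketch: collect up to max_batch queued client
# requests into one batch, so a single slow client no longer leaves the GPU
# idle between its requests.
import queue

def drain_batch(requests, max_batch=8):
    """Block for at least one request, then greedily take whatever else is queued."""
    batch = [requests.get()]              # wait until at least one request arrives
    while len(batch) < max_batch:
        try:
            batch.append(requests.get_nowait())
        except queue.Empty:
            break
    return batch

q = queue.Queue()
for i in range(5):
    q.put(f"req{i}")
print(drain_batch(q, max_batch=3))  # -> ['req0', 'req1', 'req2']
```

In a multi-GPU setting, one such worker loop per device (with a shared or per-device queue) is the usual starting point.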
-
- Description:
- The autoregressive decoding mode of LLMs means tokens can only be generated serially, which limits inference speed. Speculative decoding can be used to decode L…
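The idea above can be sketched as a toy loop (illustrative greedy-acceptance variant, not any particular implementation): a cheap draft model proposes a block of tokens, the target model verifies them, and the longest matching prefix is kept, so fewer serial target-model steps are needed.

```python
# Toy speculative-decoding sketch over integer tokens (hypothetical interfaces:
# target_next(ctx) and draft_next(ctx) each return the next greedy token).

def speculative_decode(target_next, draft_next, prompt, n_tokens, k=4):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # Draft model proposes k tokens autoregressively (cheap).
        draft, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # Target model verifies the block; accept the matching prefix.
        accepted, ctx = 0, list(out)
        for t in draft:
            if target_next(ctx) != t:
                break
            out.append(t)
            ctx.append(t)
            accepted += 1
        if accepted < len(draft):
            # On the first mismatch, fall back to one target-model token,
            # which guarantees progress even when the draft is always wrong.
            out.append(target_next(out))
    return out[len(prompt):][:n_tokens]

# Toy models: the target counts mod 10; the draft happens to agree.
target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: (ctx[-1] + 1) % 10
print(speculative_decode(target, draft, [0], 5))  # -> [1, 2, 3, 4, 5]
```

When the draft agrees often, each verification step yields several tokens at once; when it disagrees, the loop degrades gracefully to ordinary one-token-at-a-time decoding.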
-
### 🚀 The feature, motivation and pitch
For LLM inference, requests per second (QPS) is not constant, so the vllm engine needs to be launched on demand. For elastic instances, it is important to reduce TTFT (Time…
-
**What would you like to be added/modified**:
This issue aims to build a cloud-edge collaborative inference framework for LLM on KubeEdge-Ianvs. Namely, it aims to help all cloud-edge LLM develop…
-
Does the llm module support inference using `vllm` or multiple GPUs?
If not, when will these features be implemented?
-
We're having trouble running inference efficiently at scale. By default we process the audio parts one by one, but is there any support for batch inference to speed th…
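In the absence of built-in support, a common workaround is to chunk the segments into fixed-size batches and run one forward pass per batch. A minimal sketch (the `model(batch)` interface is a hypothetical stand-in):

```python
# Offline batch-inference sketch: split audio segments into batches so each
# model call processes several segments at once instead of one by one.

def batched(items, batch_size):
    """Yield consecutive fixed-size batches (the last one may be shorter)."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

segments = [f"seg{i}" for i in range(7)]
model = lambda batch: [s.upper() for s in batch]   # stand-in for the real model
results = [r for b in batched(segments, 3) for r in model(b)]
print(results)  # 7 results, in the original segment order
```

This keeps results aligned with the input order, which matters when the segments must be stitched back into one transcript.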
-
Hello Keller,
In the paper "Efficient 3PC for Binary Circuits with Application to Maliciously-Secure DNN Inference" (Usenix Security 2023), the authors point out that the truncpr protocol proposed …
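For context, a plaintext sketch of the rounding behavior probabilistic truncation is meant to provide (illustrative only; the actual protocol operates on secret shares, and this says nothing about the security issue the paper raises):

```python
# Plaintext model of truncpr-style probabilistic truncation: drop the f low
# bits of a non-negative fixed-point value, rounding up with probability
# proportional to the dropped fraction, so the result is unbiased in expectation.
import random

def trunc_pr(x, f):
    """Return floor(x / 2^f) or floor(x / 2^f) + 1, with E[result] = x / 2^f."""
    low = x & ((1 << f) - 1)                      # the f bits being dropped
    round_up = random.randrange(1 << f) < low     # P(round up) = low / 2^f
    return (x >> f) + (1 if round_up else 0)
```

For example, `trunc_pr(13, 2)` returns 3 with probability 0.75 and 4 with probability 0.25, averaging 13/4.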
-
Add a section for patterns/blueprints under https://tag-runtime.cncf.io/wgs/cnaiwg/
Blueprints/patterns are:
- Actionable/Hands-on
- Encapsulates well-accommodated patterns of Cloud Native for…
-
We propose [MeteoRA: Multiple-Tasks Embedded LoRA for Large Language Models](https://arxiv.org/pdf/2405.13053). MeteoRA is a scalable and efficient framewor…
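To illustrate the general shape of multi-task LoRA routing, here is a toy sketch of gating among several low-rank adapters on top of a frozen base layer. This is a generic mixture-of-adapters illustration under assumed names, not the MeteoRA implementation:

```python
# Toy multi-adapter LoRA forward pass: a softmax gate over the input mixes
# several rank-r updates B_i @ (A_i @ x) into the frozen base output W @ x.
import math

def matvec(M, x):
    """Dense matrix-vector product over nested lists."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def multi_lora_forward(W, adapters, gate_w, x):
    """adapters: list of (A, B) low-rank pairs; gate_w: one gate row per adapter."""
    base = matvec(W, x)
    logits = [sum(g * v for g, v in zip(row, x)) for row in gate_w]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    probs = [e / sum(exps) for e in exps]         # softmax gate
    for p, (A, B) in zip(probs, adapters):
        delta = matvec(B, matvec(A, x))           # rank-r LoRA update
        base = [b + p * d for b, d in zip(base, delta)]
    return base
```

With one adapter the gate reduces to a pass-through; with many, the gate decides per input which task's adapter dominates.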