-
I want to perform inference on quantized LLAMA (W8A16) on ARM-v9 (with SVE) using oneDNN. The LLAMA weights are per-group quantized.
Based on my understanding, I need to prepack the weights to redu…
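For reference, a minimal plaintext sketch of what a per-group (W8A16-style) layout stores — int8 weights plus one floating-point scale per group. The group size and function names here are illustrative assumptions; this is not the oneDNN API:

```python
# Hypothetical sketch of per-group weight quantization: each group of
# `group_size` weights shares one fp scale; weights are stored as int8.
# Group size 4 is a toy value for illustration (real models often use 64/128).

def quantize_per_group(weights, group_size=4):
    """Quantize a flat list of float weights to int8 with one scale per group."""
    q, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / 127 or 1.0   # avoid div-by-zero
        scales.append(scale)
        q.extend(max(-127, min(127, round(w / scale))) for w in group)
    return q, scales

def dequantize_per_group(q, scales, group_size=4):
    """Reconstruct approximate float weights from int8 values and group scales."""
    return [q[i] * scales[i // group_size] for i in range(len(q))]

w = [0.5, -1.0, 0.25, 0.75, 2.0, -2.0, 1.0, 0.0]
q, s = quantize_per_group(w)
w_hat = dequantize_per_group(q, s)
```

Prepacking would then reorder `q` and `s` into the blocked layout the SVE matmul kernel expects, so the scales for each group are adjacent to their int8 block.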
-
I'd like to explore the best approach for managing multi-client connections in both single and multi-GPU environments.
Often, GPUs are underutilized by a single client, especially when smaller mode…
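One common remedy is dynamic batching: multiplex many client requests onto one GPU worker by draining whatever is queued into a single batch. A minimal sketch, assuming a simple in-process queue (names are illustrative):

```python
# Hypothetical dynamic-batching sketch: collect up to max_batch queued client
# requests into one batch, so a single slow client no longer leaves the GPU
# idle between its requests.
import queue

def drain_batch(requests, max_batch=8):
    """Block for at least one request, then greedily take whatever else is queued."""
    batch = [requests.get()]              # wait until at least one request arrives
    while len(batch) < max_batch:
        try:
            batch.append(requests.get_nowait())
        except queue.Empty:
            break
    return batch

q = queue.Queue()
for i in range(5):
    q.put(f"req{i}")
print(drain_batch(q, max_batch=3))  # -> ['req0', 'req1', 'req2']
```

In a multi-GPU setting, one such worker loop per device (with a shared or per-device queue) is the usual starting point.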
-
- Description:
- The autoregressive decoding mode of LLMs means tokens can only be generated serially, which limits inference speed. Speculative decoding can be used to decode L…
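The idea above can be sketched as a toy loop (illustrative greedy-acceptance variant, not any particular implementation): a cheap draft model proposes a block of tokens, the target model verifies them, and the longest matching prefix is kept, so fewer serial target-model steps are needed.

```python
# Toy speculative-decoding sketch over integer tokens (hypothetical interfaces:
# target_next(ctx) and draft_next(ctx) each return the next greedy token).

def speculative_decode(target_next, draft_next, prompt, n_tokens, k=4):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # Draft model proposes k tokens autoregressively (cheap).
        draft, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # Target model verifies the block; accept the matching prefix.
        accepted, ctx = 0, list(out)
        for t in draft:
            if target_next(ctx) != t:
                break
            out.append(t)
            ctx.append(t)
            accepted += 1
        if accepted < len(draft):
            # On the first mismatch, fall back to one target-model token,
            # which guarantees progress even when the draft is always wrong.
            out.append(target_next(out))
    return out[len(prompt):][:n_tokens]

# Toy models: the target counts mod 10; the draft happens to agree.
target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: (ctx[-1] + 1) % 10
print(speculative_decode(target, draft, [0], 5))  # -> [1, 2, 3, 4, 5]
```

When the draft agrees often, each verification step yields several tokens at once; when it disagrees, the loop degrades gracefully to ordinary one-token-at-a-time decoding.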
-
### 🚀 The feature, motivation and pitch
For LLM inference, requests per second (QPS) is not constant, so the vllm engine needs to be launched on demand. For elastic instances, it is important to reduce TTFT (Time…
-
**What would you like to be added/modified**:
This issue aims to build a cloud-edge collaborative inference framework for LLM on KubeEdge-Ianvs. Namely, it aims to help all cloud-edge LLM develop…
-
Does the llm module support inference using `vllm` or multiple GPUs?
If not, when will these features be implemented?
-
We're having trouble running inference efficiently at scale. By default we process the audio parts one by one, but is there any support for batch inference to speed th…
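In the absence of built-in support, a common workaround is to chunk the segments into fixed-size batches and run one forward pass per batch. A minimal sketch (the `model(batch)` interface is a hypothetical stand-in):

```python
# Offline batch-inference sketch: split audio segments into batches so each
# model call processes several segments at once instead of one by one.

def batched(items, batch_size):
    """Yield consecutive fixed-size batches (the last one may be shorter)."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

segments = [f"seg{i}" for i in range(7)]
model = lambda batch: [s.upper() for s in batch]   # stand-in for the real model
results = [r for b in batched(segments, 3) for r in model(b)]
print(results)  # 7 results, in the original segment order
```

This keeps results aligned with the input order, which matters when the segments must be stitched back into one transcript.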
-
Hello Keller,
In the paper "Efficient 3PC for Binary Circuits with Application to Maliciously-Secure DNN Inference" (Usenix Security 2023), the authors point out that the truncpr protocol proposed …
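For context, a plaintext sketch of the rounding behavior probabilistic truncation is meant to provide (illustrative only; the actual protocol operates on secret shares, and this says nothing about the security issue the paper raises):

```python
# Plaintext model of truncpr-style probabilistic truncation: drop the f low
# bits of a non-negative fixed-point value, rounding up with probability
# proportional to the dropped fraction, so the result is unbiased in expectation.
import random

def trunc_pr(x, f):
    """Return floor(x / 2^f) or floor(x / 2^f) + 1, with E[result] = x / 2^f."""
    low = x & ((1 << f) - 1)                      # the f bits being dropped
    round_up = random.randrange(1 << f) < low     # P(round up) = low / 2^f
    return (x >> f) + (1 if round_up else 0)
```

For example, `trunc_pr(13, 2)` returns 3 with probability 0.75 and 4 with probability 0.25, averaging 13/4.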
-
Add a section for patterns/blueprints under https://tag-runtime.cncf.io/wgs/cnaiwg/
Blueprints/patterns are:
- Actionable/Hands-on
- Encapsulates well-accommodated patterns of Cloud Native for…
-
We propose [MeteoRA: Multiple-Tasks Embedded LoRA for Large Language Models](https://arxiv.org/pdf/2405.13053). MeteoRA is a scalable and efficient framewor…
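To illustrate the general shape of multi-task LoRA routing, here is a toy sketch of gating among several low-rank adapters on top of a frozen base layer. This is a generic mixture-of-adapters illustration under assumed names, not the MeteoRA implementation:

```python
# Toy multi-adapter LoRA forward pass: a softmax gate over the input mixes
# several rank-r updates B_i @ (A_i @ x) into the frozen base output W @ x.
import math

def matvec(M, x):
    """Dense matrix-vector product over nested lists."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def multi_lora_forward(W, adapters, gate_w, x):
    """adapters: list of (A, B) low-rank pairs; gate_w: one gate row per adapter."""
    base = matvec(W, x)
    logits = [sum(g * v for g, v in zip(row, x)) for row in gate_w]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    probs = [e / sum(exps) for e in exps]         # softmax gate
    for p, (A, B) in zip(probs, adapters):
        delta = matvec(B, matvec(A, x))           # rank-r LoRA update
        base = [b + p * d for b, d in zip(base, delta)]
    return base
```

With one adapter the gate reduces to a pass-through; with many, the gate decides per input which task's adapter dominates.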