-
While we support batched inference like other constrained decoding libraries, the current implementation can be parallelized further. In particular, we can mask logits in batch and run several `kbnf` …
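The batched masking step could be sketched roughly as follows (plain NumPy, hypothetical shapes; in practice each row's allowed-token mask would come from that sequence's own `kbnf` engine rather than being handed in directly):

```python
import numpy as np

def mask_logits_batch(logits: np.ndarray, allowed: np.ndarray) -> np.ndarray:
    """Mask a whole batch of logit rows in one vectorized step.

    logits:  (batch, vocab) float array from the model.
    allowed: (batch, vocab) boolean array; True marks tokens the grammar
             permits for that sequence (a stand-in for what each per-sequence
             `kbnf` engine would produce).
    """
    # Disallowed tokens get -inf so softmax assigns them zero probability.
    return np.where(allowed, logits, -np.inf)

# Toy usage: a batch of 2 sequences over a 4-token vocabulary.
logits = np.array([[0.1, 2.0, -1.0, 0.5],
                   [1.5, 0.0,  0.3, -0.2]])
allowed = np.array([[True,  False, True, False],
                    [False, True,  True, True]])
out = mask_logits_batch(logits, allowed)
```

The point is that the mask application itself is a single vectorized operation over the whole batch; only the per-sequence grammar bookkeeping needs to run separately.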
-
Hello Allegro Team,
I hope this message finds you well. I would like to propose the integration of xDiT, a scalable inference engine for Diffusion Transformers (DiTs), into the Allegro ecosystem. x…
-
### Feature request
This request aims to introduce functionality to delete specific adapter layers integrated with PEFT (Parameter-Efficient Fine-Tuning) within the Hugging Face Transformers librar…
-
### Requested feature
First of all, congrats on the amazing work!
I have two improvement ideas that might help simplify using this library in a wider range of production workloads:
* Supp…
-
### 🚀 Feature
Adding support for head-specific KV cache compression, which applies a variable compression rate to each attention head.
### Motivation
Ada-KV[1] has demonstrated that employing differ…
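To illustrate the idea (this is a toy allocation rule, not Ada-KV's actual algorithm), one could give heads with flatter attention distributions a larger share of a fixed total KV budget, then keep each head's top-scoring cache positions:

```python
import numpy as np

def per_head_keep_indices(attn: np.ndarray, total_budget: int):
    """Pick which cache positions each head keeps, with head-specific budgets.

    attn: (num_heads, seq_len) non-negative importance mass each head puts on
          each cached position (a stand-in for the scores a real method would
          compute).
    total_budget: total number of KV entries to retain across all heads.
    """
    num_heads, seq_len = attn.shape
    # Entropy as a spread measure per head (higher = flatter attention).
    p = attn / attn.sum(axis=1, keepdims=True)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1)
    # Flatter heads get a larger slice of the budget; every head keeps >= 1.
    share = entropy / entropy.sum()
    budgets = np.maximum(1, np.round(share * total_budget).astype(int))
    # Per head, keep the top-budget positions by importance.
    keep = []
    for h in range(num_heads):
        b = min(budgets[h], seq_len)
        idx = np.argsort(attn[h])[::-1][:b]
        keep.append(np.sort(idx))
    return budgets, keep

# Head 0 is sharply peaked, head 1 is flat: head 1 ends up with more budget.
attn = np.array([[10.0, 0.1, 0.1, 0.1, 0.1],
                 [ 1.0, 1.0, 1.0, 1.0, 1.0]])
budgets, keep = per_head_keep_indices(attn, total_budget=4)
```

A peaked head concentrates its attention on a few positions, so evicting the rest is nearly free; a flat head needs more entries to preserve its output, which is the intuition behind head-specific rates.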
-
Model generates only garbage.
Sample: https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/llama-3-8b-32k-sampling.ipynb
NeuronSDK2.19 PyTorc…
-
I'm experiencing a `std::bad_alloc` exception when attempting to load a large model (~2.16 GB) using MediaPipe's LLM inference capabilities on an iPhone 16 Pro. The app crashes during model initializati…
-
https://proceedings.mlsys.org/paper_files/paper/2023/file/523f87e9d08e6071a3bbd150e6da40fb-Paper-mlsys2023.pdf
-
### Describe the feature request
PyTorch / HF (previously branded as BetterTransformer) now have some support for the NJT (nested jagged tensor) representation:
- https://github.com/onnx/onnx/issues/6525
This allows us to have e…
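For context, the NJT-style layout boils down to a flat values buffer plus per-row offsets, which stores variable-length sequences without padding. A minimal sketch (this is illustrative only, not the PyTorch or ONNX API):

```python
import numpy as np

class Jagged:
    """Toy nested/jagged batch: a flat values buffer plus offsets marking
    where each row starts and ends, so no padding is stored."""

    def __init__(self, rows):
        lengths = [len(r) for r in rows]
        # offsets[i]..offsets[i+1] delimits row i inside the flat buffer.
        self.offsets = np.concatenate([[0], np.cumsum(lengths)]).astype(int)
        self.values = np.concatenate(rows) if rows else np.array([])

    def row(self, i):
        return self.values[self.offsets[i]:self.offsets[i + 1]]

# Three variable-length sequences stored back-to-back, padding-free.
j = Jagged([np.array([1.0, 2.0]),
            np.array([3.0]),
            np.array([4.0, 5.0, 6.0])])
```

Exporting this layout (rather than a padded dense tensor plus a mask) is what makes a corresponding ONNX representation attractive for ragged batches.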
-
CoreGraphics is currently largely unexposed via darwinkit.
Some of its functions are needed for efficient CoreML inference, especially when creating CGImages from raw data or bytes.
e.g. `CGDataPr…