-
While we support batched inference like other constrained decoding libraries, the current implementation can be parallelized further. In particular, we can mask logits in batch and run several `kbnf` …
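The batched masking step could be sketched roughly as follows (plain NumPy, hypothetical shapes; in practice each row's allowed-token mask would come from that sequence's own `kbnf` engine rather than being handed in directly):

```python
import numpy as np

def mask_logits_batch(logits: np.ndarray, allowed: np.ndarray) -> np.ndarray:
    """Mask a whole batch of logit rows in one vectorized step.

    logits:  (batch, vocab) float array from the model.
    allowed: (batch, vocab) boolean array; True marks tokens the grammar
             permits for that sequence (a stand-in for what each per-sequence
             `kbnf` engine would produce).
    """
    # Disallowed tokens get -inf so softmax assigns them zero probability.
    return np.where(allowed, logits, -np.inf)

# Toy usage: a batch of 2 sequences over a 4-token vocabulary.
logits = np.array([[0.1, 2.0, -1.0, 0.5],
                   [1.5, 0.0,  0.3, -0.2]])
allowed = np.array([[True,  False, True, False],
                    [False, True,  True, True]])
out = mask_logits_batch(logits, allowed)
```

The point is that the mask application itself is a single vectorized operation over the whole batch; only the per-sequence grammar bookkeeping needs to run separately.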
-
Hello Allegro Team,
I hope this message finds you well. I would like to propose the integration of xDiT, a scalable inference engine for Diffusion Transformers (DiTs), into the Allegro ecosystem. x…
-
### Feature request
This request aims to introduce functionality to delete specific adapter layers integrated with PEFT (Parameter-Efficient Fine-Tuning) within the Hugging Face Transformers librar…
-
### Requested feature
First of all, congrats on the amazing work!
I have two improvement ideas that might help simplify using this library in a wider range of production workloads:
* Supp…
-
### 🚀 Feature
Adding support for head-specific KV cache compression, which applies a variable compression rate to each attention head.
### Motivation
Ada-KV[1] has demonstrated that employing differ…
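To illustrate the idea (this is a toy allocation rule, not Ada-KV's actual algorithm), one could give heads with flatter attention distributions a larger share of a fixed total KV budget, then keep each head's top-scoring cache positions:

```python
import numpy as np

def per_head_keep_indices(attn: np.ndarray, total_budget: int):
    """Pick which cache positions each head keeps, with head-specific budgets.

    attn: (num_heads, seq_len) non-negative importance mass each head puts on
          each cached position (a stand-in for the scores a real method would
          compute).
    total_budget: total number of KV entries to retain across all heads.
    """
    num_heads, seq_len = attn.shape
    # Entropy as a spread measure per head (higher = flatter attention).
    p = attn / attn.sum(axis=1, keepdims=True)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1)
    # Flatter heads get a larger slice of the budget; every head keeps >= 1.
    share = entropy / entropy.sum()
    budgets = np.maximum(1, np.round(share * total_budget).astype(int))
    # Per head, keep the top-budget positions by importance.
    keep = []
    for h in range(num_heads):
        b = min(budgets[h], seq_len)
        idx = np.argsort(attn[h])[::-1][:b]
        keep.append(np.sort(idx))
    return budgets, keep

# Head 0 is sharply peaked, head 1 is flat: head 1 ends up with more budget.
attn = np.array([[10.0, 0.1, 0.1, 0.1, 0.1],
                 [ 1.0, 1.0, 1.0, 1.0, 1.0]])
budgets, keep = per_head_keep_indices(attn, total_budget=4)
```

A peaked head concentrates its attention on a few positions, so evicting the rest is nearly free; a flat head needs more entries to preserve its output, which is the intuition behind head-specific rates.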
-
Model generates only garbage.
Sample: https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/llama-3-8b-32k-sampling.ipynb
NeuronSDK2.19 PyTorc…
-
I'm experiencing a `std::bad_alloc` exception when attempting to load a large model (~2.16 GB) using MediaPipe's LLM inference capabilities on an iPhone 16 Pro. The app crashes during model initializati…
-
https://proceedings.mlsys.org/paper_files/paper/2023/file/523f87e9d08e6071a3bbd150e6da40fb-Paper-mlsys2023.pdf
-
### Describe the feature request
PyTorch / HF (previously branded as BetterTransformer) now have some support for the NJT (nested jagged tensor) representation:
- https://github.com/onnx/onnx/issues/6525
This allows us to have e…
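For context, the NJT-style layout boils down to a flat values buffer plus per-row offsets, which stores variable-length sequences without padding. A minimal sketch (this is illustrative only, not the PyTorch or ONNX API):

```python
import numpy as np

class Jagged:
    """Toy nested/jagged batch: a flat values buffer plus offsets marking
    where each row starts and ends, so no padding is stored."""

    def __init__(self, rows):
        lengths = [len(r) for r in rows]
        # offsets[i]..offsets[i+1] delimits row i inside the flat buffer.
        self.offsets = np.concatenate([[0], np.cumsum(lengths)]).astype(int)
        self.values = np.concatenate(rows) if rows else np.array([])

    def row(self, i):
        return self.values[self.offsets[i]:self.offsets[i + 1]]

# Three variable-length sequences stored back-to-back, padding-free.
j = Jagged([np.array([1.0, 2.0]),
            np.array([3.0]),
            np.array([4.0, 5.0, 6.0])])
```

Exporting this layout (rather than a padded dense tensor plus a mask) is what makes a corresponding ONNX representation attractive for ragged batches.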
-
CoreGraphics is currently largely unexposed via darwinkit.
Some of its functions are needed for efficient CoreML inference, especially when creating CGImages from raw data or bytes.
e.g. `CGDataPr…