-
Vulkan may not be the best, fastest, or easiest solution for inference, but it is probably the most portable GPU-acceleration approach.
Is anyone actively working on adding support for it? And if so, wha…
-
**Is your feature request related to a problem? Please describe.**
Currently, all code profiling is custom-built on internal functions from psutil and/or PyTorch. It would be great to have a more…
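For context, the kind of ad-hoc profiling the request describes can be sketched with the standard library alone (the actual project code reportedly uses psutil and/or PyTorch internals; the decorator and names below are illustrative, not from the repository):

```python
import time
import tracemalloc
from functools import wraps

def profile(fn):
    """Record wall time and peak traced memory for each call to fn.

    A stdlib-only stand-in for custom psutil/PyTorch-based profiling.
    """
    stats = {"calls": 0, "total_s": 0.0, "peak_bytes": 0}

    @wraps(fn)
    def wrapper(*args, **kwargs):
        tracemalloc.start()
        t0 = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            stats["total_s"] += time.perf_counter() - t0
            _, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            stats["calls"] += 1
            stats["peak_bytes"] = max(stats["peak_bytes"], peak)

    wrapper.stats = stats  # expose collected metrics on the wrapped function
    return wrapper

@profile
def work(n):
    return sum(range(n))

work(100_000)
print(work.stats["calls"])  # 1
```

A dedicated profiling library would replace this hand-rolled bookkeeping with a uniform interface, which is presumably the point of the request.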
-
Are there any runnable demos of using Sparse QAT/PTQ (2:4) to accelerate inference, such as applying PTQ to a 2:4-sparse LLaMA for inference acceleration? I am curious about the potential speedup rati…
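For reference, the 2:4 pattern the question refers to keeps at most 2 nonzero values in every contiguous group of 4 weights. A minimal magnitude-based pure-Python sketch of the pruning step (illustrative only; not the NVIDIA ASP or any production implementation):

```python
def prune_2_4(weights):
    """Apply 2:4 structured sparsity: in every group of 4 consecutive
    weights, keep the 2 with the largest magnitude and zero the rest."""
    out = list(weights)
    for i in range(0, len(out) - len(out) % 4, 4):
        group = out[i:i + 4]
        # indices of the two largest-magnitude entries in this group
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        for j in range(4):
            if j not in keep:
                out[i + j] = 0.0
    return out

print(prune_2_4([0.1, -0.9, 0.5, 0.05]))  # [0.0, -0.9, 0.5, 0.0]
```

The speedup in practice comes from hardware (e.g. Ampere sparse tensor cores) exploiting this fixed pattern, not from the pruning itself.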
-
Thanks for your interesting work! I tried to run inference on a 75-frame video at 512×768 on an A100, which takes about 3 minutes. I also tried using more cards; however, that only generates more videos :( . D…
-
Has anyone tried exporting the model to ONNX and running inference with ONNX Runtime?
-
### The bug
Clicking "Empty Recycle Bin" in the "Recycle Bin" section prompts "0 items have been permanently deleted", making it impossible to empty the Recycle Bin.
![image](https://g…
-
### Is there an existing issue for this?
- [X] I have searched the existing issues and checked the recent builds/commits of both this extension and the webui
### What happened?
Thank you for this p…
-
**Describe the bug**
After changing llama3 to llama3.1 as "meta-llama/Meta-Llama-3.1-8B-Instruct":
The model downloads successfully. However, it raises an error when the input is sent.
> The a…
-
### 🚀 The feature, motivation and pitch
DeepSpeed-FP6: An Optimization Approach from Microsoft
### Alternatives
Microsoft recently proposed an optimization approach called DeepSpeed-FP6. While it c…
-
### Prerequisites
- [X] I am running the latest code. Mention the version if possible as well.
- [X] I carefully followed the [README.md](https://github.com/ggerganov/llama.cpp/blob/master/README.…