-
Hi guys
I've just had reports that two specific Q4_0 70B models are outputting gibberish, and I've confirmed the same.
Example file with this issue: https://huggingface.co/TheBloke/Spicyboros-70…
-
Here are some outstanding issues for LoRA:
- [x] Base implementation (https://github.com/ggerganov/llama.cpp/pull/820)
- [ ] Improve LoRA application time with SIMD (AVX, AVX2) (https://github.com/g…
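For context on what the SIMD item would speed up: applying a LoRA adapter amounts to adding a scaled low-rank delta to each base weight matrix. A minimal NumPy sketch of that merge (the real implementation lives in ggml's C code; all names here are illustrative):

```python
import numpy as np

def apply_lora(W: np.ndarray, A: np.ndarray, B: np.ndarray,
               alpha: float, r: int) -> np.ndarray:
    """Merge a LoRA adapter into a base weight matrix.

    W: (out, in) base weight; B: (out, r); A: (r, in).
    The delta B @ A is scaled by alpha / r, the standard LoRA scaling.
    """
    return W + (alpha / r) * (B @ A)

# Tiny usage example with random tensors.
out_dim, in_dim, r, alpha = 8, 16, 4, 8.0
W = np.random.randn(out_dim, in_dim).astype(np.float32)
B = np.random.randn(out_dim, r).astype(np.float32)
A = np.random.randn(r, in_dim).astype(np.float32)
W_merged = apply_lora(W, A, B, alpha, r)
```

The hot path is the `B @ A` matmul plus the elementwise add, which is exactly what AVX/AVX2 vectorization would target.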
-
At @onefact we have been using WASM, but this won't work for the encoder-only or encoder-decoder models I've built (e.g. http://arxiv.org/abs/1904.05342). That's because the WASM VM is for the CPU (ha…
-
Dear Google AI Team,
I wish to express my strong interest in seeing Google Gemini Flash released to the open-source community.
As a developer and AI enthusiast, I have been incredibly impressed wi…
-
### Describe the issue
I am trying to replicate the following: https://intel.github.io/intel-extension-for-pytorch/llm/llama3/xpu/. While running the `python run_generation_gpu_woq_for_llama…
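Before digging into the WOQ script itself, it may help to confirm the XPU stack is healthy. A minimal smoke test, assuming an IPEX build whose `ipex.llm.optimize` supports `device="xpu"` (the checkpoint name is illustrative, not from the issue):

```python
import torch
import intel_extension_for_pytorch as ipex  # registers the XPU backend
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sanity check: is the XPU device visible at all?
print(torch.xpu.is_available(), torch.xpu.get_device_name(0))

model_id = "meta-llama/Meta-Llama-3-8B"  # illustrative checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model = model.eval().to("xpu")

# Apply IPEX's LLM-specific optimizations for the XPU device (fp16 path,
# not the weight-only-quantization path the script exercises).
model = ipex.llm.optimize(model, dtype=torch.float16, device="xpu")

inputs = tokenizer("Hello", return_tensors="pt").to("xpu")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(out[0]))
```

If this fp16 path already fails, the problem is likely in the driver/IPEX install rather than in the WOQ script.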
-
I’m curious whether you will support Arc; Neural Compressor would particularly benefit those platforms. Thanks!
-
When I run quantize after convert, the following problem occurs:
> ➜ llama ./llama.cpp/quantize ./chinese-llama-2-7b-hf/ggml-model-f16.gguf ./chinese-llama-2-7b-hf/ggml-model-q4_0.gguf 2
main: build = 2695 (bca40e98)
ma…
-
Running `quantize` with a target dtype of F32, F16, or Q8_0 can still result in a Q6_K output tensor unless `--pure` is passed (ref https://github.com/ggerganov/llama.cpp/pull/5631#issuecomment-1965055798). This is surp…
-
AutoAWQ now supports Mixtral on the main branch. It requires that we do not quantize the `gate` in the model. To prevent it from being quantized and loaded as a quantized linear layer, you have to skip loading…
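A minimal sketch of that flow, assuming AutoAWQ's `modules_to_not_convert` quant-config key is the mechanism for the skip (model path and quant settings are illustrative):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mixtral-8x7B-v0.1"  # illustrative model path

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Exclude the MoE router (`gate`) from quantization so it stays a
# plain nn.Linear instead of being swapped for a quantized layer.
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
    "modules_to_not_convert": ["gate"],
}

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized("mixtral-awq")
```

The same exclusion then has to be honored at load time, so the router is loaded as a regular linear layer.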
-
### Describe the issue
We are trying to quantize our proprietary model based on RetinaNet using TensorRT's model optimization library. The following warning was raised: **"Please consider running pre…
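For reference, a minimal sketch of the usual calibrate-and-quantize flow in NVIDIA's Model Optimizer (`modelopt`), using torchvision's RetinaNet as a stand-in for the proprietary model; the truncated warning above isn't reproduced here:

```python
import torch
import torchvision
import modelopt.torch.quantization as mtq

# Stand-in for the proprietary detector: torchvision's RetinaNet.
model = torchvision.models.detection.retinanet_resnet50_fpn(weights=None).eval()

# A few random images as stand-in calibration data (each batch is a
# list of 3xHxW tensors, the input format detection models expect).
calib_data = [[torch.rand(3, 512, 512)] for _ in range(8)]

def forward_loop(m):
    # Run representative inputs through the model so the inserted
    # quantizers can collect activation statistics.
    with torch.no_grad():
        for images in calib_data:
            m(images)

# Insert quantizers and calibrate using the default INT8 recipe.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
```

With real calibration data in place of the random tensors, any remaining warnings should point at the specific layers the library could not handle.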