-
### What happened?
Llama 3.1 8B quantized after https://github.com/ggerganov/llama.cpp/pull/8676 fails the "wicks" problem that Llama 3 8B can answer correctly.
Prompt: `Making one candle requir…
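The prompt is truncated above, so as a minimal sketch of how one might compare the two quantizations side by side with llama-cpp-python (the model paths and filled-in prompt are assumptions, not from the issue):
```python
# Repro sketch using llama-cpp-python; model paths are placeholders and the
# truncated "wicks" prompt must be pasted in by hand.
from llama_cpp import Llama

PROMPT = "..."  # the truncated "Making one candle requir…" prompt goes here

for path in ("llama-3-8b-instruct.Q4_K_M.gguf",     # quantized before the PR
             "llama-3.1-8b-instruct.Q4_K_M.gguf"):  # quantized after PR #8676
    llm = Llama(model_path=path, n_ctx=2048, verbose=False)
    out = llm(PROMPT, max_tokens=256, temperature=0.0)  # greedy, for a stable comparison
    print(path, "->", out["choices"][0]["text"].strip())
```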
-
Hi intel team,
I have pruned and quantized several models using your toolkit, and I'm currently aiming to run inference with your pipeline on my GPT-2 code generation model. To do so I need to expor…
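The issue is cut off before naming the export target. Assuming the goal is an ONNX export of the GPT-2 checkpoint, a generic sketch with Hugging Face Optimum (not Intel's own pipeline, which the truncated text does not show) might look like:
```python
# Hedged sketch: generic ONNX export of a GPT-2 causal-LM checkpoint.
# "my-gpt2-codegen" is a placeholder for the pruned/quantized model directory.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model = ORTModelForCausalLM.from_pretrained("my-gpt2-codegen", export=True)
tokenizer = AutoTokenizer.from_pretrained("my-gpt2-codegen")
model.save_pretrained("gpt2-onnx")      # writes model.onnx plus config
tokenizer.save_pretrained("gpt2-onnx")  # keep the tokenizer alongside it
```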
-
### What happened?
I'm using a 7900 XTX and get only ~3 t/s when running llama.cpp inference on qwen2-7b-instruct-q5_k_m.gguf. Whether I set -ngl 1000 or -ngl 0, I find that the GPU's VRAM usage stays very low, …
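VRAM staying low regardless of -ngl usually means no layers are actually being offloaded, e.g. a build without a working ROCm backend. One quick check, sketched here with llama-cpp-python (the model path is a placeholder), is to load with all layers offloaded and read the startup log:
```python
# Sketch: verify that layers really get offloaded. With verbose=True,
# llama.cpp prints a line like "llm_load_tensors: offloaded 29/29 layers
# to GPU" at load time; "offloaded 0/29" points at a CPU-only build.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen2-7b-instruct-q5_k_m.gguf",  # placeholder path
    n_gpu_layers=-1,  # request offload of all layers
    verbose=True,     # print backend/offload info to stderr
)
```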
-
**Describe the bug**
![Screenshot 2024-05-04 at 4 41 10 AM](https://github.com/neelnanda-io/TransformerLens/assets/310981/69c34618-015f-4cd9-9ed6-4e0b295982e9)
I followed the instructions in `docs…
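The issue body is cut off after the screenshot. For context, a minimal TransformerLens smoke test that the documented setup should support looks like this (the model name "gpt2" is an assumed example):
```python
# Minimal TransformerLens smoke test; "gpt2" is an assumed example model.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
logits, cache = model.run_with_cache("Hello, world")
print(logits.shape)  # [batch, seq_len, d_vocab]
```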
-
### What is the issue?
Hi everyone,
Sorry, I don't have time to write much; but going from 1.32 to 1.33, this:
```
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: CUDA_USE_TENS…
```
-
### System Info
- `transformers` version: 4.26.1
- Platform: Linux-5.15.0-52-generic-x86_64-with-glibc2.31
- Python version: 3.10.9
- Huggingface_hub version: 0.12.1
- PyTorch version (GPU?): 2.0…
-
### What happened?
### Problem
Some models produce corrupted output when offloading to multiple CUDA GPUs. The problem disappears when offloading to a single GPU or using CPU only.
I was able…
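Since the corruption disappears on a single GPU, one way to confirm the diagnosis is to hide all but one device before the CUDA runtime initializes. A framework-agnostic sketch using the standard CUDA_VISIBLE_DEVICES variable:
```python
# Sketch: expose only one CUDA device to confirm that multi-GPU
# offloading is the trigger. The variable must be set before any
# CUDA-aware library initializes the runtime.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only GPU 0

import torch  # optional sanity check; imported after the env var is set
print(torch.cuda.device_count())  # expected: 1
```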
-
I'm running Unsloth to fine-tune a LoRA on the llama3-8b Instruct model.
1: I merge the model with the LoRA adapter into safetensors (a generic merge sketch follows this list).
2: Running inference in Python both with the merged model direct…
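The merge step itself is truncated; a common way to produce a merged safetensors checkpoint from a base model plus LoRA adapter is PEFT's merge_and_unload. This is a generic sketch rather than Unsloth's own save helper, and the directory names are placeholders:
```python
# Generic LoRA merge sketch with PEFT; paths are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Meta-Llama-3-8B-Instruct"
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)

# Fold the adapter weights into the base weights, then drop the adapter.
merged = PeftModel.from_pretrained(base, "lora-adapter-dir").merge_and_unload()

merged.save_pretrained("merged-model", safe_serialization=True)  # .safetensors
AutoTokenizer.from_pretrained(BASE).save_pretrained("merged-model")
```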
-
After running several inference cases, stats.json in the output folder successfully collected 3 inference_runtime entries. But they do not show up in the browser at http://localhost:8000/ under Efficiency Metr…
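Since the tool and page name are truncated here, a quick sanity check is to confirm that stats.json really contains the three runs the page should be rendering. A sketch, where the file path and schema are assumptions based on the issue's mention of collected inference_runtime entries:
```python
# Sketch: inspect stats.json directly; the path is a placeholder and the
# schema is assumed from the issue's mention of "inference_runtime".
import json

with open("output/stats.json") as f:
    stats = json.load(f)

# Eyeball the structure and confirm the three inference_runtime entries exist.
print(json.dumps(stats, indent=2))
```
If the entries are present in the file but missing from the page, the problem is likely in how the frontend reads or filters the file rather than in the collection step.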
-
llama3 released
Would be happy to use it with llama.cpp:
https://huggingface.co/collections/meta-llama/meta-llama-3-66214712577ca38149ebb2b6
https://github.com/meta-llama/llama3