LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with KoboldAI's UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Llama 3 8B Instruct and impossibly slow speeds for generating tokens #800

Closed. AnonymousCaptain closed this issue 2 months ago.

AnonymousCaptain commented 2 months ago

With any GGUF quantization of Llama 3 8B Instruct, it's impossible for me to generate from a prompt. Prompt processing is slower than with other models (110 seconds for roughly 450 tokens), but token generation takes about 112,813 ms per token, i.e. well under 0.01 tokens per second. I've not received any error messages about RAM or anything else.
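For scale, converting those numbers (a quick plain-Python check using only the figures above, nothing extra measured):

    # Quick conversion of the speeds reported above.
    prompt_tokens = 450
    prompt_seconds = 110
    ms_per_generated_token = 112813

    print(prompt_tokens / prompt_seconds)    # ~4.1 tokens/s prompt processing
    print(1000 / ms_per_generated_token)     # ~0.0089 tokens/s generation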

Even stopping the generation takes longer than 2 or 3 minutes; it's no longer instant.

I'm currently using the latest released build of koboldcpp and bartowski's imatrix GGUF Q4_K_M made for koboldcpp b2710.

LostRuins commented 2 months ago

Could you check a few things:

  1. Which backend are you using? CuBLAS?
  2. What are the generation params? How many gpu offloaded layers?
  3. What are your system specs?
  4. Have you tried other models? How fast does Llama 2 7B work for you instead?
  5. Have you tried other quants? Perhaps try a Q2_K, or a Q4_0 to compare

Providing a console output log would be helpful too, as it will show the launcher details.

AnonymousCaptain commented 2 months ago

Hi, sorry for bothering you, but this issue is now resolved. After some help from others, I found out it was simply too many offloaded GPU layers that caused the generation to never start.

Just in case, I'll try to explain why I thought there was an issue with koboldcpp:

1 and 3. I'm using CuBLAS on a 3050 Ti laptop (4 GB VRAM) with 16 GB RAM, on Windows 10.

  2. My generation parameters are an 8192 context size (with Context Shift), a BLAS batch size of 512, and all 33 layers offloaded for the Llama 3 8B GGUF imatrix Q4_K_M. I also set koboldcpp to high priority.

  4. I had tried other 7B models like WizardLM 2 GGUF Q6 and Llama 2 GGUF Q6 before this issue and they ran fine, also with all GPU layers offloaded. That led me to two mistaken assumptions when running Llama 3 8B for the first time: that all of its layers could be offloaded, and that offloading all layers even with no VRAM left over is good practice (see the rough estimate after this list).

  5. I had only tried Llama 3 8B Q6 and dropped down to Q4 because of the issue (which didn't solve my problem). Despite that, I still believed it was an issue with koboldcpp, because previous Llama 3 GGUF uploads from NousResearch and QuantFactory still output tokens (prompt processing was incredibly slow on those uploads, but I thought that was just a quirk of the model).

Additionally, I had tried to run 13B models before, but they produced memory error messages. That led me to believe that if there were a problem with memory or VRAM, koboldcpp would have shown an error message, but it never did when I ran Llama 3 8B with my generation parameters.
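As a rough illustration of that second mistake, here is a back-of-the-envelope estimate (plain Python; the file size, layer count, and free VRAM figures are my own assumptions for this setup, not anything koboldcpp reports):

    # Very rough estimate of how many layers actually fit in VRAM.
    # Assumption: weight memory is roughly the GGUF file size, spread evenly
    # across the offloadable layers; the KV cache and compute buffers are
    # ignored, so the real limit is even lower than this.
    model_file_gb = 4.9   # approximate size of a Llama 3 8B Q4_K_M GGUF
    total_layers = 33     # layer count reported for this model
    free_vram_gb = 3.5    # 4 GB card minus whatever Windows already uses

    gb_per_layer = model_file_gb / total_layers
    max_layers = int(free_vram_gb / gb_per_layer)
    print(f"~{gb_per_layer:.2f} GB per layer, so at most ~{max_layers} layers fit")
    # Roughly 23 layers at best, which is why all 33 could never fit on a 4 GB card.

That lines up with the 20 layers that ended up working for me.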

In summary, I thought there was an issue with koboldcpp because I believed the Llama 3 8B GGUF Q4 file was small enough to offload all of my GPU layers, and because doing so produced no memory error messages. I managed to solve the issue by simply reducing the GPU layers from 33 to 20; it was just an obvious mistake on my end.
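For completeness, the working relaunch looks roughly like this (flag names as I understand the koboldcpp command line, so double-check them against --help; the model filename is just a placeholder for bartowski's Q4_K_M file):

    # Hypothetical relaunch with the reduced layer count.
    import subprocess

    subprocess.run([
        "python", "koboldcpp.py",
        "--model", "Meta-Llama-3-8B-Instruct-Q4_K_M.gguf",  # placeholder filename
        "--usecublas",
        "--contextsize", "8192",
        "--blasbatchsize", "512",
        "--gpulayers", "20",   # down from 33
        "--highpriority",
    ])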

I guess if there's one thing left to mention, it's this: why didn't a memory error message pop up while my prompt was being processed?