Abhranta opened 20 hours ago
-c 128 -b 128
Does it work when you increase these? It is only set like this in the first example.
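For reference, a hypothetical invocation restoring larger values for those flags (the model path and prompt here are placeholders, not from the report):

```shell
# Try the defaults rather than the reduced -c/-b values
./llama-cli -m llama3-8b-Q4_0_4_4.gguf -c 2048 -b 2048 -p "Hello"
```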
No, it doesn't work. I reduced the batch size from the default after getting a segmentation fault, as I thought that reducing the context length would reduce the memory requirements, but it didn't change anything.
It's because these types don't have dequantization functions. I will push a fix in the coming days to avoid using BLAS for these types.
What happened?
I am trying to run a Q4_0_4_4 quantized Llama3 8B model. This is my config:
But I am able to run the same model without BLAS and with NEON on the same system:
I don't understand why this is happening.
Name and Version
version: 3891 (d5cb8684) built with cc (Debian 12.2.0-14) 12.2.0 for aarch64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output
No response