Open TC72 opened 8 months ago
I don't really know for sure, but I think people normally choose the largest quantization that fits on their machine. How did it work out for you?
If you look at tools like LM Studio, they mark some quantizations as recommended. They describe anything ending in _0 (like codellama-13b-instruct.Q4_0.gguf) as a legacy quantization method. For _K_S they don't give an opinion, but the _K_M variants do tend to be shown as recommended. They also mention that Q2 and Q3 models come with a noticeable loss of quality. So for me the sweet spot seems to be Q4_K_M and Q5_K_M.
I might try writing some kind of evaluation script to compare those based on quality of response and time taken.
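As a starting point, that evaluation script could look something like the minimal timing harness below. This is only a sketch: the `generate` callable is a placeholder you would swap for a real model call (e.g. a wrapper around llama-cpp-python's `Llama`, loaded once per .gguf file), and quality still has to be judged by reading the collected responses.

```python
import statistics
import time

def benchmark(generate, prompts, runs=3):
    """Time a text-generation callable over a set of prompts.

    `generate` is any callable taking a prompt string and returning a
    response string -- for example, a closure around a model loaded
    with one particular quantization. Run this once per quantization
    and compare the timing stats and saved responses.
    """
    timings = []
    responses = []
    for prompt in prompts:
        for _ in range(runs):
            start = time.perf_counter()
            responses.append(generate(prompt))
            timings.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(timings),
        "stdev_s": statistics.stdev(timings) if len(timings) > 1 else 0.0,
        "responses": responses,  # inspect these by hand to judge quality
    }

# Stand-in generator just to show the shape of a run; replace the
# lambda with a call into your local model runner.
result = benchmark(lambda p: p.upper(), ["hello", "world"], runs=2)
print(result["mean_s"], len(result["responses"]))
```

Keeping the harness model-agnostic means the same prompts and timing logic apply to every quantization being compared, so the only variable is the .gguf file itself.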
I'm still learning about running models locally. Could I ask how you decide which version of each model you will run? I see different versions like Q5_K_S and Q4_K_M. I understand the main driver is memory when choosing between 7B, 13B, 34B, etc., but how do you decide which quantization is right?
I'm on a 32GB M2 Max MacBook Pro.
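For the memory side of the decision, a quick back-of-envelope helps: file size ≈ parameter count × bits per weight ÷ 8. The bits-per-weight figures below are approximate averages for llama.cpp quantization types (an assumption on my part; exact sizes vary by model and architecture), but they are close enough to see what fits in 32GB with headroom for the OS and KV cache.

```python
# Approximate average bits per weight for common llama.cpp quant types.
# These are rough estimates, not exact figures -- actual .gguf sizes
# vary because different tensors get different quantization.
APPROX_BITS_PER_WEIGHT = {
    "Q2_K": 2.6,
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
}

def approx_size_gb(n_params_billion, quant):
    """Estimated weight size in GB (weights only, excluding KV cache)."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return n_params_billion * 1e9 * bits / 8 / 1e9

# Compare the 13B "sweet spot" quantizations mentioned above.
for quant in ("Q4_K_M", "Q5_K_M"):
    print(f"13B {quant}: ~{approx_size_gb(13, quant):.1f} GB")
```

By this estimate a 13B Q4_K_M or Q5_K_M model lands well under 32GB, while a 34B model starts pushing the limit once you account for context memory.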