danielgross / localpilot

MIT License

Question about model choices #22

Open TC72 opened 8 months ago

TC72 commented 8 months ago

I'm still learning about running models locally. Could I ask how you decide which version of each model to run? I see different versions like Q5_K_S and Q4_K_M. I understand the main driver is memory when choosing between 7B, 13B, 34B, etc., but how do you decide which quantization is right?

I'm on a 32GB M2 Max MacBook Pro.
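For context, the memory footprint of the weights can be approximated as parameter count times bits per weight. Here's a rough back-of-the-envelope sketch; the bits-per-weight figures are approximations I'm assuming, not official llama.cpp numbers:

```python
# Back-of-the-envelope size estimate: weights ≈ params × bits-per-weight / 8.
# The bits-per-weight figures are rough assumptions, not exact llama.cpp values.
BITS_PER_WEIGHT = {
    "Q2_K": 2.6,
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
}

def weight_size_gb(params_billion: float, quant: str) -> float:
    """Approximate size of the model weights in GB."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

for quant in BITS_PER_WEIGHT:
    print(f"13B {quant}: ~{weight_size_gb(13, quant):.1f} GB")
```

By that estimate a 13B Q4_K_M is around 8 GB and Q5_K_M around 9 GB, so either leaves plenty of headroom on 32GB.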

fletchgqc commented 8 months ago

I don't really know, but I think people normally choose the biggest one that works on their computer. How did it work out for you?

TC72 commented 8 months ago

If you look at tools like LM Studio, they mark some models as recommended. They say anything ending in _0, like codellama-13b-instruct.Q4_0.gguf, uses a legacy quantization method. For _K_S they don't give an opinion, but _K_M variants do tend to be shown as recommended.

They also mention that Q2 and Q3 models suffer a loss of quality.

So for me the sweet spot seems to be Q4_K_M and Q5_K_M. I might try writing some kind of evaluation script to compare them on response quality and time taken; a rough sketch of what I mean is below.
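A minimal sketch of such a comparison, assuming llama-cpp-python is installed (`pip install llama-cpp-python`); the model paths are placeholders to point at your own GGUF files, and quality here is just eyeballed rather than scored:

```python
# Rough comparison sketch: time the same prompt across two quantizations.
import time

from llama_cpp import Llama

# Placeholder paths; adjust to wherever your GGUF files live.
MODELS = {
    "Q4_K_M": "models/codellama-13b-instruct.Q4_K_M.gguf",
    "Q5_K_M": "models/codellama-13b-instruct.Q5_K_M.gguf",
}

PROMPT = "Write a Python function that reverses a linked list."

for name, path in MODELS.items():
    # n_gpu_layers=-1 offloads all layers to Metal on Apple Silicon.
    llm = Llama(model_path=path, n_gpu_layers=-1, verbose=False)
    start = time.perf_counter()
    out = llm(PROMPT, max_tokens=256)
    elapsed = time.perf_counter() - start
    tokens = out["usage"]["completion_tokens"]
    print(f"{name}: {tokens} tokens in {elapsed:.1f}s "
          f"({tokens / elapsed:.1f} tok/s)")
    print(out["choices"][0]["text"][:200])  # eyeball the quality by hand
    del llm  # release memory before loading the next model
```

Judging response quality automatically is the hard part; timing and tokens/sec at least fall out for free.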