GameOverFlowChart closed this issue 2 months ago
Gemma is just slow; I see almost identical speeds from gemma-2-2b and minitron-4b (2.61B vs 4.51B params).
Also keep in mind that Gemma's context size is double that of Llama's, so it might run out of memory if used at the same ctx. I'm unsure how ChatterUI handles OOMs, though.
I believe this is due to the model being deeper layer-wise, requiring more computation than Llama 3 8B. I think this is an intrinsic feature of Gemma 2 and not an issue with ChatterUI.
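If anyone wants to verify the depth difference themselves, a minimal sketch (assuming you have access to the gated Hugging Face repos; the model IDs below are the official ones, and the layer counts come from the published configs: 42 for Gemma 2 9B vs 32 for Llama 3 8B):

```python
from transformers import AutoConfig

# Compare depth and width straight from the published configs.
# Gemma 2 9B is noticeably deeper (more transformer layers), which means
# more sequential work per token even at a similar parameter count.
for model_id in ["google/gemma-2-9b", "meta-llama/Meta-Llama-3-8B"]:
    cfg = AutoConfig.from_pretrained(model_id)
    print(model_id, "layers:", cfg.num_hidden_layers, "hidden size:", cfg.hidden_size)
```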
Does anyone know what makes Gemma 2 9B-based models run so slowly (locally) compared to Llama 3 8B? Sure, it's bigger, but the output speed difference is huge. Is that just how it is, or is there a known issue with Gemma that's being worked on (maybe in llama.cpp)?