Vali-98 / ChatterUI

Simple frontend for LLMs built in React Native.
GNU Affero General Public License v3.0

Question: Gemma speed #81

Closed: GameOverFlowChart closed this issue 2 months ago

GameOverFlowChart commented 2 months ago

Does anyone know what makes Gemma 2 9B-based models run so slowly (locally) compared to Llama 3 8B? Sure, it's bigger, but the output speed difference is huge. Is that just how it is, or is there a known issue with Gemma that is being worked on (maybe in llama.cpp)?

sais-github commented 2 months ago

Gemma is just slow; I see almost identical speeds from gemma-2-2b and minitron-4b (2.61B vs 4.51B params).

Also keep in mind that Gemma's context cache is roughly double the size of Llama's, so it might be running out of memory when used at the same context length. I'm unsure how ChatterUI handles OOMs, though.
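
To make the memory point concrete, here is a back-of-envelope KV-cache comparison (a minimal sketch in TypeScript, not ChatterUI code; the layer, head, and dimension values are the published configs for Gemma 2 9B and Llama 3 8B, and an fp16 cache in every layer is assumed):

```ts
// Rough KV-cache size per token, assuming an fp16 cache:
// bytes/token = 2 (K and V) * layers * kvHeads * headDim * 2 bytes.
// (Ignores Gemma 2's sliding-window layers, which can shrink this in some runtimes.)
interface ModelConfig {
  name: string;
  layers: number;
  kvHeads: number;
  headDim: number;
}

const BYTES_FP16 = 2;

function kvBytesPerToken(m: ModelConfig): number {
  return 2 * m.layers * m.kvHeads * m.headDim * BYTES_FP16;
}

// Published config values for both models.
const gemma2_9b: ModelConfig = { name: "gemma-2-9b", layers: 42, kvHeads: 8, headDim: 256 };
const llama3_8b: ModelConfig = { name: "llama-3-8b", layers: 32, kvHeads: 8, headDim: 128 };

for (const m of [gemma2_9b, llama3_8b]) {
  const perToken = kvBytesPerToken(m);
  const at8k = (perToken * 8192) / 2 ** 20; // MiB at an 8k context
  console.log(`${m.name}: ${perToken} B/token, ~${at8k.toFixed(0)} MiB at 8k ctx`);
}
```

At an 8k context this works out to roughly 2.6 GiB for Gemma 2 9B versus 1 GiB for Llama 3 8B, so Gemma's per-token cache is about 2.6x larger, in line with the concern above.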

Vali-98 commented 2 months ago

I believe this is because the model is deeper layer-wise, requiring more computation than Llama 8B. I think this is an intrinsic property of Gemma 2 and not an issue with ChatterUI.
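
For intuition: Gemma 2 9B has 42 transformer layers versus 32 in Llama 3 8B, and on-device decoding is largely memory-bandwidth-bound with a fixed per-layer launch/sync cost. The toy model below illustrates the shape of that effect (a sketch only: the bandwidth, quantization, and per-layer overhead constants are made-up placeholders, not measurements):

```ts
// Toy per-token decode latency model for memory-bound local inference:
// t ≈ (bytes of weights streamed) / bandwidth + layers * fixed per-layer overhead.
interface Model {
  name: string;
  params: number; // approximate parameter count
  layers: number;
}

const BYTES_PER_PARAM = 0.55;  // ~4-bit quantization average, hypothetical
const BANDWIDTH = 40e9;        // 40 GB/s, a placeholder phone-class figure
const LAYER_OVERHEAD_S = 3e-4; // 0.3 ms/layer, a made-up illustrative constant

function tokensPerSecond(m: Model): number {
  const streamTime = (m.params * BYTES_PER_PARAM) / BANDWIDTH;
  const overhead = m.layers * LAYER_OVERHEAD_S;
  return 1 / (streamTime + overhead);
}

const models: Model[] = [
  { name: "gemma-2-9b (42 layers)", params: 9.24e9, layers: 42 },
  { name: "llama-3-8b (32 layers)", params: 8.03e9, layers: 32 },
];

for (const m of models) {
  console.log(`${m.name}: ~${tokensPerSecond(m).toFixed(1)} tok/s (toy estimate)`);
}
```

Even under these generous assumptions the deeper, larger model comes out slower per token; the bigger gap seen in practice also reflects Gemma-specific costs this toy model ignores, such as its larger attention heads and much larger vocabulary.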