andriydruk / LMPlayground

Language Model Playground
MIT License

Enable compiler optimizations to improve inference speed #4

Open imatrisciano opened 3 weeks ago

imatrisciano commented 3 weeks ago

This PR introduces a couple of simple compiler flags that can greatly improve inference speed.

As described in section 5.1.2 of Jie Xiao, Qianyi Huang, Xu Chen, and Chen Tian, "Large Language Model Performance Benchmarking on Mobile Platforms: A Thorough Evaluation" (arXiv:2410.03613), the i8mm feature has been added to the architecture description for arm64-v8a processors. This extension lets the compiler emit Arm's Int8 matrix-multiplication instructions (SMMLA/UMMLA), which accelerate the int8 arithmetic used by quantised models.
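For context, here is a minimal sketch of what such a change could look like in a CMake build, assuming the Android NDK's standard `ANDROID_ABI` variable; the actual variable names and the architecture baseline in LMPlayground's CMakeLists.txt may differ:

```cmake
# Hedged sketch, not the PR's exact diff: gate the extension on the
# arm64-v8a ABI as reported by the Android NDK toolchain.
if(ANDROID_ABI STREQUAL "arm64-v8a")
    # FEAT_I8MM adds the SMMLA/UMMLA int8 matrix-multiply instructions;
    # the armv8.2-a baseline here is an assumption, not taken from the PR.
    add_compile_options(-march=armv8.2-a+i8mm)
endif()
```

Note that a binary built with `+i8mm` will fault with SIGILL on CPUs that lack the extension, so this is only safe if every targeted arm64-v8a device supports it or the project dispatches at runtime.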

The flag -Ofast has been specified in CMakeLists.txt to enable aggressive compiler optimisations for every architecture. Because -Ofast implies -ffinite-math-only, the flag -fno-finite-math-only is also required: it disables the optimisations that assume floating-point math can never produce infinities or NaNs.
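A minimal sketch of the corresponding CMakeLists.txt change, applied globally here for illustration (the PR may scope the flags to specific targets):

```cmake
# -Ofast enables -O3 plus -ffast-math, and -ffast-math in turn implies
# -ffinite-math-only. Listing -fno-finite-math-only afterwards overrides
# that single assumption, so code that deliberately produces or tests for
# Inf/NaN (e.g. -INFINITY attention masks) keeps working. Order matters:
# the -fno-... flag must come after -Ofast to take effect.
add_compile_options(-Ofast -fno-finite-math-only)
```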

With those changes, I was able to observe great performance improvements on my device (Motorola Edge 20) when using Llama3.2-1B-Q4_K_M.

ScottArbeit commented 2 weeks ago

Just because I was curious about the compiler flag changes, I asked GPT-4o for more details, which you can find at https://chatgpt.com/share/67343ef3-8ae4-8003-9c41-82ffa7cf7f5a.

Thanks for working on LM Playground! ❤️