andriydruk / LMPlayground

Language Model Playground
MIT License

Enable compiler optimizations to improve inference speed #4

Open imatrisciano opened 3 weeks ago

imatrisciano commented 3 weeks ago

This PR introduces a couple of simple compiler flags that can greatly improve inference speed.

As described in section 5.1.2 of Jie Xiao, Qianyi Huang, Xu Chen, and Chen Tian, "Large Language Model Performance Benchmarking on Mobile Platforms: A Thorough Evaluation" (arXiv:2410.03613), the i8mm feature has been added to the architecture description for arm64-v8a processors. This extension lets the compiler emit Arm's Int8 matrix-multiplication instructions (SMMLA/UMMLA), which accelerate the int8 arithmetic used by quantised models.
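For context, here is a minimal sketch of what such a change could look like in a CMake build, assuming the Android NDK's standard `ANDROID_ABI` variable; the actual variable names and the architecture baseline in LMPlayground's CMakeLists.txt may differ:

```cmake
# Hedged sketch, not the PR's exact diff: gate the extension on the
# arm64-v8a ABI as reported by the Android NDK toolchain.
if(ANDROID_ABI STREQUAL "arm64-v8a")
    # FEAT_I8MM adds the SMMLA/UMMLA int8 matrix-multiply instructions;
    # the armv8.2-a baseline here is an assumption, not taken from the PR.
    add_compile_options(-march=armv8.2-a+i8mm)
endif()
```

Note that a binary built with `+i8mm` will fault with SIGILL on CPUs that lack the extension, so this is only safe if every targeted arm64-v8a device supports it or the project dispatches at runtime.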

The flag -Ofast has been specified in CMakeLists.txt to enable aggressive compiler optimisations for every architecture. Because -Ofast implies -ffinite-math-only, the flag -fno-finite-math-only is also required: it disables the optimisations that assume floating-point math can never produce infinities or NaNs.
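A minimal sketch of the corresponding CMakeLists.txt change, applied globally here for illustration (the PR may scope the flags to specific targets):

```cmake
# -Ofast enables -O3 plus -ffast-math, and -ffast-math in turn implies
# -ffinite-math-only. Listing -fno-finite-math-only afterwards overrides
# that single assumption, so code that deliberately produces or tests for
# Inf/NaN (e.g. -INFINITY attention masks) keeps working. Order matters:
# the -fno-... flag must come after -Ofast to take effect.
add_compile_options(-Ofast -fno-finite-math-only)
```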

With those changes, I was able to observe great performance improvements on my device (Motorola Edge 20) when using Llama3.2-1B-Q4_K_M.

ScottArbeit commented 2 weeks ago

Just because I was curious about the compiler flag changes, I asked GPT-4o for more details, which you can find at https://chatgpt.com/share/67343ef3-8ae4-8003-9c41-82ffa7cf7f5a.

Thanks for working on LM Playground! ❤️