RaspSDR / server

The web-888 web server code, cloned from kiwisdr with lots of changes

Possible GCC Optimizations #39

Open RobRich999 opened 1 month ago

RobRich999 commented 1 month ago

If anyone is interested in exploring possible performance tweaks, I can confirm the websdr.bin binary builds and works with GCC (v13.2.1) with the Graphite and -fipa-pta optimizations enabled. Note I have not tested all receiver options, so YMMV here. I will leave it up to someone else to figure out any actual performance difference(s).

https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
https://github.com/InBetweenNames/gentooLTO/blob/master/sys-config/ltoize/files/make.conf.lto.defines

Ideally LTO would be used, especially to further improve the -fipa-pta pass, plus potentially using -fdevirtualize-at-ltrans as well. I have previously done an LTO build of websdr.bin, but I did not get around to actually testing it at the time. IIRC, there were several ODR and similar warnings. I might give it another go in the near future.

set(CMAKE_CXX_FLAGS "-Wall -fsingle-precision-constant -pthread -pipe")
set(CMAKE_CXX_FLAGS_RELEASE "-g -O3 -fgraphite-identity -floop-nest-optimize -fipa-pta")

set(CMAKE_C_FLAGS "-Wall -fsingle-precision-constant -pipe")
set(CMAKE_C_FLAGS_RELEASE "-g -O3 -fgraphite-identity -floop-nest-optimize -fipa-pta")

set(PLATFORM_FLAGS -march=armv7-a -mtune=cortex-a9 -mfpu=neon -mfloat-abi=hard -ffast-math -fsingle-precision-constant -mvectorize-with-neon-quad -fgraphite-identity -floop-nest-optimize -fipa-pta -pipe)
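
The Graphite flags above only work when GCC itself was built with isl support, so a hard-coded flag list can break the build on other toolchains. A minimal sketch of probing each flag before using it is shown below. It assumes the compiler rejects the flag outright when Graphite is unavailable, and the `HAVE_...` result variables are hypothetical names generated for illustration, not anything from the project's CMakeLists.txt.

```cmake
include(CheckCXXCompilerFlag)

# Probe each Graphite flag from this thread and append only the accepted ones.
foreach(flag -fgraphite-identity -floop-nest-optimize)
  # Derive a valid result-variable name, e.g. HAVE_fgraphite_identity (hypothetical name).
  string(MAKE_C_IDENTIFIER "HAVE${flag}" have_var)
  check_cxx_compiler_flag("${flag}" ${have_var})
  if(${have_var})
    string(APPEND CMAKE_CXX_FLAGS_RELEASE " ${flag}")
  else()
    message(STATUS "Compiler rejected ${flag}, skipping it")
  endif()
endforeach()
```

The same loop could be repeated with `CheckCCompilerFlag` and `check_c_compiler_flag` for the C flags.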

RobRich999 commented 1 month ago

I have a working dff48a7 release build websdr.bin binary with LTO, graphite, -fipa-pta, and -fdevirtualize-at-ltrans optimizations.

I did not build HFDL. Also I turned off debug symbols for my build, and it dropped the binary size to under 7MB.

I have not done much testing, but basic features seem to be working as intended. YMMV, of course.

cmake .. -DCMAKE_C_FLAGS_RELEASE="-O3 -DNDEBUG -fgraphite-identity -floop-nest-optimize -flto=auto -fipa-pta -fdevirtualize-at-ltrans -pipe" -DCMAKE_CXX_FLAGS_RELEASE="-O3 -DNDEBUG -fgraphite-identity -floop-nest-optimize -flto=auto -fipa-pta -fdevirtualize-at-ltrans -pipe -pthread" -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=TRUE

set(CMAKE_INTERPROCEDURAL_OPTIMIZATION TRUE)
set(CMAKE_CXX_FLAGS_RELEASE "-O3 -fgraphite-identity -floop-nest-optimize -fipa-pta -flto=auto -fdevirtualize-at-ltrans -pipe")
set(CMAKE_C_FLAGS_RELEASE "-O3 -fgraphite-identity -floop-nest-optimize -fipa-pta -flto=auto -fdevirtualize-at-ltrans -pipe")
set(PLATFORM_FLAGS -march=armv7-a -mtune=cortex-a9 -mfpu=neon -mfloat-abi=hard -ffast-math -fsingle-precision-constant -mvectorize-with-neon-quad -fgraphite-identity -floop-nest-optimize -fipa-pta -flto=auto -fdevirtualize-at-ltrans -pipe)
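
The settings above turn on interprocedural optimization unconditionally. A guarded variant could be sketched as follows, using CMake's `CheckIPOSupported` module (available since CMake 3.9, with policy CMP0069 set to NEW). This is only an illustration of the approach, not the change in the attached diff; the variable names `ipo_ok` and `ipo_msg` are made up for the example.

```cmake
# Requires CMake 3.9 or newer; CMP0069 must be NEW for GCC to be handled.
include(CheckIPOSupported)

# Ask CMake whether the active C and C++ toolchains can do IPO/LTO.
check_ipo_supported(RESULT ipo_ok OUTPUT ipo_msg LANGUAGES C CXX)

if(ipo_ok)
  # Enable IPO for Release builds only, leaving Debug builds fast to link.
  set(CMAKE_INTERPROCEDURAL_OPTIMIZATION_RELEASE TRUE)
else()
  message(WARNING "IPO/LTO not supported, building without it: ${ipo_msg}")
endif()
```

Limiting IPO to the Release configuration keeps the longer LTO link step out of day-to-day debug builds, while a toolchain without LTO support falls back to a normal build instead of failing.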

gcc_opt.diff.txt

howard0su commented 1 month ago

A PR with a report on the perf gain would be welcome.

RobRich999 commented 2 weeks ago

I suppose one could try measuring processor utilization, though as noted, I will have to leave that to someone else for now.

Nonetheless, the best bet for a starting point (IMHO) is LTO as it rarely degrades performance except in corner cases.

howard0su commented 2 weeks ago

I tried LTO, which works fine on my devbox. I didn't carefully design a benchmark to check the performance difference, but it didn't show a significant difference.