Beep6581 / RawTherapee

A powerful cross-platform raw photo processing program
https://rawtherapee.com
GNU General Public License v3.0
2.75k stars, 313 forks

Enhancement request: Faster build for modern windows machines #7155

Open chaimav opened 1 month ago

chaimav commented 1 month ago

Current Windows builds do not pick an optimal number of threads on modern processors. Manually setting the number of threads for Wavelets yields better performance than the automatic setting.

A full explanation from HIRAM is here: https://discuss.pixls.us/t/how-to-optimize-rawtherapee/44786/15

Benitoite commented 1 month ago

OP @chaimav has tested the dev GitHub build artifact for windows-release and the official 5.10 Windows release from rawtherapee.com, on Windows 11 with an i7 13700.

Benitoite commented 1 month ago

For a fast Windows 11 optimized build, I would recommend testing skylake-raptorlake: -march=skylake -mtune=raptorlake -O3 -flto

Note: LTO doesn’t seem to work for the windows build. See: https://github.com/Beep6581/RawTherapee/issues/5379

Skylake is the microarchitecture that Intel's 8th gen Core processors (the minimum for Windows 11) are still based on; it first shipped with the 6th gen. Raptor Lake is the 13th gen (including the specific i7 mentioned above). This tuning could be added as processor target number 11.

These 2017-2022 tuning architectures should be available in gcc-13 and later.

For a fast windows 10 build, existing processor target number 10 (sandybridge-ivybridge) could be used.

Example github CI build:

Workflow link: https://github.com/Benitoite/RawTherapee/actions/runs/10122666106

about_this_build:

Version: nightly-github-actions-810-g2a8e549b7
Branch: fastwin
Commit: 2a8e549b7
Commit date: 2024-07-27
Compiler: cc 14.1.0
Processor: skylake-raptorlake
System: Windows
Bit depth: 64 bits
Gtkmm: 3.24.9
Lensfun: 0.3.4.0
libjxl: 0.10.3
Build type: release
Build flags:  -std=c++11 -ffp-contract=off -march=skylake -mtune=raptorlake -Werror=unused-label -Werror=delete-incomplete -fno-math-errno -Wno-attributes -Wall -Wuninitialized -Wcast-qual -Wno-deprecated-declarations -Wno-unused-result -Wunused-macros -fopenmp -Werror=unknown-pragmas -O3 -DNDEBUG -ftree-vectorize
Link flags:  -march=skylake -mtune=raptorlake
OpenMP support: ON
MMAP support: ON
Build OS: MINGW64_NT-10.0-20348 3.5.3-d8b21b8c.x86_64 x86_64
Build date: Sat, 27 Jul 2024 13:02:36 +0000 UTC
Build epoch: 1722085356
Build UUID: efe7a454-0778-4f9a-90bc-74f2f1a12109
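
To check which architecture a given build artifact was compiled for, the "Build flags" line above can be inspected mechanically. The following is a minimal sketch in Python; the helper name `extract_arch_flags` is hypothetical and not part of RawTherapee, and the flag string is an abbreviated copy of the one reported above.

```python
def extract_arch_flags(build_flags: str) -> dict:
    """Return the -march/-mtune values found in a compiler flag string."""
    found = {}
    for token in build_flags.split():
        for name in ("march", "mtune"):
            prefix = f"-{name}="
            if token.startswith(prefix):
                # Keep only the architecture name after the '='.
                found[name] = token[len(prefix):]
    return found


# Abbreviated copy of the "Build flags" line from the build info above.
flags = "-std=c++11 -ffp-contract=off -march=skylake -mtune=raptorlake -fopenmp -O3 -DNDEBUG"
print(extract_arch_flags(flags))
```

A build without either flag yields an empty dict, which would suggest a generic build.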

Runs OK on Windows 10 / i7-6700 (Skylake). I'm no expert at generating timing comparison data, but by the seat of the pants it does seem much faster.

Lawrence37 commented 1 month ago

@chaimav There are two things going on here.

The first is optimal thread usage. There might be (just my speculation) a limit on the number of threads when it is set to automatic. If this is the case, manually setting the maximum threads to at least the number of logical cores you have could give you the best result. Based on the specs you provided in the Pixls thread, that would be 24. Be cautious about possible performance issues when using high values (see #6730), so some experimentation could be required.
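
As a rough illustration of the experimentation suggested here, the sketch below lists a few thread counts worth trying, up to the number of logical cores. It is written in Python only for brevity; `candidate_thread_counts` is a hypothetical helper, not anything in RawTherapee, and the quarter/half/full steps are an arbitrary assumption. `os.cpu_count()` is the standard-library call that reports logical cores.

```python
import os


def candidate_thread_counts(logical_cores=None):
    """Return a short ascending list of thread counts to try by hand."""
    if logical_cores is None:
        # os.cpu_count() can return None on unusual platforms; fall back to 1.
        logical_cores = os.cpu_count() or 1
    steps = {
        max(1, logical_cores // 4),
        max(1, logical_cores // 2),
        logical_cores,
    }
    return sorted(steps)


# 24 logical cores, as in the machine described in the Pixls thread.
print(candidate_thread_counts(24))
```

Each value would then be entered manually as the maximum thread count and timed, keeping #6730 in mind for the higher values.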

The other thing is build optimization for more recent architectures. The official builds are generic, which means they work for a large percentage of computers (I'm only talking about x86). We could provide one or more optimized builds for more recent architectures. I'd like to see what the performance improvements are for various processor targets to determine the best compromise between performance and compatibility. We also have to think about AMD and how to make the optimized builds available without making it confusing for those who are not techies.

chaimav commented 1 month ago

Is it possible for the installer to determine the CPU on install? Or even have a manual option during install that defaults to the current version if the user doesn't select another option?

chaimav commented 1 month ago

I tried @Benitoite's build and still saw a measurable improvement from increasing the number of threads. I used a previous edit that has Wavelets > Sharp-mask & clarity enabled and just panned side to side, timing multiple pans crudely (stopwatch on my smartphone). With threads set to 0, the processing bar showed up for about 2.5 seconds, but with threads set to 16 it was there for about 1.4 seconds.
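
For reference, those two stopwatch readings work out to roughly a 1.8x speedup. The Python sketch below shows the arithmetic, plus a generic way to take the median of repeated timings instead of a single reading. Both function names are hypothetical, and the `task` callable is a stand-in: RawTherapee's processing is not being invoked here.

```python
import statistics
import time


def median_seconds(task, repeats=5):
    """Run task() several times and return the median wall-clock duration."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_s, candidate_s):
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


# Stopwatch readings from the comment above: threads=0 vs threads=16.
print(round(speedup(2.5, 1.4), 2))  # 1.79
```

The median is used rather than the mean so that one slow outlier pan does not skew the comparison.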

Lawrence37 commented 1 month ago

I'm not sure if it's possible to detect the CPU architecture. I took a quick look at the Inno Setup documentation and didn't see anything that can help.
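
If the CPU's instruction-set features could be obtained some other way, choosing between builds would be the easy part. The Python sketch below is purely illustrative: the function name, the feature names, and the mapping from features to build labels are all assumptions, and the features checked are only a representative subset of what `-march=skylake` or `-march=sandybridge` actually enable. It also says nothing about how an Inno Setup installer would obtain the feature set.

```python
def pick_build(cpu_features):
    """Pick a build label from a set of lowercase CPU feature names."""
    features = set(cpu_features)
    # Assumed subset of the features a skylake-targeted build relies on.
    if {"avx2", "fma", "bmi2"} <= features:
        return "skylake-raptorlake"
    # Assumed minimum for a sandybridge-targeted build.
    if "avx" in features:
        return "sandybridge-ivybridge"
    # Fall back to the generic build when nothing newer is confirmed.
    return "generic"


print(pick_build({"sse2", "avx", "avx2", "fma", "bmi2"}))
print(pick_build({"sse2"}))
```

Defaulting to the generic build matches the suggestion above that the installer should fall back to the current version when no other option is selected.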