Beep6581 / RawTherapee

A powerful cross-platform raw photo processing program
https://rawtherapee.com
GNU General Public License v3.0

Settings > Performance > Threads: Automatic setting works poorly on CPUs with many threads #7160

Open chaimav opened 1 month ago

chaimav commented 1 month ago

The default setting of 0 (Automatic) does not perform well on modern Intel CPUs with high thread counts. Tested on:

Test: Enabling Wavelets > Sharp-mask and clarity and panning while zoomed in. Raising from automatic to a higher number greatly reduces processing time post panning.

Processor: 12th Gen Intel(R) Core™ i7-12700H (20 CPUs), ~2.7GHz
Memory: 32768MB RAM
Card name: NVIDIA GeForce RTX 3070 Ti Laptop GPU (RawTherapee does not take advantage of the GPU…)
SSD: 1 terabyte - NVMe SAMSUNG MZVL21T0HCLR-00BT7

video here: https://discuss.pixls.us/t/how-to-optimize-rawtherapee/44786/27?u=chaimav

Also tested on:
Processor: 13th Gen Intel(R) Core(TM) i7-13700 2.10 GHz
Memory: 16 GB
Card name: None (integrated graphics)
SSD: 1 TB NVMe Micron_2400_MTFDKBA1T0QFM

Can the automatic setting be improved to handle CPUs with higher thread counts?

Lawrence37 commented 1 month ago

I checked the code. The automatic detection works fine, but there are differences in what the code does depending on if the number of threads is set to 0 or not. Oddly enough, performance is actually worse for CPUs with few cores. Can you enable verbose mode and see what gets printed in the terminal/command prompt? Here's what I see.

Automatic:

Ip Wavelet uses 1 main thread(s) and up to 4 nested thread(s) for each main thread
Level decomp L=1
CHRO var0=0.000001 va1=0.000001 va2=0.000001 va3=0.000001 va4=0.000001 val5=0.000001 va6=0.000010
Leval decomp a=0
Leval decomp b=0
Ip Wavelet uses 1 main thread(s) and up to 4 nested thread(s) for each main thread
Level decomp L=7
CHRO var0=0.000001 va1=0.000001 va2=0.000001 va3=0.000001 va4=0.000001 val5=0.000001 va6=0.000010
Leval decomp a=7
Leval decomp b=7

Manual (4 threads, the maximum for my computer):

Ip Wavelet uses 1 main thread(s) and up to 1 nested thread(s) for each main thread
Level decomp L=1
CHRO var0=0.000001 va1=0.000001 va2=0.000001 va3=0.000001 va4=0.000001 val5=0.000001 va6=0.000010
Leval decomp a=0
Leval decomp b=0
Ip Wavelet uses 1 main thread(s) and up to 1 nested thread(s) for each main thread
Level decomp L=7
CHRO var0=0.000001 va1=0.000001 va2=0.000001 va3=0.000001 va4=0.000001 val5=0.000001 va6=0.000010
Leval decomp a=7
Leval decomp b=7

For me, automatic is about 30% faster.

chaimav commented 1 month ago

> Can you enable verbose mode and see what gets printed in the terminal/command prompt?

How do I do that?

Lawrence37 commented 1 month ago

To enable verbose mode, find your options file (see https://rawpedia.rawtherapee.com/File_Paths#Config). Make sure RawTherapee is closed, then open the options file in a text editor such as Notepad. Find the line that says Verbose=false and change it to Verbose=true. Save it.
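The manual edit described above can also be scripted. This is a minimal sketch, assuming the options file is a plain-text file containing a line that reads exactly `Verbose=false`, as described; the function name `enable_verbose` and the path you pass to it are illustrative and not part of RawTherapee. Run it only while RawTherapee is closed.

```python
import re
from pathlib import Path


def enable_verbose(options_path):
    """Switch Verbose=false to Verbose=true in an options file.

    Returns True if the file was modified, False if no Verbose=false
    line was found. Works on bytes so line endings are left untouched.
    """
    path = Path(options_path)
    data = path.read_bytes()
    # Match a whole line "Verbose=false", tolerating a trailing \r (CRLF files).
    new_data, count = re.subn(
        rb"(?m)^Verbose=false(?=\r?$)", b"Verbose=true", data
    )
    if count:
        path.write_bytes(new_data)
    return count > 0
```

Calling `enable_verbose(...)` with the location of your own options file (see the RawPedia link above) returns `False` when verbose mode is already enabled or the key is missing.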

Open the terminal or command prompt and run RawTherapee from there. You may need to add the -w option as indicated here: https://rawpedia.rawtherapee.com/Command-Line_Options#RawTherapee_GUI

Example:

    rawtherapee.exe -w

You may first need to navigate to where RawTherapee is installed. For example:

    cd /D "C:\Program Files\RawTherapee\5.10"

chaimav commented 1 month ago

Is the command terminal supposed to show something as soon as I change the threads option?

Lawrence37 commented 1 month ago

If it does show something, you can ignore it. We are only interested in what it shows when the preview updates.

chaimav commented 1 month ago

It's not showing anything. Am I doing something wrong? (I had to use .\rawtherapee.exe -w because rawtherapee.exe -w gave an error.)

Lawrence37 commented 1 month ago

I have encountered this issue before, but I don't remember how to get it to show messages because it's been a while since I debugged in Windows. Maybe @Desmis can tell us how to see the verbose messages.

Desmis commented 1 month ago

@Lawrence37 I am not at all a specialist...

The options file, in C:\Users\jdesm\AppData\Local\RawTherapee5-dev:

[General]
TabbedEditor=true
StoreLastProfile=true
StartupDirectory=last
StartupPath=D:\Coutest
DateFormat=%y-%m-%d
AdjusterMinDelay=100
AdjusterMaxDelay=200
MultiUser=true
Language=English (US)
LanguageAutoDetect=false
Theme=RawTherapee - Legacy
Version=5.10-452-g1a418552a
DarkFramesPath=
FlatFieldsPath=
CameraProfilesPath=
LensProfilesPath=
Verbose=true
Cropsleep=50
Reduchigh=0.84999999999999998
Reduclow=0.84999999999999998
Detectshape=true
Fftwsigma=true

[External Editor]
EditorKind=1
GimpDir=

and afterwards, in the MinGW64 console: ./rawtherapee w

chaimav commented 1 month ago

@Desmis that worked; apparently the dash is what threw it off. I needed to type .\rawtherapee.exe w and not .\rawtherapee.exe -w

Here is the output. I hope it is useful (because I don't really understand it).

Manually set:

Ip Wavelet uses 1 main thread(s) and up to 1 nested thread(s) for each main thread
Level decomp L=1
CHRO var0=0.000001 va1=0.000001 va2=0.000001 va3=0.000001 va4=0.000001 val5=0.000001 va6=0.000010
Leval decomp a=0
Leval decomp b=0
Ip Wavelet uses 1 main thread(s) and up to 1 nested thread(s) for each main thread
Level decomp L=7
CHRO var0=0.000001 va1=0.000001 va2=0.000001 va3=0.000001 va4=0.000001 val5=0.000001 va6=0.000010
Leval decomp a=7
Leval decomp b=7

Automatic:

Ip Wavelet uses 1 main thread(s) and up to 24 nested thread(s) for each main thread
Level decomp L=1
CHRO var0=0.000001 va1=0.000001 va2=0.000001 va3=0.000001 va4=0.000001 val5=0.000001 va6=0.000010
Leval decomp a=0
Leval decomp b=0
Ip Wavelet uses 1 main thread(s) and up to 24 nested thread(s) for each main thread
Level decomp L=7
CHRO var0=0.000001 va1=0.000001 va2=0.000001 va3=0.000001 va4=0.000001 val5=0.000001 va6=0.000010
Leval decomp a=7
Leval decomp b=7

Lawrence37 commented 1 month ago

Interesting. It says it uses one thread when you manually set it, but uses 24 threads when it's automatic. Theoretically, it should be much faster with 24 threads (automatic) which is the opposite of what you observe.
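To compare runs without reading the log by eye, the thread summary line can be extracted with a short script. This is only a sketch: it assumes the verbose output keeps the exact wording of the "Ip Wavelet uses ..." lines quoted in this thread, and the name `wavelet_thread_counts` is made up for illustration. Multiplying main by nested threads gives an upper bound on the threads the wavelet code may use, not a measurement of how many actually run.

```python
import re

# Pattern for the thread summary line printed in verbose mode, as quoted above.
THREAD_LINE = re.compile(
    r"Ip Wavelet uses (\d+) main thread\(s\) and up to (\d+) nested thread\(s\)"
)


def wavelet_thread_counts(log_text):
    """Return a list of (main, nested) thread counts found in verbose output."""
    return [(int(m), int(n)) for m, n in THREAD_LINE.findall(log_text)]


# Sample lines copied from the two runs posted in this thread.
manual = "Ip Wavelet uses 1 main thread(s) and up to 1 nested thread(s) for each main thread"
automatic = "Ip Wavelet uses 1 main thread(s) and up to 24 nested thread(s) for each main thread"

for label, log in (("manual", manual), ("automatic", automatic)):
    for main, nested in wavelet_thread_counts(log):
        print(f"{label}: up to {main * nested} wavelet thread(s)")
```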

chaimav commented 1 month ago

I was puzzled by that as well. I guess it's possible I mixed them up?

Benitoite commented 1 month ago

I am puzzled by both settings only using one main thread and a difference in nested threads. The GUI has an algorithm to calculate an optimum setting, but the efficiency depends on memory and wavelet levels.

We ran a controlled experiment using the -cli on several different CPUs, including @chaimav's.

================================
Available threads = 24  /  CPU = 13th Gen Intel(R) Core(TM) i7-13700  /  2100 MHz  /  Target = Processor: generic x86
27082 total milliseconds elapsed (average of 5 runs) using OMP_NUM_THREADS = 2
18057 total milliseconds elapsed (average of 5 runs) using OMP_NUM_THREADS = 4
14663 total milliseconds elapsed (average of 5 runs) using OMP_NUM_THREADS = 8
14928 total milliseconds elapsed (average of 5 runs) using OMP_NUM_THREADS = 16
================================

I believe we maxed out the efficiency by offering OMP threads that closely matched the wavelet levels. Moving up from 8 to 16 threads brought no further gain: the run was about 265 ms (roughly 2%) slower, a negligible difference for such a long and duplicative routine.
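The script used for these measurements is not shown in the thread. A comparable harness could look like the sketch below, which assumes the program under test honours the standard OpenMP environment variable `OMP_NUM_THREADS`. The function name `average_runtime_ms` is hypothetical, and the command is a placeholder that only starts the Python interpreter; substitute the actual command line you want to time.

```python
import os
import subprocess
import sys
import time


def average_runtime_ms(command, threads, runs=5):
    """Run `command` `runs` times with OMP_NUM_THREADS=threads; return mean ms."""
    # Copy the current environment and override only the thread count.
    env = dict(os.environ, OMP_NUM_THREADS=str(threads))
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(command, env=env, check=True)
        total += time.perf_counter() - start
    return 1000.0 * total / runs


if __name__ == "__main__":
    # Placeholder workload: replace with the real command line to benchmark.
    command = [sys.executable, "-c", "pass"]
    for threads in (2, 4, 8, 16):
        ms = average_runtime_ms(command, threads)
        print(f"{ms:.0f} total milliseconds elapsed (average of 5 runs) "
              f"using OMP_NUM_THREADS = {threads}")
```

Wall-clock timing of the whole process includes start-up and file I/O, so it suits long runs like the ones above better than short ones.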

A similar data point measured by @silviogrosso shows a similar optimization around 8 threads:

================================
Available threads = 20 / CPU = 12th Gen Intel(R) Core(TM) i7-12700H /  2700 MHz / Target = Processor: generic x86
48426 total milliseconds elapsed (average of 5 runs) using OMP_NUM_THREADS = 2
28899 total milliseconds elapsed (average of 5 runs) using OMP_NUM_THREADS = 4
23573 total milliseconds elapsed (average of 5 runs) using OMP_NUM_THREADS = 8
25115 total milliseconds elapsed (average of 5 runs) using OMP_NUM_THREADS = 16
================================

Here, going from 8 to 16 threads made the run about 6.5% slower (25115 ms versus 23573 ms).
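The scaling in both tables can be checked with a few lines of arithmetic. The timings below are copied from the tables above; the helper names `speedup_vs_baseline` and `percent_change` are illustrative only.

```python
# Timings in milliseconds, copied from the two tables above
# (average of 5 runs per OMP_NUM_THREADS value).
timings_ms = {
    "i7-13700 (24 threads)": {2: 27082, 4: 18057, 8: 14663, 16: 14928},
    "i7-12700H (20 threads)": {2: 48426, 4: 28899, 8: 23573, 16: 25115},
}


def speedup_vs_baseline(times, baseline=2):
    """Speedup of each thread count relative to the baseline thread count."""
    return {n: times[baseline] / t for n, t in times.items()}


def percent_change(times, old, new):
    """Percent change in runtime going from `old` to `new` threads (+ = slower)."""
    return 100.0 * (times[new] - times[old]) / times[old]


for cpu, times in timings_ms.items():
    speedups = speedup_vs_baseline(times)
    summary = ", ".join(f"{n} threads: {s:.2f}x" for n, s in speedups.items())
    print(f"{cpu}: {summary}")
    print(f"  8 -> 16 threads: {percent_change(times, 8, 16):+.1f}% runtime")
```

On these numbers, quadrupling the threads from 2 to 8 gives roughly a 1.85x and a 2.05x speedup, while doubling again to 16 changes the runtime by about +1.8% and +6.5% respectively.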

Lawrence37 commented 1 month ago

It's probably not mixed up. My results show the same behavior. The two specific things I find interesting are (1) the use of 24 threads when the number of cores you have is 20 (the code puts a limit equal to the number of cores) and (2) why the thread count is still 1 after manually setting the threads so high (the threads used is calculated with a formula that should result in a number greater than 1).

Benitoite commented 1 month ago

@Lawrence37 @chaimav's machine, the one with the 24-thread CPU, has 8 hyperthreaded Performance cores and 8 Efficiency cores, for a total of 16 + 8 = 24 threads.

@silviogrosso's computer is the one with 20 threads (6 hyperthreaded P-cores and 8 E-cores, so 12 + 8 = 20).
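The thread counts follow from simple arithmetic on the core layout. As a small sketch, assuming each P-core runs two threads and each E-core runs one (the function name `logical_threads` is made up for illustration):

```python
def logical_threads(p_cores, e_cores, threads_per_p_core=2):
    """Logical threads on a hybrid CPU whose P-cores are hyperthreaded
    and whose E-cores run one thread each."""
    return p_cores * threads_per_p_core + e_cores


print(logical_threads(8, 8))  # i7-13700: 16 + 8 = 24
print(logical_threads(6, 8))  # i7-12700H: 12 + 8 = 20
```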

Lawrence37 commented 1 month ago

Ok, that makes sense. I thought the first system specs in the original post was @chaimav's computer.

Benitoite commented 1 month ago

> Ok, that makes sense. I thought the first system specs in the original post was @chaimav's computer.

I think @chaimav might have the two systems, but has only provided data from the 24-core so far.

chaimav commented 1 month ago

> > Ok, that makes sense. I thought the first system specs in the original post was @chaimav's computer.
>
> I think @chaimav might have the two systems, but has only provided data from the 24-core so far.

Correct, I have only benchmarked one computer, an i7 13700 (full specs here https://www.amazon.com/dp/B0CFBDRMXT ). My previous computer was an i3 8100, which has since gone to my wife. If it will be of value, I can run the scripts on it as well.

Lawrence37 commented 1 month ago

@chaimav I created a branch which respects the number of threads set in preferences when using wavelets. I'm interested in knowing what the performance is like for different manually-set values. Executables will be available for download at the bottom of this page in a few minutes: https://github.com/Beep6581/RawTherapee/actions/runs/10238785142

chaimav commented 1 month ago

@Lawrence37 I just tested RawTherapee_wavelet-thread-num_5.10-383-gf9bcf594b_win64_release with different numbers set for threads and found no discernible difference in processing time after scrolling**. Performance was similar to zero (automatic) on the standard dev build.

**Tested with a stopwatch, so some error is to be expected, but on the regular dev build, non-zero numbers show a noticeable improvement.

Lawrence37 commented 1 month ago

It's also slow with 1 thread? I expected it to have the same performance as dev with manual threads since both use 1 thread.

chaimav commented 1 month ago

I didn't try with 1, but I tried other numbers like 8 and 24. On the dev version, those are noticeably faster than 0. On this version I saw no difference between 0 and those numbers. I can try 1 when I get home tonight.