Beep6581 / RawTherapee

A powerful cross-platform raw photo processing program
https://rawtherapee.com
GNU General Public License v3.0
2.77k stars 313 forks source link

Time reduction of Noise Reduction #1955

Closed Beep6581 closed 9 years ago

Beep6581 commented 9 years ago

Originally reported on Google Code with ID 1971

I tried to further optimize Noise reduction, but I don't get more than a 5% speedup.
My experience from past optimizations is, that the reason could be the known AMD 4-way-associative
L1-cache problem. To test this thing, I would be glad, if every developer and tester
could make a test processing the same RAW with the same settings and post the timings
(average or median of 7 pictures) in this Issue. The results could help me to detect
whether it's an AMD 4-way-associative L1-cache problem or not. If it's an AMD 4-way...,
I know where to search and can try to solve it. If I can solve it, I can try further
opts....

Here's the RAW and the settings and to make it easier also a patch with timing-code.

RAW: http://www.i-weyrich.de/rt/DENOISE.NEF

Settings: Default ISO high, but Luminance set to zero in and Impulse Noise reduction
disabled

Patch: attached

Ingo

PS: Reine, you don't need to download the 50 MB RAW. You already have it :-)

Reported by heckflosse@i-weyrich.de on 2013-09-12 20:31:04


Beep6581 commented 9 years ago
Will do that this w.e.

Reported by natureh.510 on 2013-09-12 23:29:28

Beep6581 commented 9 years ago
Hi Ingo, the patch file seems to include only the StopWatch and no optimizations, is
this intentional? Did you just want to get the understanding of the "before" state
on various architectures?

Reported by michaelezra000 on 2013-09-13 02:22:22

Beep6581 commented 9 years ago
on i7 3930K CPU @3.2GHz - 9170ms

Reported by michaelezra000 on 2013-09-13 02:49:06

Beep6581 commented 9 years ago
This 50mm F1.8 lens is a little marvel, I have to say!

Reported by michaelezra000 on 2013-09-13 03:00:12

Beep6581 commented 9 years ago
Hello Ingo

On i7 3630 CPU 2.4GHz ==> 12255 ms
:)

Reported by jdesmis on 2013-09-13 05:24:46

Beep6581 commented 9 years ago
Ingo, will try tonight!

Reported by johan@birkagatan.com on 2013-09-13 05:32:22

Beep6581 commented 9 years ago
@ #2: Michael, yes!

Reported by heckflosse@i-weyrich.de on 2013-09-13 06:10:14

Beep6581 commented 9 years ago
I wanted to chime and contribute,since Ingo said it would have been interesting to collect
data on lower-end hardware too,so here is my result:

On Intel i3-2100 CPU @ 3.10GHz --> RGB_denoise took 27130ms

Reported by msth67 on 2013-09-13 10:48:55

Beep6581 commented 9 years ago
On i7-2670QM @ 2.2GHz (but automatically over-clocked @ 2.8GHz) -> RGB_denoise took
~15300ms

Reported by natureh.510 on 2013-09-13 14:54:51

Beep6581 commented 9 years ago
AMD Athlon IIX4 640 Processor 3,00 GHz; RGB_denoise took 34606ms

Reported by willemtermeer on 2013-09-13 15:42:18

Beep6581 commented 9 years ago
At this point, i think that posting some build informations may be useful:

Branch: default
Compiler: gcc 4.6.1
Processor: generic x86
System: Windows
Bit depth: 64 bits
Build type: Release
Build flags: -mthreads -mwin32 -m64 -msse2 -mfpmath=sse -mtune=generic -fopenmp  -O3
Link flags: -mwin32 -m64 -mthreads -static -mtune=generic -s -O3
OpenMP support: ON

Just for fun, I'll try that on the Linux Mint OS (dual boot), with more up-to-date
Gcc but with the very same PC, of course.

Reported by natureh.510 on 2013-09-13 15:47:35

Beep6581 commented 9 years ago
Hello, i7 2600K: 10 586 ms

Reported by johan@birkagatan.com on 2013-09-13 15:51:19

Beep6581 commented 9 years ago
Hi, and my results are:
RGB_denoise took 25482ms (avarage of 3 run's)

Buildinfo (untouched standard generated flags)
Branch: default
Version: 4.0.11.48
Compiler: cc 4.7.3
Processor: AMD Phenom2 x4@3.4GHz
System: Linux
Bit depth: 64 bits
Gtkmm: V2.24.2
Build type: Release
Build flags:   -fopenmp -O3 -DNDEBUG
Link flags:    
OpenMP support: ON
MMAP support: ON

Reported by reine.edvardsson on 2013-09-13 17:09:54

Beep6581 commented 9 years ago
Thank you very much!!! Now I'm almost sure, that there is one of the typical 4-way-associative
L1-cache problems. I have to solve that first, because without solving it, all my other
tries to speed up Denoise will be shadowed by this problem.

Ingo

Reported by heckflosse@i-weyrich.de on 2013-09-13 22:04:54

Beep6581 commented 9 years ago
Because I got the necessary responses to my question, I removed the NEF from my webspace.
Thanks a lot to all who helped me with this test :-)

Ingo

Reported by heckflosse@i-weyrich.de on 2013-09-13 23:11:37

Beep6581 commented 9 years ago
Just for fun, here is the numbers for the Linux platform with the same laptop (i7-2670QM
@ 2.2GHz, but automatically over-clocked @ 2.8GHz on Windows):

------------ ( comment 9 & 11 ) -------------------------------

Branch: default
Compiler: gcc 4.6.1
Processor: generic x86
System: Windows7
Bit depth: 64 bits
Build type: Release
Build flags: -mthreads -mwin32 -m64 -msse2 -mfpmath=sse -mtune=generic -fopenmp  -O3
Link flags: -mwin32 -m64 -mthreads -static -mtune=generic -s -O3
OpenMP support: ON

-> RGB_denoise took ~15300ms

---------------------------------------------------------------

Branch: default
Compiler: gcc 4.7.3
Processor: generic x86
System: Linux Mint 15
Bit depth: 64 bits
Gtkmm: V2.24.2
Build type: Release
Build flags: -m64 -msse2 -mfpmath=sse -mtune=generic -fopenmp -DNDEBUG -O3 -funroll-loops
Link flags: -m64 -mtune=generic -s -O3
OpenMP support: ON

-> RGB_denoise took ~14400ms  (5.88% less than Windows, quite nice :) )

---------------------------------------------------------------

Reported by natureh.510 on 2013-09-18 23:05:00

Beep6581 commented 9 years ago
There are some differences in code-generation of gcc between Linux 64 bit and Win 64
bit. One is that on Linux up to 14 registers can be used to pass parameters to a function
and with Win only 8 can be used but only for the first 4 parameters so it's limited
to 4 on Win.

http://en.wikipedia.org/wiki/X86_calling_conventions#List_of_x86_calling_conventions

Ingo

Reported by heckflosse@i-weyrich.de on 2013-09-19 21:15:18

Beep6581 commented 9 years ago
We can get some speedup in denoise-preview by changing the var tilesize of RGB_denoise(..)
from the actual value of 1024 to 512 or 256. With tilesize 1024 the number of used
cores is limited by the size of the preview area (e.g. 2 cores at my system). Changing
the tilesize to a lower value allows usage of more cores (and so faster processing)
in preview. If somebody wants to test, just place a StopWatch in RGB_denoise and change
the var tilesize (NOT TS) to one of the above mentioned values.

Ingo

Reported by heckflosse@i-weyrich.de on 2014-01-05 20:08:18

Beep6581 commented 9 years ago
Hi Ingo, what about variable overlap? should that be lowered correspondingly?
E.g.:
const int tilesize = 512;
const int overlap = 64;

Reported by michaelezra000 on 2014-01-05 20:35:06

Beep6581 commented 9 years ago
I played with both, fastest for preview was tilesize 256 and overlap 32.

Reported by heckflosse@i-weyrich.de on 2014-01-05 20:58:47

Beep6581 commented 9 years ago
Timing on my i7 D800e with chroma NR=15:
1024/128: full-render - 9100 ms; preview 864 ms
 256/ 32: full-render - 7320 ms; preview 540 ms

However, when comparing images before and after, I see very prominent checkerboard
pattern of tiles.

Reported by michaelezra000 on 2014-01-05 21:29:42

Beep6581 commented 9 years ago
Of course. I meant to use this values only for preview (at least in first step). For
full processing, we could use the old ones, if they are better.

Reported by heckflosse@i-weyrich.de on 2014-01-05 21:33:18

Beep6581 commented 9 years ago
It is puzzling though, why this causes such difference in output contents.
could be a threading issue? Now that more cores are engaged?

Reported by michaelezra000 on 2014-01-05 21:35:55

Beep6581 commented 9 years ago
Output tiles are feathered together (line 625 ...) so there are differences when changing
the size of tiles.

Reported by heckflosse@i-weyrich.de on 2014-01-05 21:51:51

Beep6581 commented 9 years ago
Ingo please see issue 1458, just FYI.

Reported by entertheyoni on 2014-01-05 23:25:07

Beep6581 commented 9 years ago
DrSlony, could you point me to the relevant comment of this issue, please?

Reported by heckflosse@i-weyrich.de on 2014-01-05 23:29:43

Beep6581 commented 9 years ago
Found it.

Reported by heckflosse@i-weyrich.de on 2014-01-06 11:08:07

Beep6581 commented 9 years ago
Vectorized another part of denoise. Please don't test on Win32. The SSE function headers
for Win32 are not yet included. Only a small speedup, but better than nothing.
Denoise of D800 NEF with default Denoise settings gets a speedup of about 5% at my
system.

Ingo

Reported by heckflosse@i-weyrich.de on 2014-03-02 13:51:17


Beep6581 commented 9 years ago
Hi Ingo, now for D800e it is 7750 ms and no artifacts:)
(it was 9170ms before, comment #3)

Reported by michaelezra000 on 2014-03-02 18:26:20

Beep6581 commented 9 years ago
Hi Michael, thanks for your test. Will post next patch with some additional vectorization
for enhanced mode in a few hours.

Ingo

Reported by heckflosse@i-weyrich.de on 2014-03-02 18:55:56

Beep6581 commented 9 years ago
Just realized, that the function header for Win32 has been there before my patch. So
it can also be tested with Win32.

Ingo

Reported by heckflosse@i-weyrich.de on 2014-03-02 19:21:29

Beep6581 commented 9 years ago
A few hours, as mentioned in #30 will turn to at least one day, because I'm lazy.....

Reported by heckflosse@i-weyrich.de on 2014-03-03 00:24:59

Beep6581 commented 9 years ago
Ingo, you are saying years of CPU time, whats a few extra hours:)

Reported by michaelezra000 on 2014-03-03 01:58:25

Beep6581 commented 9 years ago

Reported by entertheyoni on 2014-03-03 11:55:03

Beep6581 commented 9 years ago
Here's the next patch. Didn't include the vectorization for enhanced mode, but made
some more general opts. Vectorization for enhanced mode follows later.

1.) Optimized MadMax. It's now called Mad, because Max wasn't used.

2.) Reduced the number of memory allocations in RGBtile_denoise and boxabsblur by a
big amount. Before patch there have been 166850 allocations and deallocations of a
16 kb block caused by this two functions to denoise a D800 NEF. Now there are only
t, where t is the number of threads used by denoise (t is the same as the number of
cores when denoising a complete image).

At my system, that leads to another 5% speedup compared to last patch.

Ingo

Reported by heckflosse@i-weyrich.de on 2014-03-03 22:42:30


Beep6581 commented 9 years ago
Ingo, with iuuse1971_01.patch I get 7620 ms

Reported by michaelezra000 on 2014-03-04 02:30:51

Beep6581 commented 9 years ago

Reported by heckflosse@i-weyrich.de on 2014-03-04 23:10:53

Beep6581 commented 9 years ago
Timings on C2D E8400@3Ghz 4GB (3GB for programs), L1 data 2x32KB 8way, L1inst 2x32KB
8way, L2 6144 KB 24way.

Run on the "denoise.nef" with edit panel on - save tiff . Each time I waited for disk
activity to settle down, that's why I didn't use queue.
On all runs some disk activity detected during denoise ..

denoise_test.patch 50.5 sec (average of 7/8 runs, 4th was an outlier)- best: 50.0sec

issue1971_00.patch 43.5 sec (average of 7 runs, no outliers)- best: 43.2 
issue1971_01.patch 41.8 sec (average of 6 runs (2x3) no outliers) - best 41.0

With issue1971_01 I had on both 2 sessions crashes on the 4th repeat each time. The
3 first repeats run fine but during the 4th RT crashed. No crashes with the other patches.
Console message was ..
"terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc"

Branch: default
Version: 4.0.12.63 + iuuse1971_01.patch
Changeset: c875b60b406a
Compiler: gcc 4.6.1
Processor: generic x86
System: Windows
Bit depth: 32 bits
Gtkmm: V2.22.0
Build type: Release
Build flags:  -mthreads -mtune=generic -fopenmp -Werror=unknown-pragmas -mwindows -DNDEBUG
-msse2 -mfpmath=sse -O3
Link flags:  -mthreads -static-libgcc -Wl,--large-address-aware,--verbose  -mtune=generic
-mwindows -s -O3
OpenMP support: ON
MMAP support: ON

Reported by iliasgiarimis on 2014-03-05 19:56:54

Beep6581 commented 9 years ago
Hi Ilias, thanks for your test.

"terminate called after throwing an instance of 'std::bad_alloc' means, that new couldn't
allocate more memory. That's strange, because the changes made in iuuse1971_01.patch
can increase the max needed memory only by
2 * c * 64 * 64 * 4 + c * 65536 * 4 bytes, which is for your cpu (c=2) 589824 bytes

But I will have a look, whether I made an error somewhere

Ingo

Reported by heckflosse@i-weyrich.de on 2014-03-05 20:21:03

Beep6581 commented 9 years ago
Ilias, if the bad_alloc with iuuse1971_01.patch stays reproducible, I'll make the SSE-opts
for enhanced mode based on issue1971_00.patch.

Ingo

Reported by heckflosse@i-weyrich.de on 2014-03-05 21:06:37

Beep6581 commented 9 years ago
Ilias, can you try iuuse1971_01.patch with a smaller picture than DENOISE.NEF ? D800
files really need a lot of memory.

Ingo

Reported by heckflosse@i-weyrich.de on 2014-03-05 21:42:48

Beep6581 commented 9 years ago
I had again the same crash with 12Mp file http://rawtherapee.com/forum/viewtopic.php?f=1&t=5322
but I had extreme settings and I don't remember after how many iterations it happened.

Starting a test with 12Mp and the same pp3 as asked in #0 ..

Reported by iliasgiarimis on 2014-03-05 22:23:06

Beep6581 commented 9 years ago
With 12Mp file RT didn't crash but it's a matter of time to reach the limit. 

RT's base memory consumption increases around 90MB after every iteration with 12Mp
file. 

With 36Mp file it increases around 270MB after each iteration. I read a peak at 2848MB
during the usual crash during 4th iteration ..

Should I try with 2277 patch ?

Reported by iliasgiarimis on 2014-03-05 23:11:35

Beep6581 commented 9 years ago
The memory leaks fixed in  Issue 2277  aren't so big that they could need 270 MB for
each iteration. Will have a look now, whether it behaves same here.

Ingo

Reported by heckflosse@i-weyrich.de on 2014-03-05 23:29:37

Beep6581 commented 9 years ago
Ilias, there is something wrong with iuuse1971_01.patch. Will find it. Thank you very
much for your test!!!

Ingo

Reported by heckflosse@i-weyrich.de on 2014-03-05 23:33:02

Beep6581 commented 9 years ago
Ilias, Here's the fix for the crash.

Reported by heckflosse@i-weyrich.de on 2014-03-06 10:46:39


Beep6581 commented 9 years ago
Ingo THANKS !!!

issue1971_02.patch is stable and works better. There is no memory leak anymore, and
the speed improved .. 

On the denoise.nef (36Mp) The average time of six iterations was 41.0sec with the best
at 40.6sec
This with the task manager open ..

Reported by iliasgiarimis on 2014-03-06 16:37:00

Beep6581 commented 9 years ago
Been using issue1971_02.patch and no issues, no difference in saved images.

Reported by entertheyoni on 2014-03-07 12:30:15

Beep6581 commented 9 years ago
Here's the patch I would like to commit. It includes optimizations for enhanced mode.

Ingo

Reported by heckflosse@i-weyrich.de on 2014-03-12 14:30:37


Beep6581 commented 9 years ago
Tested patch 3 on a set of noisy D700, applied "Default ISO High" and got this:

RT before (ms)          7 324
RT after patch_3 (ms)   6 369

A nice 13 % decrease! Great Ingo!

Reported by johan@birkagatan.com on 2014-03-12 18:42:27