Closed Beep6581 closed 9 years ago
Will do that this w.e.
Reported by natureh.510
on 2013-09-12 23:29:28
Hi Ingo, the patch file seems to include only the StopWatch and no optimizations, is
this intentional? Did you just want to get the understanding of the "before" state
on various architectures?
Reported by michaelezra000
on 2013-09-13 02:22:22
on i7 3930K CPU @3.2GHz - 9170ms
Reported by michaelezra000
on 2013-09-13 02:49:06
This 50mm F1.8 lens is a little marvel, I have to say!
Reported by michaelezra000
on 2013-09-13 03:00:12
Hello Ingo
On i7 3630 CPU 2.4GHz ==> 12255 ms
:)
Reported by jdesmis
on 2013-09-13 05:24:46
Ingo, will try tonight!
Reported by johan@birkagatan.com
on 2013-09-13 05:32:22
@ #2: Michael, yes!
Reported by heckflosse@i-weyrich.de
on 2013-09-13 06:10:14
I wanted to chime and contribute,since Ingo said it would have been interesting to collect
data on lower-end hardware too,so here is my result:
On Intel i3-2100 CPU @ 3.10GHz --> RGB_denoise took 27130ms
Reported by msth67
on 2013-09-13 10:48:55
On i7-2670QM @ 2.2GHz (but automatically over-clocked @ 2.8GHz) -> RGB_denoise took
~15300ms
Reported by natureh.510
on 2013-09-13 14:54:51
AMD Athlon IIX4 640 Processor 3,00 GHz; RGB_denoise took 34606ms
Reported by willemtermeer
on 2013-09-13 15:42:18
At this point, i think that posting some build informations may be useful:
Branch: default
Compiler: gcc 4.6.1
Processor: generic x86
System: Windows
Bit depth: 64 bits
Build type: Release
Build flags: -mthreads -mwin32 -m64 -msse2 -mfpmath=sse -mtune=generic -fopenmp -O3
Link flags: -mwin32 -m64 -mthreads -static -mtune=generic -s -O3
OpenMP support: ON
Just for fun, I'll try that on the Linux Mint OS (dual boot), with more up-to-date
Gcc but with the very same PC, of course.
Reported by natureh.510
on 2013-09-13 15:47:35
Hello, i7 2600K: 10 586 ms
Reported by johan@birkagatan.com
on 2013-09-13 15:51:19
Hi, and my results are:
RGB_denoise took 25482ms (avarage of 3 run's)
Buildinfo (untouched standard generated flags)
Branch: default
Version: 4.0.11.48
Compiler: cc 4.7.3
Processor: AMD Phenom2 x4@3.4GHz
System: Linux
Bit depth: 64 bits
Gtkmm: V2.24.2
Build type: Release
Build flags: -fopenmp -O3 -DNDEBUG
Link flags:
OpenMP support: ON
MMAP support: ON
Reported by reine.edvardsson
on 2013-09-13 17:09:54
Thank you very much!!! Now I'm almost sure, that there is one of the typical 4-way-associative
L1-cache problems. I have to solve that first, because without solving it, all my other
tries to speed up Denoise will be shadowed by this problem.
Ingo
Reported by heckflosse@i-weyrich.de
on 2013-09-13 22:04:54
Because I got the necessary responses to my question, I removed the NEF from my webspace.
Thanks a lot to all who helped me with this test :-)
Ingo
Reported by heckflosse@i-weyrich.de
on 2013-09-13 23:11:37
Just for fun, here is the numbers for the Linux platform with the same laptop (i7-2670QM
@ 2.2GHz, but automatically over-clocked @ 2.8GHz on Windows):
------------ ( comment 9 & 11 ) -------------------------------
Branch: default
Compiler: gcc 4.6.1
Processor: generic x86
System: Windows7
Bit depth: 64 bits
Build type: Release
Build flags: -mthreads -mwin32 -m64 -msse2 -mfpmath=sse -mtune=generic -fopenmp -O3
Link flags: -mwin32 -m64 -mthreads -static -mtune=generic -s -O3
OpenMP support: ON
-> RGB_denoise took ~15300ms
---------------------------------------------------------------
Branch: default
Compiler: gcc 4.7.3
Processor: generic x86
System: Linux Mint 15
Bit depth: 64 bits
Gtkmm: V2.24.2
Build type: Release
Build flags: -m64 -msse2 -mfpmath=sse -mtune=generic -fopenmp -DNDEBUG -O3 -funroll-loops
Link flags: -m64 -mtune=generic -s -O3
OpenMP support: ON
-> RGB_denoise took ~14400ms (5.88% less than Windows, quite nice :) )
---------------------------------------------------------------
Reported by natureh.510
on 2013-09-18 23:05:00
There are some differences in code-generation of gcc between Linux 64 bit and Win 64
bit. One is that on Linux up to 14 registers can be used to pass parameters to a function
and with Win only 8 can be used but only for the first 4 parameters so it's limited
to 4 on Win.
http://en.wikipedia.org/wiki/X86_calling_conventions#List_of_x86_calling_conventions
Ingo
Reported by heckflosse@i-weyrich.de
on 2013-09-19 21:15:18
We can get some speedup in denoise-preview by changing the var tilesize of RGB_denoise(..)
from the actual value of 1024 to 512 or 256. With tilesize 1024 the number of used
cores is limited by the size of the preview area (e.g. 2 cores at my system). Changing
the tilesize to a lower value allows usage of more cores (and so faster processing)
in preview. If somebody wants to test, just place a StopWatch in RGB_denoise and change
the var tilesize (NOT TS) to one of the above mentioned values.
Ingo
Reported by heckflosse@i-weyrich.de
on 2014-01-05 20:08:18
Hi Ingo, what about variable overlap? should that be lowered correspondingly?
E.g.:
const int tilesize = 512;
const int overlap = 64;
Reported by michaelezra000
on 2014-01-05 20:35:06
I played with both, fastest for preview was tilesize 256 and overlap 32.
Reported by heckflosse@i-weyrich.de
on 2014-01-05 20:58:47
Timing on my i7 D800e with chroma NR=15:
1024/128: full-render - 9100 ms; preview 864 ms
256/ 32: full-render - 7320 ms; preview 540 ms
However, when comparing images before and after, I see very prominent checkerboard
pattern of tiles.
Reported by michaelezra000
on 2014-01-05 21:29:42
Of course. I meant to use this values only for preview (at least in first step). For
full processing, we could use the old ones, if they are better.
Reported by heckflosse@i-weyrich.de
on 2014-01-05 21:33:18
It is puzzling though, why this causes such difference in output contents.
could be a threading issue? Now that more cores are engaged?
Reported by michaelezra000
on 2014-01-05 21:35:55
Output tiles are feathered together (line 625 ...) so there are differences when changing
the size of tiles.
Reported by heckflosse@i-weyrich.de
on 2014-01-05 21:51:51
Ingo please see issue 1458, just FYI.
Reported by entertheyoni
on 2014-01-05 23:25:07
DrSlony, could you point me to the relevant comment of this issue, please?
Reported by heckflosse@i-weyrich.de
on 2014-01-05 23:29:43
Found it.
Reported by heckflosse@i-weyrich.de
on 2014-01-06 11:08:07
Vectorized another part of denoise. Please don't test on Win32. The SSE function headers
for Win32 are not yet included. Only a small speedup, but better than nothing.
Denoise of D800 NEF with default Denoise settings gets a speedup of about 5% at my
system.
Ingo
Reported by heckflosse@i-weyrich.de
on 2014-03-02 13:51:17
PatchSubmitted
Hi Ingo, now for D800e it is 7750 ms and no artifacts:)
(it was 9170ms before, comment #3)
Reported by michaelezra000
on 2014-03-02 18:26:20
Hi Michael, thanks for your test. Will post next patch with some additional vectorization
for enhanced mode in a few hours.
Ingo
Reported by heckflosse@i-weyrich.de
on 2014-03-02 18:55:56
Just realized, that the function header for Win32 has been there before my patch. So
it can also be tested with Win32.
Ingo
Reported by heckflosse@i-weyrich.de
on 2014-03-02 19:21:29
A few hours, as mentioned in #30 will turn to at least one day, because I'm lazy.....
Reported by heckflosse@i-weyrich.de
on 2014-03-03 00:24:59
Ingo, you are saying years of CPU time, whats a few extra hours:)
Reported by michaelezra000
on 2014-03-03 01:58:25
Reported by entertheyoni
on 2014-03-03 11:55:03
Here's the next patch. Didn't include the vectorization for enhanced mode, but made
some more general opts. Vectorization for enhanced mode follows later.
1.) Optimized MadMax. It's now called Mad, because Max wasn't used.
2.) Reduced the number of memory allocations in RGBtile_denoise and boxabsblur by a
big amount. Before patch there have been 166850 allocations and deallocations of a
16 kb block caused by this two functions to denoise a D800 NEF. Now there are only
t, where t is the number of threads used by denoise (t is the same as the number of
cores when denoising a complete image).
At my system, that leads to another 5% speedup compared to last patch.
Ingo
Reported by heckflosse@i-weyrich.de
on 2014-03-03 22:42:30
Ingo, with iuuse1971_01.patch I get 7620 ms
Reported by michaelezra000
on 2014-03-04 02:30:51
Reported by heckflosse@i-weyrich.de
on 2014-03-04 23:10:53
Timings on C2D E8400@3Ghz 4GB (3GB for programs), L1 data 2x32KB 8way, L1inst 2x32KB
8way, L2 6144 KB 24way.
Run on the "denoise.nef" with edit panel on - save tiff . Each time I waited for disk
activity to settle down, that's why I didn't use queue.
On all runs some disk activity detected during denoise ..
denoise_test.patch 50.5 sec (average of 7/8 runs, 4th was an outlier)- best: 50.0sec
issue1971_00.patch 43.5 sec (average of 7 runs, no outliers)- best: 43.2
issue1971_01.patch 41.8 sec (average of 6 runs (2x3) no outliers) - best 41.0
With issue1971_01 I had on both 2 sessions crashes on the 4th repeat each time. The
3 first repeats run fine but during the 4th RT crashed. No crashes with the other patches.
Console message was ..
"terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc"
Branch: default
Version: 4.0.12.63 + iuuse1971_01.patch
Changeset: c875b60b406a
Compiler: gcc 4.6.1
Processor: generic x86
System: Windows
Bit depth: 32 bits
Gtkmm: V2.22.0
Build type: Release
Build flags: -mthreads -mtune=generic -fopenmp -Werror=unknown-pragmas -mwindows -DNDEBUG
-msse2 -mfpmath=sse -O3
Link flags: -mthreads -static-libgcc -Wl,--large-address-aware,--verbose -mtune=generic
-mwindows -s -O3
OpenMP support: ON
MMAP support: ON
Reported by iliasgiarimis
on 2014-03-05 19:56:54
Hi Ilias, thanks for your test.
"terminate called after throwing an instance of 'std::bad_alloc' means, that new couldn't
allocate more memory. That's strange, because the changes made in iuuse1971_01.patch
can increase the max needed memory only by
2 * c * 64 * 64 * 4 + c * 65536 * 4 bytes, which is for your cpu (c=2) 589824 bytes
But I will have a look, whether I made an error somewhere
Ingo
Reported by heckflosse@i-weyrich.de
on 2014-03-05 20:21:03
Ilias, if the bad_alloc with iuuse1971_01.patch stays reproducible, I'll make the SSE-opts
for enhanced mode based on issue1971_00.patch.
Ingo
Reported by heckflosse@i-weyrich.de
on 2014-03-05 21:06:37
Ilias, can you try iuuse1971_01.patch with a smaller picture than DENOISE.NEF ? D800
files really need a lot of memory.
Ingo
Reported by heckflosse@i-weyrich.de
on 2014-03-05 21:42:48
I had again the same crash with 12Mp file http://rawtherapee.com/forum/viewtopic.php?f=1&t=5322
but I had extreme settings and I don't remember after how many iterations it happened.
Starting a test with 12Mp and the same pp3 as asked in #0 ..
Reported by iliasgiarimis
on 2014-03-05 22:23:06
With 12Mp file RT didn't crash but it's a matter of time to reach the limit.
RT's base memory consumption increases around 90MB after every iteration with 12Mp
file.
With 36Mp file it increases around 270MB after each iteration. I read a peak at 2848MB
during the usual crash during 4th iteration ..
Should I try with 2277 patch ?
Reported by iliasgiarimis
on 2014-03-05 23:11:35
The memory leaks fixed in Issue 2277 aren't so big that they could need 270 MB for
each iteration. Will have a look now, whether it behaves same here.
Ingo
Reported by heckflosse@i-weyrich.de
on 2014-03-05 23:29:37
Ilias, there is something wrong with iuuse1971_01.patch. Will find it. Thank you very
much for your test!!!
Ingo
Reported by heckflosse@i-weyrich.de
on 2014-03-05 23:33:02
Ilias, Here's the fix for the crash.
Reported by heckflosse@i-weyrich.de
on 2014-03-06 10:46:39
Ingo THANKS !!!
issue1971_02.patch is stable and works better. There is no memory leak anymore, and
the speed improved ..
On the denoise.nef (36Mp) The average time of six iterations was 41.0sec with the best
at 40.6sec
This with the task manager open ..
Reported by iliasgiarimis
on 2014-03-06 16:37:00
Been using issue1971_02.patch and no issues, no difference in saved images.
Reported by entertheyoni
on 2014-03-07 12:30:15
Here's the patch I would like to commit. It includes optimizations for enhanced mode.
Ingo
Reported by heckflosse@i-weyrich.de
on 2014-03-12 14:30:37
Tested patch 3 on a set of noisy D700, applied "Default ISO High" and got this:
RT before (ms) 7 324
RT after patch_3 (ms) 6 369
A nice 13 % decrease! Great Ingo!
Reported by johan@birkagatan.com
on 2014-03-12 18:42:27
Originally reported on Google Code with ID 1971
Reported by
heckflosse@i-weyrich.de
on 2013-09-12 20:31:04