Beep6581 / RawTherapee

A powerful cross-platform raw photo processing program
https://rawtherapee.com
GNU General Public License v3.0
2.95k stars 327 forks source link

LMMSE demosaicing - speed #1786

Closed Beep6581 closed 9 years ago

Beep6581 commented 9 years ago

Originally reported on Google Code with ID 1802

Here is a patch that improves the speed of LMMSE demosaicing

I checked with "Photoshop" - layer difference - on multiple hard images, no differences.

estimated speed gain: 50%

eg : D800 noisy

Without patch : 10 seconds
With patch : 5 seconds

Reported by jdesmis on 2013-03-25 12:39:29


Beep6581 commented 9 years ago

Reported by entertheyoni on 2013-03-25 14:28:03

Beep6581 commented 9 years ago
Just my curiosity .. isn't xdiv2f(xdiv2f(n)) faster than 0.25*n ??.

Reported by iliasgiarimis on 2013-03-25 15:22:36

Beep6581 commented 9 years ago
Not necessarily. But we can make a xdiv4f, which is as fast as xdiv2f

xdiv4f

if (*(int*)&f & 0x7FFFFFFF) { // if f==0 do nothing
    *(int*)&f -= 2 << 23; // add n to the exponent
}

The original code is xdivf( float f, int n)
if (*(int*)&f & 0x7FFFFFFF) { // if f==0 do nothing
    *(int*)&f -= n << 23; // add n to the exponent
}

which divides f by 2^n

but then we have to write xdivf( x, 1 ) to divide by two and xdivf( x, 2 ) to divide
by 4, which decreases readability 

Reported by heckflosse@i-weyrich.de on 2013-03-25 15:37:53

Beep6581 commented 9 years ago
Why not, the readability cost looks acceptable (to me) if the speed gain will be substantial
..

Reported by iliasgiarimis on 2013-03-25 16:22:47

Beep6581 commented 9 years ago
Why not ?

but the biggest gain is in OMP  

10 sec without patch (average of 4 to 5 tests  on "D800 noisy file")
5.1 sec with "OMP"
5 sec with "xdiv2f" (and flags for compilation)

but perhaps Ingo can still improve OMP  ??

:)

Reported by jdesmis on 2013-03-25 16:31:33

Beep6581 commented 9 years ago
OK, sorry if i disturb you guys.

5.1 vs 5.0 is 2% diff .. not substantial. But maybe it's more for my old 2 core CPU
.. 

Reported by iliasgiarimis on 2013-03-25 16:56:50

Beep6581 commented 9 years ago
Hello Ilias

You don't disturb us.

Any suggestion is welcome to improve RT

:)

Reported by jdesmis on 2013-03-25 17:11:42

Beep6581 commented 9 years ago
You're absolutely right with the 2%. I made the xdiv2f as I tried to speed up amaze
a bit, which isn't easy. But actually I've a gain of 18% by making many small changes,
which all are in the 2%-range (one of them is using xmul2f and xdiv2f), but sum up
to 18%. When I reach 20%, I'll make a patch for amaze.

Ingo

Reported by heckflosse@i-weyrich.de on 2013-03-25 17:34:35

Beep6581 commented 9 years ago
I just wanted to mention that when you do these benchmarks you should remember to turn
off your CPU frequency governing daemon. It should set max CPU speed when needed, but
it might still throttle it down as the temperature rises, or misinterpret a momentary
dip in processing for permission to throttle down. When I turn off cpufreqd and then
manually set each core to max speed, RT benchmarks have a much higher accuracy and
precision, to within hundredths of a second. With cpufreqd running (even if I use the
"userspace" governor) precision drops, with differences between each run of even over
a second.

Reported by entertheyoni on 2013-03-25 17:59:58

Beep6581 commented 9 years ago
There is no such daemon at my system. Energy saving is also disabled. Overheating is
also no problem. I get very consistent results by always processing 5 pics, delete
the fastest and slowest time and average the remaining 3 times.

Reported by heckflosse@i-weyrich.de on 2013-03-25 20:18:52

Beep6581 commented 9 years ago
Hmm supposing xdivf( x, n ) exists .. 

I am thinking that then the coder can change all coeffs in

300 f[0]=1.0f/(1.0f+2.0f*fabs(1.125f*pix[-v][c]-0.875f*pix[0][c]-0.250f*pix[-x][c])+fabs(0.875f*pix[u][1]-1.125f*pix[-u][1]+0.250f*pix[-w][1])+fabs(0.875f*pix[-w][1]-1.125f*pix[-u][1]+0.250f*pix[-y][1])

with fast divisions plus additions (fast anyway) when needed .. for example 1.125f*
to x + xdivf(x, 3) ... 0.875f* to xvidf(x, 3) - x ... 

Is this correct ??.
Any benefit from using xfabsf instead of fabs ??.

There are ten float coeffs. Jacques's patch changed only one and gave a 2% speedup,
how much can we expect from changing all possible slow operators with sleef functions
??.  

Reported by iliasgiarimis on 2013-03-26 03:25:19

Beep6581 commented 9 years ago
Here is a slightly improved patch

I added "xdivf" to "sleef.c"

and in "demosaic_algos.cc" I added functions "include" for 0.0625, 0.25, ...

Ilias: the line you describe is not used in LMMSE but by "refinement" not used ...

The time saving is minimal, about 2%, but we can not go further, otherwise optimizing
(?) OMP.

:)

Reported by jdesmis on 2013-03-26 08:40:58


Beep6581 commented 9 years ago
Jacques, thanks.

I just saw that you also optimized these lines with xdiv2f/xmul2f and supposed that
there is a reason for this .. :)

Reported by iliasgiarimis on 2013-03-26 11:50:35

Beep6581 commented 9 years ago
@12 (xfabsf): Made some tries with xfabsf with different results. Sometimes it's faster,
sometimes slower. Wondered why. Then I detected, that at the part of source, where
I tried xfabsf, there've been mixtures of float-vars and double-constants (using double
constants in expressions normally leads the compiler to use doubles for the whole expression),
so we get an additional double->float conversion when using xfabsf, which eats all
the gain of possibly faster processing of xfabsf. First thing to do seems to be cleaning
the parts of the code, where floats and doubles are mixed. Registered, that Jacques
already began doing this.
Another thing: All the sleef-functions are declared inline to get maximum speed. But
by heavy use of this functions, we run into the inline-limits of gcc. If you compile
the engine with -Winline, you'll see, which inlines are not done. Then try to add --param
large-unit-insns=40000 --param inline-unit-growth=200, which quadruples the default
settings of gcc and see again. Code size is actually 40KB larger when using these switches.

Ingo

Reported by heckflosse@i-weyrich.de on 2013-03-26 21:07:45

Beep6581 commented 9 years ago
Ingo,

Sleef.c v2.60 is out http://freecode.com/projects/sleef

26 Mar 2013 12:51

Release Notes: This release added the remaining single precision functions: powf, sinhf,
coshf, tanhf, exp2f, exp10f, log10f, and log1pf. It also added support for FMA4 (for
AMD Bulldozer).

Reported by iliasgiarimis on 2013-03-27 00:49:09

Beep6581 commented 9 years ago
Ilias, thanks a lot for the information. I'll make the adaptations needed to use it
in RT.

Reported by heckflosse@i-weyrich.de on 2013-03-27 10:57:46

Beep6581 commented 9 years ago
I "improved" the patch.

In fact, the maximum time is consumed by the median which reduces some artifacts -
especially those related to noise.

After extensive testing on noisy images, I chose a basic setup with two passes for
the median instead of three. The time saving is about 20%.

I added a slider that allows you to vary the number of passes of median, zero to four
(default = 2)

On my machine with a noisy D800 file, I get processing time

4 passes = about 6 seconds
3 passes = about 5 seconds
2 passes = about 4 seconds (default)
1 pass = about 3,1 seconds
0 pass = about 2,2 seconds

Reported by jdesmis on 2013-03-28 10:53:47


Beep6581 commented 9 years ago
I made ​​other improvements:
* Reduce memory usage: 10%
* Update "enhancement"==>"LMMSE enhancement Steps":
   * steps 0==>3 : number of passes "median" 
   * steps 4 and 5 = median 3 + "refinement 1 or 2 passes" which improves signal to
noise ratio

Reported by jdesmis on 2013-03-29 06:38:22


Beep6581 commented 9 years ago
The same, with :
* slighty enhancement
* update "new" default

:)

Reported by jdesmis on 2013-03-29 11:52:11


Beep6581 commented 9 years ago
Jacques, let me express some thoughts

looks like you aborted any sleef optimization in the new refinement code (lines 419-440)
!. 
Is it much trouble for little benefit ??. Or not to "run into the inline-limits of
gcc" (as Ingo wrote) so save inlines for where they count most ??.

The name xdivf has a little resemblance with what it really does. I think xdiv2nf would
better describe its functionality.

Reported by iliasgiarimis on 2013-03-29 16:21:20

Beep6581 commented 9 years ago
Hello Ilias

I have done extensive testing with the functions "refinement" (artifacts, bad pixels,
noise, edges, etc.) ... The one you mention is not used (it is between /* */). In this
new patch I deleted.

I tried to optimize the use of the previous version of "refinement_lassus" (should
not be used with all demosaicing !)

I added the generic function "xdivf" to "sleef.c" and to facilitate the use and reading
I created several functions (demosaic_algos.cc).

(in "sleef.c")
__inline float xdivf( float d, int n){
    if (*(int*)&d & 0x7FFFFFFF) { // if f==0 do nothing
        *(int*)&d -= n << 23; // add n to the exponent
        }
    return d;
}

(in "demosaic_algos.cc")    
#define x1125(a) (a + xdivf(a, 3))  ==> 1.125f
#define x0875(a) (a - xdivf(a, 3)) ==> 0.875f
#define x0250(a) xdivf(a, 2)  ==>0.25f
#define x00625(a) xdivf(a, 4)  ==> 0.0625f
#define x0125(a) xdivf(a, 3) ==> 0.125 f

In addition I have slightly modified the slider "LMMSE enhancement steps"

0 = no gamma, no median, no refinement ==> minimize processing time
1 = gamma, no median, no refinement => as above but the "gamma" (sRGB) changes the
response
2, 3, 4 = gamma , 1 or 2 or 3 median, no refinement
5,6 = gamma, 3 median, 1 or 2 refinement  ==>  maximum processing time

:)

Reported by jdesmis on 2013-03-29 16:54:11


Beep6581 commented 9 years ago
Jacques , thanks for the explanations.

Am I blind ? ..once again missed /* */ .. sorry

Although I had already spotted your work at sleef optimizations (not that blind here)
it's easily readable even by me .. nice-tidy work.

One last question. If I am not wrong (again) bilinear is used for color interpolation.
May be its the cause for a part of the artifacts as its a bad algo regarding aliasing-moire
(see bilinear demosaic). 
Did you and/or Luis researched for a better algo for this task ??. As Luis wrote at
the forum "if we need refinement after demosaic something is behaving badly during
demosaic" and it would be better to correct it in the root.  

Reported by iliasgiarimis on 2013-03-29 20:03:26

Beep6581 commented 9 years ago
Completely out of topic: We take advantage of existing sleef-functions in RT and also
add new functions to sleef.c. Maybe we should coordinate our work with the author of
sleef, so it would be easier to use his updates. Maybe he's also interested to integrate
the xdiv/xmul-functions in his library? Maybe he's interested in posting, that RT uses
his functions? He did it for other projects (see the news of http://shibatch.sourceforge.net/
). How about this?

Ingo, back in one week...

Reported by heckflosse@i-weyrich.de on 2013-03-29 23:15:00

Beep6581 commented 9 years ago
Communication is never a bad thing!

Reported by entertheyoni on 2013-03-30 01:58:36

Beep6581 commented 9 years ago
cat /tmp/lmmse.pp3 
[RAW]
CcSteps=0
Method=lmmse
LMMSEIterations=2

Before patch:
Build flags:  -march=native -fopenmp -O3 -DNDEBUG
Benchmark "lmmse.pp3" average...................... 6.875

After patch:
Build flags:  -march=native -fopenmp -O2 -DNDEBUG (this makes no difference, but I'd
rather mention it)
Benchmark "lmmse.pp3" average...................... 3.308

Great!

Reported by entertheyoni on 2013-03-30 02:04:46

Beep6581 commented 9 years ago
@ Ingo
Totally agree with you. Can you fill this role, since it was you who started this work.

@ entertheyoni:
thank you for these tests

@ Ilias
Yes there is bilinear, but mixed with the algorithm "directional Linear Minimum Mean
Square Estimation-error" which makes it a very good algorithm ... which has nothing
to do with a simple bilinear

LMMSE is good for solving the "moiré" see the image of the lighthouse http://filebin.net/5dctq1oxr0
on the test file and my point of view than enough for noisy images.
For refinement, it is the view of Luis, but:
* Does not exist the perfect demosaicing
* in RT look at DCB...you can see "DCB enhancement steps" which is a refinement
* why they added "False color suppression steps"
* why do we asked IGV and LMMSE for noisy images
* etc.

and the ranking of interpolations is based on his own evaluation criteria:
* Processing time
* Bad pixels
* False colors
* Ability to reproduce the details
* Moire (Nyquist)
* Noisy images
* Ability to restore the diagonals
* Noise attenuation
* Signal to noise ratio (PSNR)
* chromatic aberration
* mazes
* DeltaE: colors respect
* edges
* Etc..
* Etc.

The main drawback of LMMSE is to generate edges "rough", but its main advantage is
to provide plans (solids) "thin" for noisy images (avoids pseudo mazes)

For the majority of images AmAze  (thanks to Emil) does a very very good job.

If you look at the code LMMSE :
* If you put "LMMSE enhancement steps" to zero, you have the original code (matlab)
by L. Zhang and X. Wu
If we put 4 the code is made by a change (median + gamma) - I think this is from Manuel
(??)

I added Refinement because I think for some images, the algorithm "LMMSE" weakens some
pixels and the purpose of refinement is to restore strength and improve the PSNR
Of course I do not attach importance to this addition, if users do not find interest
can delete.

I could have instead of the slider, 3 controls:
* checbox for gamma
* slider (3 positions) for median
* slider (2 positions) for refinement
But it's complex...I prefer one...at least for testing

From my point of view if we want to simplify, you must remove the slider and only keep
improving "gamma" ... We win processing time and result before denoising is really
good
For a D800 noisy :
* LMMSE without median and refinement ==> 2 seconds
* one pass of median ==> 1 second
* one pass of refinement ==> 1 second 
* LMMSE as in Dcraw (without OMP) ==> 10 seconds  (gamma + 3 median) 

But if you want to keep the improvements  (median, without gamma, refinement), then
the solution (one control) is it not the easiest ?

Reported by jdesmis on 2013-03-30 06:12:12

Beep6581 commented 9 years ago
Hello Jacques,

Wow .. It's an educative post .. and useful for an article in the manual ..

About demosaic algos .. one of the first priorities when developing an algorithm is
speed so sometimes the developers choose a solution which gives less quality but gain
speed. It's about balance.
Emil's last post here (in google code) mentions that he choose a suboptimal color interpolation
to gain speed and this is the first step he would give attention for a better Amaze2
now that we have Ingo to solve speed problems.

Your implementation of LMMSE controls looks perfect to me. And I think it will be easy
to understand and give better results for most users if it is properly documented in
the manual and the tips.

Indeed IGV and LMMSE give better results on noisy images. Thanks.

I am waiting for a win32 built to test the new version. As I understand from your writing
there are no artifacts from refinement in this case :) ..

Reported by iliasgiarimis on 2013-03-30 14:30:50

Beep6581 commented 9 years ago
Hi Jacquez, could you explain the meaning of gamma in "LMMSE enhancement steps" - I
am curious, what does it do? How would sRGB be different than ProPhoto - which would
be better for lower noise demosaic?

Reported by michaelezra000 on 2013-03-30 14:53:45

Beep6581 commented 9 years ago
@ Ilias
Thank you for your review

@ Michael
Here we are not talking about color space (sRGB, Prophoto ..), but gamma. As you know
there are several types of gamma:
* Linear => gamma=1, does not change the data
* Simple logarithmic ==> gamma=2.2 or 1.8 etc..
* Consists of a)a linear portion and b) another logarithmic ==> eg "sRGB gamma" linear
until 12.92 then logarithmic 2.4
This is it (generally found in RT) which is applied before demosaicing, then its inverse
at the end.
This changes the appearance of "rawData" and will differentiate the action on the high
and low lights.

For example take the image I put in link "fileRGGB_sg.DNG"
* Look at the picture of the boat (sail) marked "14255" and try "LMMSE enhancement
step" to 0 and then to 1 (ie without gamma and with gamma)
* Also look the hats "Bahamas" and do the same test

The differences are in my view important

:)

Reported by jdesmis on 2013-03-31 05:48:59

Beep6581 commented 9 years ago
I put another "refinement", faster and less "aggressive" I think in some cases

Reported by jdesmis on 2013-03-31 09:24:00


Beep6581 commented 9 years ago
Michael's question about gamma is mine too.

I understand that this gamma thing is useful to adapt the intensity of any filtering
or the interpolation direction decisions, to the raw levels (more at the darks and
gradually decreasing at higher levels). It's the same logic under the gamma slider
at Emil's NR algo.

The question is .. is this sRGB gamma generally optimal or one could use another "response
courve" which could be better. For example the first courve I would try is the simple
logarithmic gamma=2.0 for its simplicity .. and in part because we can optimize it
better with sleef functions. :)
Since it is a copy of sRGB gamma encoding, it's not clear why this specific composite
curve (linear at darks and then log 2.4) is used on raw data. Is it tuned for specific
cases (noisy raws) or is it a generally averaged optimum by coincidence same as sRGB
gamma ??.

Thinking a little deeper ... i think this gamma regulation didn't came up magically.
It has to be related to 
a) some physical properties of the recorded (in the raw file) signal.
b) Or the properties of the Human Visual Perception. 
Or both of the previous ..

In "case a" the first signal property that looks as possible reason for a "variable
intensity" median filtering is noise.  Noise makes difficult the decision about interpolation
direction, effectiveness of Least Square Error calculations, etc .. Noise is more at
dark (underexposed) areas (low raw levels) and less on well exposed (high raw levels).

There are three main components of noise 
1. photon noise depends on exposure, Poison distribution .. stdev = sqrt(captured photons)
.. hmm a gamma=2 looks ideal here
2. PRNU Pixel Response Non Uniformity  expressed as percent of photons captured, typically
stdev =  0.3% - 1.0% * photons captured
3. Electric noise induced by electronic parts of camera. Independent of photons captured
...Ndark
The above three components added in quadrature give the total N = sqrt( Nphoton^2 +
Ndark^2 + PRNU^2).

http://www.dxomark.com/index.php/Publications/DxOMark-Insights/Noise-characterization/Summary

Copy from DxO .. useful as a rule of thumb for a desired curve ..
"Shadows (): the SNR increases 6dB for every EV and loses 6dB for each doubling of
the ISO setting.
Midtones (): the SNR increases 3dB for every EV and decreases by 3dB for each doubling
of the ISO setting.
Highlights (): the SNR is constant and does not depend on the ISO."

Another source of related information is Emil's http://theory.uchicago.edu/~ejm/pix/20d/tests/noise/
plus some post of him at Dpreview and LuLa. 
I hope he will find the time to comment on this, to correct possible errors and generally
put things in order.

Reported by iliasgiarimis on 2013-03-31 20:36:04

Beep6581 commented 9 years ago
@ Ilias

Good question :)

In fact it is not easy!
1) I am not the author of the code - I did that optimize
2) no it is not magic, in this case it is a)(your comment) ==> 99%
3) gamma can be or has been introduced around the dcraw/Perfectraw /RT code. For example,
we find:
a) in the first trials of amAze (Emil was then deleted)
b) in the code for "Denoise RT" (part Lab) - not accessible to the user.
c) in other parts of the code that I had developed for dcraw (denoise...)
d) etc..

We also used "pyramids" to adapt the action to the luminance in other cases.

For b) (FtblockDN.cc) I did many tests, but I still got to use a curve with variable
gamma (linear + logarithmic), because its action is different in the shadows.
The code "Raw Lab" is obtained with a linear gamma=11 logaritmic=2.6
The code "TIFF" is obtained with a sRGB gamma linear=12.92 logaritmic=2.4
But the differences are extremely small

But be careful to change the gamma for a more simple (2.2). We will win nothing (time),
because it uses Lutf already established in RT

In this case, I think - but I'll try in the day - that different gamma will bring minor
variations - and add a control will further complicate the use (and understand!)

:)

Reported by jdesmis on 2013-04-01 06:08:43

Beep6581 commented 9 years ago
I made several tests with different gamma.

Of course with Photoshop "layers difference" to 600% we see small differences.
What seems obvious the more gamma approximates the linear shadows (eg gamma BT709)
the more  you are closer to linearized...

To my taste, (but it is very subtle) rendering gamma sRGb is good ... better (??) a
little than gamma BT709 or 2.2.

I have the opportunity to make two cosmetic changes !

If no further comments I'll commit tomorrow 

:)

Reported by jdesmis on 2013-04-01 07:54:24

Beep6581 commented 9 years ago
About speed gain .. So it is again unused code (line131) .. :( ... I have to learn C++
elementaries ..

About gamma.

I did not say that sRGB gamma is wrong and g=2.0 is better in general (but it can be
on noiseless data). I just mentioned (partly answering to Michael) that there should
be a reason for a multipart curve. And the multipart noise curve (DxO link) proοves
this need. But perhaps we can optimize the curve shape based on measured data. 

If "case a" is valid then we have to adapt to heavily varying noise curves (depending
on pixel size, ISO, sensor technology) with a single curve even if it is multipart
..

If you want some SNR curves ( >100 data points each, from DxO ) ask and I can have
them within a day. Samples from 2-3 cameras X4-5 ISO about 15 curves  are enοugh if
selected properly.

Reported by iliasgiarimis on 2013-04-01 08:39:11

Beep6581 commented 9 years ago
I reworked the gamma ...
Firstly, we are interested by demosaicing and not directly by noise ... even if there
are interactions. As I said to the noise we have 2 gamma and inverse - one was available
to the user to better treatment

After many tests ... I chose a "gamma 2.4 17.0" which seems a little more efficient
for demosaicing that "gamma sRGB  (2.4 12.92)"

I also "cleaned" the code ... which causes a very small improvement.

Reported by jdesmis on 2013-04-02 07:12:48


Beep6581 commented 9 years ago
I haven't had time to test for quality, but it does work here, for whatever that's worth.
Gentoo 64.

Reported by entertheyoni on 2013-04-02 23:24:43

Beep6581 commented 9 years ago
Jacques, I examined the quality with lmmse_speed8.patch and it is very good on noisy
images! In particular, it is very effective in not producing large color blotches in
dark shadows.

Reported by michaelezra000 on 2013-04-03 00:58:56

Beep6581 commented 9 years ago
commit to default

Reported by jdesmis on 2013-04-03 05:29:20

Beep6581 commented 9 years ago
@27, Jacques. I'll do that next week!

Ingo

Reported by heckflosse@i-weyrich.de on 2013-04-06 21:46:42

Beep6581 commented 9 years ago
Contacted the author of sleef-library. He'll make a link to RT from his page, as soon
as the official versions of RT, which use his libraries, are released.

Ingo

Reported by heckflosse@i-weyrich.de on 2013-04-09 20:56:24

Beep6581 commented 9 years ago
Thank you for this information 

Jacques

Reported by jdesmis on 2013-04-10 05:02:51

Beep6581 commented 9 years ago

Reported by entertheyoni on 2014-11-06 12:53:11