Optimization of VNG4 demosaic (and converting to float calculations)

Beep6581 commented 9 years ago

Originally reported on Google Code with ID 2278

I use the VNG4 demosaicer from time to time. It has the advantage of being very robust
and still gives decent detail (not as good as amaze or dcb of course, but more detail
means less robust).

An example when VNG4 excels is when you have green channel variation for example due
to color crosstalk on the sensor (can happen on some technical ultra-wides on medium
format sensors, probably on leica rangefinders too or Sony A7(r) with legacy wides
via adapters), and afaik there are some cameras that have native green channel offsets
too. Most other demosaicers produce mazing artifacts in these situations.

Today the VNG4 algorithm is more or less a straight lift from dcraw, and it's not parallelized
so it takes forever to demosaic. I think the unique robustness properties of the VNG4
makes it an important demosaicer and it therefore should be decently optimized.

Reported by torger@ludd.ltu.se on 2014-03-06 08:48:16

Beep6581 commented 9 years ago

I'll have a look

Reported by heckflosse@i-weyrich.de on 2014-03-06 10:56:21

Status changed: Started

Beep6581 commented 9 years ago

Here's the first patch. Intention of this patch is not to speed up processing, but to
change some internal calculations from int to float. Of course this leads to differences
between before and after this patch. Also got a speedup from 11528 ms to 10010 ms for
a D800 NEF by this changes caused by reduced number of float <=> int conversions. Will
make further changes, when differences of this patch are accepted by the team.

Ingo

Reported by heckflosse@i-weyrich.de on 2014-03-06 21:59:26

Status changed: PatchSubmitted
Labels added: Performance

_Attachment: issue2278_00.patch_

Beep6581 commented 9 years ago

A tip when attaching patches, if you have a .txt file suffix instead of .patch you can
click on it directly in any browser to read it. With .patch most browsers just let
you download the file. So issue2278_00-patch.txt would be a more convenient filename
:)

Nothing in the VNG4 algorithm itself is locked to integer math, so making it floating
point seems like a good idea, I'm okay with the patch.

Reported by torger@ludd.ltu.se on 2014-03-07 08:41:26

Beep6581 commented 9 years ago

There are at least two downsides of .txt suffix for patch files.

1.) I have to choose 'all files (*.*)' when importing a patch using TortoiseHg
2.) I lose syntax highlighting in editor when I open a patch file with suffix .txt

Anyway, here's the next patch. I changed some other ints to floats. Now vng4 is fully
floating point. Also got an additional speedup. Processing time is now 9350 ms.

Parallelized version follows later today. Have to clean the code first. Processing
time of parallelized vng4 at my system is about 2000 ms.

Ingo

Reported by heckflosse@i-weyrich.de on 2014-03-07 13:20:12

_Attachment: issue2278_01_patch.txt_

Beep6581 commented 9 years ago

i missed this entry, so i won't drop my 'improvements' in this issue.
VNG4 is very useful when you want to work with high ISO noisy images. They cleanup
much better with denoise etc.

Reported by janrinze on 2014-03-07 13:56:13

Beep6581 commented 9 years ago

Great! No problem with .txt vs .patch, use .patch if it's better for your workflow,
I can download :-)

Reported by torger@ludd.ltu.se on 2014-03-07 17:23:04

Beep6581 commented 9 years ago

Sorry if I am off topic .. :)

Talking about VNG4 .. I recall some discussions about the first stage bilinear interpolation
it uses and that it would become a better detail resoving demosaicer if a better interpolation
was used there. 
If my memory serves .. it was Manuel Llorens (from PerfectRaw .. with Emil-Jacques
and others) who had in his TODO list a related improvement. Unfortunately his site
(www.rawness.es) is dead last year(s) .. 
His last presence I know is https://sites.google.com/site/manuelllorens/ and gives
a mail address there .. manuelllorens@gmail.com

Reported by iliasgiarimis on 2014-03-07 19:24:31

Beep6581 commented 9 years ago

Re #5, Jan: I would be glad if you could review my next patch :-)

Re #7, Ilias: As you know more details about the past than me, could you try to contact
him?

I frittered a bit with further optimizations instead of cleaning the code. So I need
at least one additional day before I can post the next patch. Actually the processing
time is below 1800 ms at my system :-)

Ingo

Reported by heckflosse@i-weyrich.de on 2014-03-07 23:07:45

Beep6581 commented 9 years ago

Reported by heckflosse@i-weyrich.de on 2014-03-07 23:17:25

Beep6581 commented 9 years ago

Ingo and Torger, thanks for doing this! Vng4 is indeed a nice demosaic algorithm, and
the speedup is most welcome.

Reported by johan@birkagatan.com on 2014-03-08 12:49:07

Beep6581 commented 9 years ago

Decided to post a patch with ugly code, so you can check for speed and differences in
output. D800 NEF now needs 1650 ms at my system. Detected no differences in output
compared to issue2278_01_patch.txt. Will clean the code before next patch.
Memory footprint of VNG4 changed a bit.

Before patch it was

x + 4 * 3 * 4 * width bytes

Now it is

x + c * 3 * 4 * width bytes

where x is constant and c is number of parallel threads.

Ingo

Reported by heckflosse@i-weyrich.de on 2014-03-08 19:07:07

_Attachment: issue2278_02.patch_

Beep6581 commented 9 years ago

Ingo, your latest patch does not apply clean on latest default branch.
Apparently you did a diff against an older version than what you submitted into default
branch.

Reported by janrinze on 2014-03-08 20:03:01

Beep6581 commented 9 years ago

When applied to the right patch level then it works quite well.
It seems pretty quick. No idea if the demosaicing is better/worse than before.
Any chance that bilateral filtering could improve the sharpness?

Reported by janrinze on 2014-03-08 20:37:58

Beep6581 commented 9 years ago

Sorry for the trouble. Patch from #11 should work when applied to revision 4b8f3cb02ef0.

Ingo

Reported by heckflosse@i-weyrich.de on 2014-03-08 21:30:07

Beep6581 commented 9 years ago

Here's the next one. Code is much simpler now and it needs c * 3 * 4 * width bytes less
memory than last one. Speed is the same at my system.

Ingo

Reported by heckflosse@i-weyrich.de on 2014-03-09 13:19:34

_Attachment: issue2278_03.patch_

Beep6581 commented 9 years ago

Great work as always Ingo! Below is average of 10 D700 NEF-files (patch 3 tested):

Before: 3 622 ms
After:    480 ms

Impressive!

Reported by johan@birkagatan.com on 2014-03-10 18:57:40

Beep6581 commented 9 years ago

Johan, thanks for your test. Do you have a new machine? It's faster than mine with D700
NEF. Will try to post a new patch with simpler code and a bit (only a few %) speedup
tomorrow (that will be the patch I want to commit). Last days weather was too good
to spend a lot of time at the keyboard ;-)

Ingo

Reported by heckflosse@i-weyrich.de on 2014-03-10 23:27:57

Beep6581 commented 9 years ago

Here's the one I want to commit. It's a bit faster than issue2278_03.patch. Now it needs
1580 ms for a D800 NEF at my system.

Ingo

Reported by heckflosse@i-weyrich.de on 2014-03-11 19:19:24

_Attachment: issue2278_04.patch_

Beep6581 commented 9 years ago

Hello Ingo, re #17: No, it's the same Intel 2600k I've had a while now. What's your
timings - out of interest?

Reported by johan@birkagatan.com on 2014-03-11 19:45:42

Beep6581 commented 9 years ago

Hi Johan, I forgot that you own an i7. Here it takes about 600 ms for a D700 NEF.

Reported by heckflosse@i-weyrich.de on 2014-03-11 19:52:00

Beep6581 commented 9 years ago

i7 CPU Q 820 @ 1.73GHz, Pentax K10D raws:
~4620ms without the patch, ~750ms with :)

Reported by entertheyoni on 2014-03-11 21:11:10

Beep6581 commented 9 years ago

DrSlony, thanks for the test.

Reported by heckflosse@i-weyrich.de on 2014-03-11 21:38:21

Beep6581 commented 9 years ago

Reported by heckflosse@i-weyrich.de on 2014-03-12 21:24:33

Blocking: #2016

Beep6581 commented 9 years ago

Ingo, D800e on 12 threads @3.2 GH
Before 12500ms, after 925 ms:)!!!

Reported by michaelezra000 on 2014-03-14 03:32:54

Beep6581 commented 9 years ago

Michael, thanks for the test.

Committed to revision 36699351cd8c

Reported by heckflosse@i-weyrich.de on 2014-03-14 12:23:09

Status changed: FixedPendingConfirmation

Beep6581 commented 9 years ago

Great work!

Concerning detail resolving, if we change the bilinear filtering algorithm to improve
green detail I don't think we should do it at the cost of robustness, and I suspect
(but don't know) that extracting more detail in the greens would indeed reduce robustness.

One of the key reasons I'm using VNG4 from time to time is that it renders good results
for files with green channel instability, which I get with some technical wide angle
lenses.

Reported by torger@ludd.ltu.se on 2014-03-15 09:43:08

Beep6581 commented 9 years ago

Built and tested the committed revision, seems to work fine with great speed improvement.
Others have tested too as seen in the log, setting to fixed.

Reported by torger@ludd.ltu.se on 2014-03-15 09:46:32

Status changed: Fixed

Beep6581 commented 9 years ago

Oh, just saw an issue, the signed short int *cp pointer should not be static in this
multithreaded program, right? Seems to be some inheritance from dcraw code.

Reported by torger@ludd.ltu.se on 2014-03-15 10:35:58

Beep6581 commented 9 years ago

Anders, you're right. Missed that thing. Will correct it.

Reported by heckflosse@i-weyrich.de on 2014-03-15 10:55:57

Beep6581 commented 9 years ago

Removed that static declaration completely because it doesn't make sense to hold this
few values in memory.

Reported by heckflosse@i-weyrich.de on 2014-03-15 12:12:01

Beep6581 commented 9 years ago

I also noted that the first bilinear step skips one pixel around the edge (so I suppose
the end result will have one pixel line around which is not demosaiced), but I guess
we don't care as we crop away a few pixels around the border anyway.

Reported by torger@ludd.ltu.se on 2014-03-15 15:42:35

Beep6581 commented 9 years ago

Exactly.

Reported by heckflosse@i-weyrich.de on 2014-03-15 15:55:04

Beep6581 commented 9 years ago

Would it be better to do

    pix[ip[0]] = sum[ip[0]] / csum[row & 15][col & 15][i];

as

    pix[ip[0]] = sum[ip[0]] * csum[row & 15][col & 15][i];

in the bilinear filtering step (ie use multiplication instead of division)? As the
constant table csum is nowadays float so it can be set to

        csum[row][col][colcount] = 1.0 / sum[c];

instead of the old

        csum[row][col][colcount] = sum[c];

Reported by torger@ludd.ltu.se on 2014-03-27 11:58:43

Beep6581 commented 9 years ago

This allocation is also a bit suspicious, a dcraw inheritance I suspect.

    int * ip = (int *) calloc ((prow+1)*(pcol+1), 1280);

It assumes some size of int, hopefully 8 bytes, but I'm not so sure. In other code
(I have not verified it in raw therapee) when I've used dcraw-inspired vng4 code I've
had memory overruns on this buffer. From the loop under it's hard to figure out exactly
how big the buffer needs to be. In that case I was lazy and made it a lot bigger (and
related it to sizeof(int)).

Reported by torger@ludd.ltu.se on 2014-03-27 12:06:19

Beep6581 commented 9 years ago

Re #33: I tried your suggestion. On my (memory-bandwith-limited) system, I get a speedup
of 14 ms by changing this calulation using a D800 NEF. That's less than 1%.

Re #34: Yes, it's from dcraw. Will have a look.

Reported by heckflosse@i-weyrich.de on 2014-03-27 12:37:55

Beep6581 commented 9 years ago

Re #34, Anders: I had a look at the allocation. It's really hard to figure out how big
the buffer needs to be. But I ran it through valgrind for soma files and got no invalid
reads or writes.

Ingo

Reported by heckflosse@i-weyrich.de on 2014-03-27 22:20:11

Beep6581 commented 9 years ago

s/soma/some/ :-)

Reported by heckflosse@i-weyrich.de on 2014-03-27 22:23:02

Beep6581 commented 9 years ago

Ok... sorry for having made you spend time on it, I was genuinely worried though...

Reported by torger@ludd.ltu.se on 2014-03-28 07:42:51

Beep6581 / RawTherapee

Optimization of VNG4 demosaic (and converting to float calculations) #2262