Beep6581 / RawTherapee

A powerful cross-platform raw photo processing program
https://rawtherapee.com
GNU General Public License v3.0

Speedup for LMMSE demosaic #2648

Closed Beep6581 closed 9 years ago

Beep6581 commented 9 years ago

Originally reported on Google Code with ID 2665

I opened this issue to get an issue number before I start working on it.
As usual for demosaic speedups, I'll make a series of small patches.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-11 21:06:05

Beep6581 commented 9 years ago
Ilias, at this step of the pipeline there is no difference in memory usage between edit
mode and queue (except for the memory needed for the editor's GUI).
The reductions for queue mode come after demosaic.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-22 13:25:23

Beep6581 commented 9 years ago
Isn't the memory used by the GUI around 500+ MB? Isn't that extra free space for use
by demosaic algorithms when we use queue?

Reported by iliasgiarimis on 2015-02-22 14:16:20

Beep6581 commented 9 years ago
This patch reduces peak memory usage by (width+20)*(height+20)*8 bytes, but only in
the case where there's not enough memory.

The allocation of buffers for red, green and blue is now inside the lmmse demosaic
function (only for lmmse of course). That means the processing time of this function now
also includes the time to allocate these buffers. So it may look like it got a bit slower,
but that is not the case, because the time to allocate these buffers is now saved before
lmmse demosaic.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-22 14:28:19
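
To put a number on the formula in the comment above, here is a minimal Python sketch. The function name is made up for illustration and is not part of RawTherapee; the frame size is the 9274 x 5756 case discussed later in this thread.

```python
def lmmse_peak_saving_bytes(width: int, height: int) -> int:
    """Peak saving quoted in the comment: (width+20)*(height+20)*8 bytes."""
    return (width + 20) * (height + 20) * 8


# Frame size taken from the crash reports further down the thread.
saving = lmmse_peak_saving_bytes(9274, 5756)
print(f"{saving} bytes, about {saving / 2**20:.1f} MiB")
```

For frames of this size the saving is a few hundred MiB, which is why it matters on a 32-bit system.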


Beep6581 commented 9 years ago
Ilias, I was wrong in #62. There's the buffer for the decoded data (width*height*4 bytes),
which is freed in queue before demosaic.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-22 15:23:04

Beep6581 commented 9 years ago
Thanks for the patch, can I apply it on the latest commit ?

Reported by iliasgiarimis on 2015-02-22 16:08:15

Beep6581 commented 9 years ago
Yes.

Reported by heckflosse@i-weyrich.de on 2015-02-22 16:54:17

Beep6581 commented 9 years ago
I just tested the "no-crash limit" with v14.

Nothing changes regarding the crash point .. RT crashes at exactly the same size as
before. 
All is fine up to 9274 X 5755 but crashes at 9274 X 5756. This happens with all programs
closed and with RT in both edit and queue modes.

The only difference was that in queue mode the console stayed alive and a message appeared
.. see the attachment

Reported by iliasgiarimis on 2015-02-22 17:06:40


Beep6581 commented 9 years ago
I have to add that in edit mode the crash is instant while (as is obvious from the screenshot)
in queue RT passes all steps in LMMSE but crashes in the exit of lmmse !!  

Reported by iliasgiarimis on 2015-02-22 17:21:22

Beep6581 commented 9 years ago
Ilias, I'll have a look. Maybe there's an error in the patch. The problem is that I can't
easily test it myself.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-22 17:32:56

Beep6581 commented 9 years ago
Ilias, the screenshot tells me that lmmse completed without a crash, so the error has
to be after lmmse. So reducing lmmse memory consumption seems to be worthless for this
case. That also explains #58.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-22 17:41:06

Beep6581 commented 9 years ago
Ilias, do you have hl recovery 'colour propagation' enabled?

Reported by heckflosse@i-weyrich.de on 2015-02-22 17:44:14

Beep6581 commented 9 years ago
#61 it was crashing instantly up to v13 with both edit and queue. Now with v14 it only
crashes instantly in edit mode, while it completes in queue.

#62 No, it is neutral pp3 + lmmse, color propagation is not used because it crashes
even with 35Mpix files .. in fact color propagation for big files is my next request
:) and then comes CIECAM CDBL .. something changed lately and it crashes on 24Mp files
(it was up to 28Mp after your memory consumption decreases) :(

Reported by iliasgiarimis on 2015-02-22 17:57:10

Beep6581 commented 9 years ago
Will my compilation help if used in a win32 virtual machine ?
https://drive.google.com/folderview?id=0B0NqktTgc54seURybXhsWE5QN00&usp=sharing

Reported by iliasgiarimis on 2015-02-22 18:06:18

Beep6581 commented 9 years ago
Ilias, I don't understand. In #63 you say it completes in queue and in #58 you say it
does not?

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-22 18:12:15

Beep6581 commented 9 years ago
Ilias, I'll try now with a Linux VM assigning only 3 GB of RAM to the VM.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-22 18:14:02

Beep6581 commented 9 years ago
in #63 I forgot the quotes .. should be "completes" to describe what we see in the screenshot
..  

Reported by iliasgiarimis on 2015-02-22 18:25:27

Beep6581 commented 9 years ago
Hmm, I just saved a 64.3 MP file using neutral profile + lmmse 6 steps from editor in
a linux box with just 1 GB RAM (but 8 GB Swap space). Took a while, because of swap,
but worked...

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-22 19:03:57

Beep6581 commented 9 years ago
Ilias, I think, I will commit patch 13. What do you think?

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-22 19:07:42

Beep6581 commented 9 years ago
Ilias, there's an earlier free of about width*height*4 bytes possible in the pipeline
for queue mode just after demosaic. I'll open an Issue when this one is committed.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-22 19:31:22

Beep6581 commented 9 years ago
I think patch13 is fine for commit and a very significant improvement both on speed
and "capacity" vs the already committed patch5.

Regarding the decreased memory only in queue I think it is unusable for demosaicers..
one can only apply it blindly .. I find it of very limited use ?. 

I still do not understand how/why the decreased memory consumption in patch08 & patch14
and the decrease of memory consumption in queue mode play almost no role in the "no-crash
limit" improvement (patch 08 improved the limit from 44.5 to 45.3 only !!)

Reported by iliasgiarimis on 2015-02-22 20:23:52

Beep6581 commented 9 years ago
Ilias, ok, I'll commit patch 13 now.

'Regarding decreased memory only in queue' : What do you mean by 'one can only apply
it blindly'?

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-22 20:37:05

Beep6581 commented 9 years ago
By blindly I mean that I have to first apply LMMSE in edit mode to see how it looks,
which I cannot (crash). The alternatives of very limited use are
- to export with queue then reopen etc .. 
- to decrease the raw-crop size in camconst.json at lower than the edit mode "no-crash
limit", process as if it was full size .. then close RT, reset the normal raw-crop
in camconst.json and queue :) :)

Let's first see how patch13 behaves on the various win32 machines, maybe mine
is an exception .. and I just have to change to a modern win64 OS :)

Reported by iliasgiarimis on 2015-02-22 21:01:16

Beep6581 commented 9 years ago
Patch 13 committed to revision 0ab0d951e274

Reported by heckflosse@i-weyrich.de on 2015-02-22 21:02:12

Beep6581 commented 9 years ago
Ilias, ok!

Reported by heckflosse@i-weyrich.de on 2015-02-22 21:04:07

Beep6581 commented 9 years ago
Ilias, re #73. I spoke about queue export vs. 'save file' from edit mode. Just viewing
a file in editor is a different story and needs less resources.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-22 23:25:28

Beep6581 commented 9 years ago
Ingo, I only used viewing in editor and exporting with queue in my tests. There the
limit was 9274 X 5755

I just now tried exporting from edit mode with v14 and it crashes with 9274 X 5755
in the same way queue crashed with 9274 X 5756 (see screenshot at #58)

I will continue tomorrow .. 

Reported by iliasgiarimis on 2015-02-23 02:03:14

Beep6581 commented 9 years ago
Ilias, is the limit the same with other demosaicers?

Reported by heckflosse@i-weyrich.de on 2015-02-23 12:04:53

Beep6581 commented 9 years ago
For the editor view, demosaicers fast, amaze and IGV have no problem.

For exporting by save and/or queue I'll test ASAP

Reported by iliasgiarimis on 2015-02-23 16:04:06

Beep6581 commented 9 years ago
With AmaZe and queue I can go up to 9274 x 5800, but there I cannot always export the
jpeg/tiff because RT crashed on 4 of 7 tries (the same message as in the screenshot
#58)

Is it possible that it's a problem with the swap file?
I am running out of space lately :( only 10GB free disk space ..

Reported by iliasgiarimis on 2015-02-23 18:03:12

Beep6581 commented 9 years ago
Ok, so it's not really related to LMMSE

Reported by heckflosse@i-weyrich.de on 2015-02-23 18:17:18

Beep6581 commented 9 years ago
Small speedup for the median pass of lmmse. Just for the sake of completeness, nothing
to write home about.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-28 18:58:05


Beep6581 commented 9 years ago
Ingo, what kind of tests are needed to understand where the failure is and improve demosaic
"capacity" on win32 OS?

My main problem is that only LMMSE crashes on editing, all others pass this stage :(
then at exporting amaze also crashes but a bit higher. I didn't test any other .. I
plan to test IGV later today.

Reported by iliasgiarimis on 2015-02-28 22:20:08

Beep6581 commented 9 years ago
Ilias, patch 15 doesn't change memory usage. It's just a small speedup for median pass.

About memory usage. Most of the other demosaicers ('Amaze', 'Fast', 'Vng4' ...) are
tiled and have a much smaller memory footprint. LMMSE still needs 5 buffers of size width*height*4
bytes (though that's much less than before). I can make an overview of the memory footprint
of all demosaicers in case you're interested.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-28 22:35:42

Beep6581 commented 9 years ago
As I read in my task manager, with 53.4Mp files where LMMSE crashes I have
Total     3581 MB

System     800 MB
Base RT    900 MB
LMMSE     1068 MB (53.4X20)

Consumed  2768 MB 

I should not have run out of memory, as I have 3.0GB for programs

Reported by iliasgiarimis on 2015-02-28 23:14:45

Beep6581 commented 9 years ago
Ilias, untiled processing has the downside of using not only more buffers than tiled
processing, but also much bigger buffers, which can lead to problems when memory is
fragmented.

Reported by heckflosse@i-weyrich.de on 2015-02-28 23:20:30

Beep6581 commented 9 years ago
Ilias, what do you mean by Base RT 900 MB ?????

Reported by heckflosse@i-weyrich.de on 2015-02-28 23:21:18

Beep6581 commented 9 years ago
Ilias, in #86, 'more buffers' should be 'more buffer space'.

Reported by heckflosse@i-weyrich.de on 2015-02-28 23:33:23

Beep6581 commented 9 years ago
Rough calculation for 53.4 MPixels:

rt base usage (about 50 MB)                                 :    52.428.800 Bytes
dcraw decoded Image (before preprocessing)                  :   213.600.000 Bytes
Buffer for rawdata (before demosaic)                        :   213.600.000 Bytes
Buffer for red, green and blue (allocated before demosaic)  :   640.800.000 Bytes
5 Buffers during lmmse demosaic                             : 1.068.000.000 Bytes

Sum                                                         : 2.188.428.800 Bytes

The problem is that each of the 5 buffers inside lmmse demosaic needs about 200 MB.
If you fail to get one buffer of this size (because of memory fragmentation), lmmse
will fail.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-28 23:46:18
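
The rough calculation above can be reproduced with a few lines of Python. This is only a sketch: the dictionary keys are shortened labels, and the assumption that every buffer uses 4 bytes per pixel is inferred from the figures in the table, not taken from the RawTherapee source.

```python
PIXELS = 53_400_000        # 53.4 MPixels, as in the comment above
BYTES_PER_VALUE = 4        # assumption: reproduces the 213.600.000-byte entries

budget = {
    "rt base usage (about 50 MB)": 50 * 1024 * 1024,
    "dcraw decoded image": PIXELS * BYTES_PER_VALUE,
    "buffer for rawdata": PIXELS * BYTES_PER_VALUE,
    "buffers for red, green and blue": 3 * PIXELS * BYTES_PER_VALUE,
    "5 buffers during lmmse demosaic": 5 * PIXELS * BYTES_PER_VALUE,
}

total = sum(budget.values())
largest_single_buffer = PIXELS * BYTES_PER_VALUE  # must be one contiguous block

for name, size in budget.items():
    print(f"{name:<34}: {size:>13,} bytes")
print(f"{'sum':<34}: {total:>13,} bytes")
print(f"{'largest contiguous request':<34}: {largest_single_buffer:>13,} bytes")
```

The sum stays below 3 GB, so the critical quantity is the last line: each of those requests has to be satisfied by a single contiguous block.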

Beep6581 commented 9 years ago
Sorry for the late response, I was writing issue2700 ..

By Base RT I mean the memory consumed by RT after opening the 53.4 file, and the 900 figure
was from (faulty) memory; the list at #89 is exact.
But I don't understand how RT consistently crashes with 9274 X 5756 while it has no problem
with 9274X5755, even if I kill some background processes (+100MB available). If windows
vista32 cannot find a 200MB free block in 800MB available then .. I have to kill them
:)

Is there any utility?? for better memory organization?  

Reported by iliasgiarimis on 2015-03-01 00:24:20

Beep6581 commented 9 years ago
Ilias, memory fragmentation is a problem. In the first place we have to avoid it by trying
to allocate big buffers (which, after freeing, lead to free buffers of this size). If
we don't get these big buffers, we can try to get more, smaller buffers (but that's a
first step towards memory fragmentation, because there's no guarantee that
2 buffers of size n will, after freeing, lead to one free buffer of size 2*n).

But I also don't understand why there's such a difference between allocating buffer
for 9274 X 5756 vs. 9274 X 5755

Ingo

Reported by heckflosse@i-weyrich.de on 2015-03-01 00:38:07
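
The point that two freed buffers of size n do not necessarily give one free buffer of size 2*n can be shown with a toy model. The `ToyHeap` class below is a hypothetical first-fit allocator over a flat address range, written only for this illustration; it is not how Windows, glibc or RawTherapee actually manage memory.

```python
class ToyHeap:
    """Hypothetical first-fit allocator over a flat address range (illustration only)."""

    def __init__(self, size):
        self.size = size
        self.blocks = {}  # start address -> length of each live allocation

    def free_holes(self):
        """Return (start, length) for every gap between live allocations."""
        holes, cursor = [], 0
        for start in sorted(self.blocks):
            if start > cursor:
                holes.append((cursor, start - cursor))
            cursor = start + self.blocks[start]
        if cursor < self.size:
            holes.append((cursor, self.size - cursor))
        return holes

    def alloc(self, length):
        """Place the block in the first hole that is big enough; None on failure."""
        for start, hole_len in self.free_holes():
            if hole_len >= length:
                self.blocks[start] = length
                return start
        return None

    def free(self, start):
        del self.blocks[start]


n = 200
heap = ToyHeap(size=2 * n + 10)
a = heap.alloc(n)        # occupies [0, 200)
keep = heap.alloc(10)    # occupies [200, 210): a small allocation that stays alive
b = heap.alloc(n)        # occupies [210, 410)
heap.free(a)
heap.free(b)

total_free = sum(length for _, length in heap.free_holes())
largest_hole = max(length for _, length in heap.free_holes())
big = heap.alloc(2 * n)  # fails: 2*n units are free, but not contiguously
print(total_free, largest_hole, big)
```

In this model 400 units are free in total, yet the largest hole is 200 units, so the request for 400 contiguous units is refused.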

Beep6581 commented 9 years ago
>Ilias, memory fragmentation is a problem.

Could you please give more insight into this?

I cannot believe that there is no 200MB of free pages among 800MB of free memory.
This can be checked by launching RT just after system startup, when the allocations
do not yet add up to a big enough number.

Is it somehow related to allocation mechanism of Glib and not allocation of Windows?

Reported by pinhuer on 2015-03-01 10:09:13

Beep6581 commented 9 years ago
Pinhuer, one example. Have a look at rawimagesource.cc line 482 compress_image():

dcraw holds the decoded data in a block of width*height*3*2 bytes (image). Then RT
allocates width*height*4 bytes, copies the data and frees the block dcraw allocated.

Now we have a free block of width*height*6 bytes. What happens when we need two blocks
of size (width+20)*(height+20)*4 bytes (as in lmmse demosaic)? We get one and leave
a hole of size width*height*6 - (width+20)*(height+20)*4 bytes, which cannot be used
for the next (width+20)*(height+20)*4 bytes buffer in lmmse.

Perhaps there are more of these things in RT.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-03-01 12:43:50
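
Plugging the 9274 x 5756 frame from this thread into the formulas above gives a feel for the size of the hole. This is a minimal sketch with a made-up function name; it assumes no raw crop, so the freed dcraw block and the lmmse buffers share the same width and height.

```python
def hole_after_first_lmmse_buffer(width, height):
    """Sizes in bytes, using the formulas from the comment above."""
    freed = width * height * 6                  # block freed after dcraw's image is copied
    needed = (width + 20) * (height + 20) * 4   # one lmmse buffer
    hole = freed - needed                       # what is left after placing one buffer
    return freed, needed, hole


freed, needed, hole = hole_after_first_lmmse_buffer(9274, 5756)
print(f"freed block : {freed:>12,} bytes")
print(f"one buffer  : {needed:>12,} bytes")
print(f"hole left   : {hole:>12,} bytes")
print("second buffer fits in the hole:", hole >= needed)
```

The leftover hole is roughly half the size of one lmmse buffer, so the second buffer has to be placed somewhere else.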

Beep6581 commented 9 years ago
Hmm, maybe I should remind you of a strange behavior of LMMSE before the 2665_0 patches.

LMMSE was crashing (in edit mode 100% view) with Nikon D800 (7380X4928 = 36368640)
but not on a raw crop of around 38Mp of Olympus E-M5MarkII (total 9280X6938 = 64384640)

Reported by iliasgiarimis on 2015-03-01 13:05:40

Beep6581 commented 9 years ago
Ilias, I'll make a patch which changes order of allocations. Then you can test again.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-03-01 13:56:41

Beep6581 commented 9 years ago
>Now we have a free block of width*height*6 bytes. What happens when we need two blocks
of size (width+20)*(height+20)*4 bytes (as in lmmse demosaic)? We get one and leave
a hole of size width*height*6 - (width+20)*(height+20)*4 bytes, which cannot be used
for the next (width+20)*(height+20)*4 bytes buffer in lmmse.

My question is: what is the size of a "page"? 2*width*height is orders of magnitude bigger
than the size of pages used in x86, so why can't this hole be reused?

Reported by pinhuer on 2015-03-01 14:11:57

Beep6581 commented 9 years ago
Ilias, here's the patch which changes order of allocations. Can you try whether you
can lmmse-demosaic larger files with this patch?

Before patch, memory allocation and free was in this order:

allocate raw_image (ucwidth*ucheight*2 bytes)
allocate image (ucwidth*ucheight*6 bytes)
free raw_image
allocate data (cwidth*cheight*4 bytes)
free image

Now memory allocation and free is in this order:

allocate data (cwidth*cheight*4 bytes)
allocate raw_image (ucwidth*ucheight*2 bytes)
allocate image (ucwidth*ucheight*6 bytes)
free raw_image
free image

where cwidth means cropped width and ucwidth means uncropped width (same for height).
Crop is the crop from camconst.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-03-01 19:34:12
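
The effect of the reordering can be replayed in a toy model. The sketch below assumes a simple first-fit allocator over a flat address range and measures sizes in units of width*height bytes, ignoring the crop (cwidth == ucwidth); the helper names and the heap size of 30 units are made up for illustration, and a real allocator may place blocks differently.

```python
def free_holes(live, heap_size):
    """Gaps (start, length) between live allocations; live maps start -> length."""
    gaps, cursor = [], 0
    for start in sorted(live):
        if start > cursor:
            gaps.append((cursor, start - cursor))
        cursor = start + live[start]
    if cursor < heap_size:
        gaps.append((cursor, heap_size - cursor))
    return gaps


def replay(steps, heap_size=30):
    """Replay ('alloc', name, units) / ('free', name) steps with first-fit placement."""
    live, where = {}, {}
    for step in steps:
        if step[0] == "alloc":
            _, name, units = step
            # First hole that is large enough (raises StopIteration if none fits).
            start = next(s for s, n in free_holes(live, heap_size) if n >= units)
            live[start] = units
            where[name] = start
        else:
            del live[where.pop(step[1])]
    return free_holes(live, heap_size)


before = [("alloc", "raw_image", 2), ("alloc", "image", 6), ("free", "raw_image"),
          ("alloc", "data", 4), ("free", "image")]
after = [("alloc", "data", 4), ("alloc", "raw_image", 2), ("alloc", "image", 6),
         ("free", "raw_image"), ("free", "image")]

print("holes before patch:", replay(before))  # free space split around 'data'
print("holes after patch: ", replay(after))   # one contiguous region above 'data'
```

In this model the old order leaves `data` in the middle of the address range with a hole on each side, while the new order leaves a single contiguous free region for the lmmse buffers.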


Beep6581 commented 9 years ago
Ilias, now it's also clear why your 38Mp cropped raw worked fine and the 36Mp uncropped
one did not: the freed buffers for the cropped raw were big enough to hold the data of
two buffers of size (width+20)*(height+20)*4 bytes, whereas for the uncropped raw
only one buffer of size (width+20)*(height+20)*4 bytes fit into this space.
In case raw_alloc.patch doesn't lead to increased capacity, we can simply increase the
size of the allocations for image and raw_image a bit ;-)

Ingo

Reported by heckflosse@i-weyrich.de on 2015-03-01 20:10:28
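
The cropped versus uncropped case can be checked numerically. This sketch only counts how many lmmse buffers fit into the freed `image` block (ucwidth*ucheight*6 bytes). The D800 and E-M5 Mark II frame sizes are the ones quoted earlier in the thread; the 7560 x 5040 crop (about 38 MP) is an assumed value, since the exact crop is not given.

```python
def lmmse_buffers_that_fit(freed_bytes, width, height):
    """How many (width+20)*(height+20)*4-byte buffers fit in one freed block."""
    return freed_bytes // ((width + 20) * (height + 20) * 4)


# Nikon D800, uncropped: the freed block and the lmmse buffers use the same frame size.
d800 = lmmse_buffers_that_fit(7380 * 4928 * 6, 7380, 4928)

# Olympus E-M5 Mark II: the freed block is sized for the full 9280 x 6938 frame,
# while lmmse works on the crop. The crop dimensions here are an assumption.
em5 = lmmse_buffers_that_fit(9280 * 6938 * 6, 7560, 5040)

print("D800 uncropped:", d800, "buffer(s) fit")
print("E-M5II cropped:", em5, "buffer(s) fit")
```

With these numbers only one buffer fits for the uncropped D800 frame, while two fit for the cropped Olympus frame, which matches the behaviour described above.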

Beep6581 commented 9 years ago
I would appreciate some clarification on why, in the 21st century, a multi-megabyte memory
hole is so dramatic. I "was told" that processors manage virtual memory with a tens-of-KB
grain, so what is happening here?

Reported by pinhuer on 2015-03-01 21:19:02

Beep6581 commented 9 years ago
Because we allocate large blocks of contiguous memory. Have a look here for some good
explanations: http://stackoverflow.com/questions/3770457/what-is-memory-fragmentation

Reported by heckflosse@i-weyrich.de on 2015-03-01 21:32:31

Beep6581 commented 9 years ago
issue2665_15.patch from #82 committed to revision 12f923055c82

Reported by heckflosse@i-weyrich.de on 2015-03-01 22:56:17