Beep6581 / RawTherapee

A powerful cross-platform raw photo processing program
https://rawtherapee.com
GNU General Public License v3.0

Speedup for LMMSE demosaic #2648

Closed Beep6581 closed 9 years ago

Beep6581 commented 9 years ago

Originally reported on Google Code with ID 2665

I opened this issue to get a new issue number before starting work on it.
As usual for demosaic speedups, I'll make a series of small patches.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-11 21:06:05

Beep6581 commented 9 years ago
Here's a first patch. Before the patch, processing time for LMMSE demosaic of a D800 file (Cityhall) on
my 8-core machine was 3426 ms.
The patch adds 4 omp pragmas, which reduce the processing time to 3107 ms on my system.
Not much (only 9%), but the key to optimizing LMMSE seems to be changing the layout of data in
memory. I'll continue tomorrow...

No changes to output with this patch.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-11 22:54:01
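
The first patch parallelises existing loops with OpenMP. As a minimal sketch of that technique (the filter and the name `horizontalAverage` are hypothetical, not the loops the patch actually touched), a loop whose iterations are independent can be split across threads with `#pragma omp parallel for`. When compiled without OpenMP support, the pragma is skipped and the loop runs serially with the same output.

```cpp
#include <cassert>
#include <vector>

// Hypothetical row-wise 3-tap average. Each row only reads its own input row
// and writes its own output row, so rows can safely run on different threads.
inline void horizontalAverage(const std::vector<float>& src, std::vector<float>& dst,
                              int width, int height) {
    assert(width > 0 && height >= 0);
    assert(static_cast<int>(src.size()) == width * height);
    dst.assign(src.size(), 0.f);
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (int row = 0; row < height; ++row) {
        const float* in = &src[row * width];
        float* out = &dst[row * width];
        out[0] = in[0];                      // copy the edge pixels unchanged
        out[width - 1] = in[width - 1];
        for (int col = 1; col < width - 1; ++col) {
            out[col] = (in[col - 1] + in[col] + in[col + 1]) / 3.f;
        }
    }
}
```

Build with `-fopenmp` (GCC/Clang) to actually run the rows in parallel.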


Beep6581 commented 9 years ago
Next one. Processing time is now about 2800 ms, and memory usage is reduced by 8*width*height
bytes.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-12 18:17:39


Beep6581 commented 9 years ago
Next one: a quick-and-dirty change to the layout of data in memory. Processing time is now about
1650 ms. No differences in output.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-13 23:52:26
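
The comment does not spell out the new layout. One common form of such a change is moving from an interleaved (array-of-structures) layout to a planar (structure-of-arrays) layout, so that a pass over one component reads memory contiguously. The sketch below shows that conversion purely as an illustration; `interleavedToPlanar` is a hypothetical name, and the actual patch may organise its buffers differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Interleaved layout: component c of pixel i lives at data[i * channels + c],
// so a pass over one component strides by `channels` floats.
// Planar layout: component c of pixel i lives at planes[c * pixels + i],
// so a pass over one component touches consecutive floats (cache- and SIMD-friendly).
inline std::vector<float> interleavedToPlanar(const std::vector<float>& data,
                                              std::size_t pixels, std::size_t channels) {
    assert(data.size() == pixels * channels);
    std::vector<float> planes(data.size());
    for (std::size_t c = 0; c < channels; ++c) {
        for (std::size_t i = 0; i < pixels; ++i) {
            planes[c * pixels + i] = data[i * channels + c];
        }
    }
    return planes;
}
```

For example, two components stored as `{a0, b0, a1, b1, a2, b2}` become `{a0, a1, a2, b0, b1, b2}`.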


Beep6581 commented 9 years ago
Results on my two machines:
      5820k (x6)   Phenom2-955 (x4)
Org   1250 ms      3560 ms
P0    1010 ms      3380 ms
P1     980 ms      3280 ms
P2     620 ms      2080 ms

No differences in output either; that quick-and-dirty change in patch 2 made quite a difference...
:)

/Reine

Really nice work!

Reported by reine.edvardsson on 2015-02-14 09:55:40

Beep6581 commented 9 years ago
With my C2D 3.0 GHz, 4 GB (3 GB for programs), Windows Vista 32-bit; median of 6 measurements on a 24 Mp file.

Times in ms; the values in parentheses are the time for "Lee refinement".

enh.steps no.patch      patch_02
  0       2660          1562
  1       2660          1685
  2       4620          2790
  3       6671          4084  
  4       8576          5304
  5      10050 (1485)   6915 (1510)
  6      11195 (2600)   7947 (2640)  

Memory consumption is reduced (peak total RT 1452 MB vs 1261 MB, peak total system 2300 MB
vs 2100 MB), but my machine still crashes immediately on big files. The largest file I
could render was 28 Mp, while it crashes on 36 Mp. I will try some intermediate sizes to see
if there is any improvement. These totals are for the editor, so count around 500 MB less
for the queue.

I don't think I'm out of memory: RT also crashes in queue mode, although the peak system
memory consumption is lower than 3.0 GB.

Reported by iliasgiarimis on 2015-02-14 13:32:50

Beep6581 commented 9 years ago
Reine, Ilias, thanks for testing :-)

Ilias, actually lmmse allocates two buffers. One of them is really big: (width+20)*(height+20)*6*4
bytes. Next patch will include a change that allocates 6 buffers of (width+20)*(height+20)*4
bytes in case the allocation of the big buffer fails.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-14 14:34:43
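
The fallback described above can be sketched as: try one contiguous block for all six planes first, and only if that fails, allocate six independent planes. This is a minimal sketch under stated assumptions: `PlaneSet`, `planeElems`, `allocatePlanes`, and `freePlanes` are hypothetical names, `std::malloc` stands in for whatever allocator RawTherapee actually uses, and the 10-pixel border per side matches the (width+20)*(height+20) size quoted above. The likely benefit on 32-bit systems is that six smaller blocks are easier to place in a fragmented address space than one block six times their size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Six float planes, either carved from one block or allocated separately.
struct PlaneSet {
    float* plane[6] = {nullptr, nullptr, nullptr, nullptr, nullptr, nullptr};
    bool singleBlock = false;   // true if plane[0] owns all six planes
};

// Elements per plane: (width + 2*border) * (height + 2*border).
inline std::size_t planeElems(int width, int height, int border = 10) {
    return static_cast<std::size_t>(width + 2 * border) * (height + 2 * border);
}

inline void freePlanes(PlaneSet& s) {
    if (s.singleBlock) {
        std::free(s.plane[0]);
    } else {
        for (float*& p : s.plane) std::free(p);   // free(nullptr) is a no-op
    }
    for (float*& p : s.plane) p = nullptr;
    s.singleBlock = false;
}

// Expects an empty PlaneSet. Returns false only if both strategies fail.
inline bool allocatePlanes(PlaneSet& s, int width, int height) {
    assert(width > 0 && height > 0);
    const std::size_t n = planeElems(width, height);
    // First attempt: one contiguous block of 6 * n floats (6 * 4 bytes per element).
    if (float* block = static_cast<float*>(std::malloc(6 * n * sizeof(float)))) {
        for (int i = 0; i < 6; ++i) s.plane[i] = block + i * n;
        s.singleBlock = true;
        return true;
    }
    std::fprintf(stderr, "big block allocation failed, trying 6 smaller blocks\n");
    for (int i = 0; i < 6; ++i) {
        s.plane[i] = static_cast<float*>(std::malloc(n * sizeof(float)));
        if (!s.plane[i]) { freePlanes(s); return false; }
    }
    return true;
}
```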

Beep6581 commented 9 years ago

Reported by heckflosse@i-weyrich.de on 2015-02-14 14:34:50

Beep6581 commented 9 years ago
Here's the patch with the changes mentioned in #6. It's also a bit faster than the last one.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-14 15:09:06


Beep6581 commented 9 years ago
Ingo, thanks

On my machine, the new v3 version compared to v2 is:
- a bit slower at basic LMMSE (enh.step 0), by about 2%, although this could be statistical
error.
- a bit faster at enh.step 1, by around 2%; it looks like applying the gamma got a lot
faster (by 50%) :)
- each median step got 10-20% faster, making enh.steps 2-3-4 4%-8% faster.
- a small speed increase in Lee refinement.

- The crash now happens even earlier: I can no longer process 28 Mp files, which were no problem
for the previous versions. It's a sudden crash without any message, immediately when I choose LMMSE,
at all enh.steps :(

Reported by iliasgiarimis on 2015-02-14 19:44:39

Beep6581 commented 9 years ago
EDIT: I was wrong about the crashes with v2; I have the same problem at the same
Mp limits as with v3.

Unpatched LMMSE works fine with 28 Mp files.

Reported by iliasgiarimis on 2015-02-14 20:00:59

Beep6581 commented 9 years ago
Ilias, please post a link to a file that crashes. Perhaps there's an error in the code...

And please have a look at the console output. If the big block couldn't be allocated, there
should be a message. By the way, the number of enh.steps has no influence on the peak memory
usage of lmmse.
The speedup in the last patch was only in the median steps.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-14 20:28:07

Beep6581 commented 9 years ago
Ilias, there's a bug in my tricky memory addressing. I'll post a new patch when I've fixed
it.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-14 21:26:09

Beep6581 commented 9 years ago
No message in the console; see the copy below:

Samsung NX1.badpixels not found
Preprocessing: 2200676 usec
Demosaic Bayer image using method: fast
Demosaicing Bayer data: fast - 691236 usec
Applying white balance, color correction & sRBG conversion...
setscale before lock
setscale starts (649, 433)
setscale ends
setscale ends2
ImProcCoordinator / Auto CT:  indi=1   satH=0  satPR=0
setcropsizes before lock
setsizes starts (649, 433, -1, -1, 649, 433)
setsizes ends
setcropsizes before lock
setsizes starts (1082, 722, 649, 433, 1082, 722)
setsizes ends
Samsung NX1.badpixels not found
Preprocessing: 2056553 usec
Demosaic Bayer image using method: lmmse

Nothing more. 

Here are the raw links to two of the largest RAW samples (64 Mp, 80 Mp) you can try (you can find
more, like the Nikon D810, in the model list instead of Phase One):
http://www.dpreview.com/reviews/image-comparison/fullscreen?attr18=daylight&attr13_0=oly_em5ii&attr13_1=phaseone_iq180&attr15_0=raw&attr15_1=raw&attr16_0=200&attr16_1=35&attr126_0=highres&normalization=full&widget=194&x=0.00147034251&y=-0.00405094028

I am now starting to look for the no-crash limit by changing the raw crop for the Olympus
E-M5MarkII.

camconst data:

    { // Quality X, experimental, new model with 16Mp and 64Mp raw frames
        "make_model": "OLYMPUS E-M5MarkII",
        "dcraw_matrix": [ 8461,-2320,-573,-3319,10974,2699,-1259,2049,5838 ], // D65, built on Dpreview P2050161a.DNG studio shot with X-Rite's ColorChecker Passport utility
        // "dcraw_matrix": [ 8380,-2630,-639,-2887,10725,2496,-627,1427,5438 ], // copy from E-M5 D65
        "raw_crop": [ 0, 0, -8, -8 ], // largest valid, full 64Mp 9280x6938, official crop 0 0 9216 6912
        "ranges": {
            "white": [
                { "iso": [ 100, 200 ], "levels": 3956 }, // normal 4080-4095, HR Dpreview 4047, IR 3956
                { "iso": [ 400, 800, 1600, 3200 ], "levels": 4070 }, // 4070-4095
                { "iso": [ 6400, 12800, 25600 ], "levels": 4040 } // 4000-4095
            ]
        }
    },

Change it to "raw_crop": [ 0, 0, 6000, 5000 ] for 30 Mp, or any size you like up to 9280x6938.

Reported by iliasgiarimis on 2015-02-14 21:38:45

Beep6581 commented 9 years ago
Ilias, please hold off on your tests until I've posted a new patch. The crashes are caused by
out-of-bounds buffer accesses (independent of image size).

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-14 21:48:17

Beep6581 commented 9 years ago
This patch should fix the crashes caused by my buggy calculation of the start addresses of
two buffers (image[1] and image[2]). At least valgrind no longer reports invalid reads
and writes, which it did with patch 2 and patch 3.

The next patch will include a small speedup for the 'Lee refinement'.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-14 22:47:47


Beep6581 commented 9 years ago
Ohh, I had closed Mozilla to run the tests...

So far I have only tested with the unpatched RT :)

I can go up to 37.5 Mp (7500x5000) with the Olympus file, but RT crashed with the D810's 36 Mp files.
So I also suspected a problem with the bounds, I mean the sensor's bounds. I just decreased the
D810's frame a bit and then I could use LMMSE.
So the problem existed before the recent patches, and the crash is as harsh as
when we had problems with the frame in camconst.json being larger than in dcraw.cc.
Is LMMSE somehow attempting to add borders?

As for the crash at >37.5 Mp, the peak total memory was around 2.8 GB.

I will continue testing with the new patch, thanks :)

Reported by iliasgiarimis on 2015-02-14 23:22:34

Beep6581 commented 9 years ago
Ilias, I'm absolutely sure that this Issue will result in a good speedup and less memory
usage for lmmse, even though I introduce bugs with some patches ;-)

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-14 23:38:10

Beep6581 commented 9 years ago
Just as an example of how error-prone patches can be:

Patch 02 had this wrong calculation:

image[1] = imageBuffer + ((height+1)/2)*(width+1)/2;

Patch 04 has this correct calculation:

image[1] = imageBuffer + ((height+1)/2)*((width+1)/2);

:-)

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-14 23:58:00
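
Why the patch 02 expression was wrong: `*` and `/` have equal precedence and associate left to right, so `((height+1)/2)*(width+1)/2` is evaluated as `(((height+1)/2)*(width+1))/2`. With integer division this matches the intended `((height+1)/2)*((width+1)/2)` only when `width` is odd; for an even width the result is too large by `((height+1)/2)/2` elements. Assuming the planes are packed back to back in `imageBuffer`, later planes then start too late and run past the end of the buffer, which fits the out-of-bounds accesses valgrind reported. The helper names below are hypothetical:

```cpp
#include <cassert>
#include <cstddef>

// Patch 02 form: evaluated as (((height+1)/2) * (width+1)) / 2.
inline std::size_t offsetPatch02(std::size_t width, std::size_t height) {
    return ((height + 1) / 2) * (width + 1) / 2;
}

// Patch 04 form: both halves are truncated before multiplying, giving
// ceil(height/2) * ceil(width/2), the size of a half-resolution plane.
inline std::size_t offsetPatch04(std::size_t width, std::size_t height) {
    return ((height + 1) / 2) * ((width + 1) / 2);
}
```

For a 4x4 image the patch 02 form gives 5 instead of 4; for any odd width the two forms agree, which is why a bug like this can slip through testing on some files.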

Beep6581 commented 9 years ago
I would bet that there is no error there :)
The speedup is already significant!! And the capacity is a lot larger: I can now demosaic
up to 45 Mp.

I tested patch 4 on a 24 Mp file; results are medians of six measurements:

enh.steps no.patch      patch_04
  0       2660          1767
  1       2660          1827
  2       4620          2800
  3       6671          3983  
  4       8576          5085
  5      10050 (1485)   6633 (1470)
  6      11195 (2600)   7769 (2610)

A bit slower than patch_3. Total system memory consumption with 44 Mp reaches 2.6 GB
(only 2.1 GB in queue mode), but RT still crashes on 45 Mp :(
I know you will have a look :)

Reported by iliasgiarimis on 2015-02-15 01:57:42

Beep6581 commented 9 years ago
This patch includes the small speedup for "Lee refinement" mentioned in #15. It also needs
width*height*12 bytes less memory in "Lee refinement", but that should have no influence
on the peak memory usage of lmmse.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-15 15:12:00


Beep6581 commented 9 years ago
If 1.7-2X faster (!!) is a "small speedup", then what should we expect in the end?? :)

enh.steps no.patch      patch_04      patch_05
  0       2660          1767
  1       2660          1827
  2       4620          2800
  3       6671          3983
  4       8576          5085
  5      10050 (1485)   6633 (1470)   5873 (738)
  6      11195 (2600)   7769 (2610)   6662 (1552)

Reported by iliasgiarimis on 2015-02-15 18:30:37

Beep6581 commented 9 years ago
Ilias, I don't know what to expect at the end of this issue. I just started ;-)

Reported by heckflosse@i-weyrich.de on 2015-02-15 19:10:41

Beep6581 commented 9 years ago
I would like to commit patch 05 before I continue optimizing lmmse. Any objections?

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-16 11:22:45

Beep6581 commented 9 years ago
No objections from me.

In fact, it is safer to follow up the already committed issue2647 with the updated LMMSE,
as the default is now to use the pp3's demosaicing method when opening a file. As LMMSE5 is
faster and has a larger capacity for big files, it must be committed.
There still remains a slight possibility of unexpected crashes when one transfers big
raws (>44Mp) from 64-bit to 32-bit machines...

Reported by iliasgiarimis on 2015-02-16 12:19:54

Beep6581 commented 9 years ago
Committed to revision 3c75597e2f9d
Issue stays open for further improvements.

Reported by heckflosse@i-weyrich.de on 2015-02-16 13:10:43

Beep6581 commented 9 years ago
Today there's only a small lmmse speedup, less than 10% faster than the previous one. But
as I already said in #0, I'll make small steps, so here's one of those small steps
;-) Still wip...

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-16 23:12:29


Beep6581 commented 9 years ago
Next small step. Additional speedup for the median pass.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-17 13:12:28


Beep6581 commented 9 years ago
No noticeable speedup this time, but peak memory usage is reduced by (width+20)*(height+20)*4
bytes. Still wip...

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-17 22:40:36


Beep6581 commented 9 years ago
Small speedup (about 5% to 10% faster than issue2665_08.patch) and another reduction
of peak memory usage by width*height*4 bytes.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-18 14:41:20


Beep6581 commented 9 years ago
Looks good
      5820k           Phenom2 955
      2st.    6 st.   2st.    6st.
Org   1250            3560
P0    1010            3380
P1     980            3280
P2     620            2080
P6     560    1135    1925     3580
P7     590    1170    1900     3515
P8     590    1190    1920     3540
P9     550    1150    1880     3500

One interesting thing is that p7 and p8 were a tad slower on my Intel, but on the
AMD p7 was faster than p6... interesting differences :). But they are quite small, so
nothing to write home about.
I checked the output between all runs (it took some time, as I checked all steps from 1
to 6 on all patches...). Looks good!

/Reine

Reported by reine.edvardsson on 2015-02-18 21:26:07

Beep6581 commented 9 years ago
Reine, thanks for testing :-)

The main goal of the last patches was to reduce peak memory usage for systems that are
low on memory, though we also got a speedup by doing this.

Further speedups need tiled processing (as an alternative to tiled processing, striped
processing is also possible in some parts of the code) and SSE code (in that order).

The problem with tiled processing is the large border (10 pixels on each side) that lmmse
currently uses. In tiled processing this leads to a big overhead from overlapping tiles
when we choose a small tile size. The border can be reduced by at least 2 pixels without
affecting the output, and even that will have an impact on the processing time in tiled
mode.

The refinement step can also be sped up by overlapping striped processing (that's
the first thing I'll try).

The problem with SSE code is that many loops increment by two columns, which is not
optimal for SSE code.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-18 22:59:27
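
Rough arithmetic behind the tile-overhead concern, assuming square tiles of `tileSize` output pixels where each tile recomputes its full border on all four sides (`tileOverhead` is a hypothetical name, and real tiling code may share or trim borders differently):

```cpp
#include <cassert>

// Ratio of pixels processed to pixels produced for one square tile:
// (tileSize + 2*border)^2 / tileSize^2.
inline double tileOverhead(int tileSize, int border) {
    assert(tileSize > 0 && border >= 0);
    const double padded = tileSize + 2.0 * border;
    return (padded * padded) / (static_cast<double>(tileSize) * tileSize);
}
```

With the 10-pixel border, 64x64 tiles process about 1.72 times as many pixels as they output; trimming the border by 2 pixels brings that to about 1.56, while 512x512 tiles would be at about 1.08. The tile sizes are only examples; larger tiles cut the overlap but give up some of the cache benefit that motivates tiling.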

Beep6581 commented 9 years ago
I added SSE-code for Lee refinement.

Reported by heckflosse@i-weyrich.de on 2015-02-19 15:50:05


Beep6581 commented 9 years ago
I added SSE code for one part of the median step. We can also use this vectorized median
in other places where med3x3 is used.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-20 00:20:31
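
A 3x3 median can be computed with a fixed sequence of compare-exchange steps that use only min and max, which is what makes it easy to vectorize: the same sequence applied to SSE registers (for example with `_mm_min_ps` / `_mm_max_ps`) yields four medians at once. The sketch below is the widely published 19-step median-of-9 network in scalar form; it is illustrative and not necessarily the network used in RawTherapee's `med3x3`.

```cpp
#include <algorithm>
#include <cassert>

// Compare-exchange using only min/max: afterwards a <= b.
inline void sort2(float& a, float& b) {
    const float lo = std::min(a, b);
    b = std::max(a, b);
    a = lo;
}

// Median of 9 values via a fixed 19-step network. Modifies p.
// Steps 1-9 sort each triple (p0..p2, p3..p5, p6..p8); the rest takes the
// median of {max of minima, median of medians, min of maxima}.
inline float median9(float p[9]) {
    sort2(p[1], p[2]); sort2(p[4], p[5]); sort2(p[7], p[8]);
    sort2(p[0], p[1]); sort2(p[3], p[4]); sort2(p[6], p[7]);
    sort2(p[1], p[2]); sort2(p[4], p[5]); sort2(p[7], p[8]);
    sort2(p[0], p[3]); sort2(p[5], p[8]); sort2(p[4], p[7]);
    sort2(p[3], p[6]); sort2(p[1], p[4]); sort2(p[2], p[5]);
    sort2(p[4], p[7]); sort2(p[4], p[2]); sort2(p[6], p[4]);
    sort2(p[4], p[2]);
    return p[4];
}
```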


Beep6581 commented 9 years ago
Further SSE speedups will follow soon.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-20 00:52:36

Beep6581 commented 9 years ago
On my Intel Core 2 Duo, 4 GB, Windows Vista 32-bit:
Very small speed improvements with patches 07-09 (around 1% for each patch), and a great
speed improvement for refinement with patch10!! Some results for patch10 are possibly
affected by something else running in the background :(

Very small improvement in Mpixel "capacity" with patch08, and a very significant improvement
with patch09, although both reduce memory by almost the same width*height*4 bytes.
RT still crashes with no message when the Mpixel count is over the limit. The no-crash limit
is not affected by using the queue (500 MB less memory consumption by RT vs edit mode)!!

         no.patch   patch_04   patch_05   patch07     patch08    patch09    patch10
capacity 37.5Mp     44.5Mp                44.5Mp      45.3Mp     53.3Mp     53.3Mp

step0    2660       1767                  1722        1714       1755       1802
step1    2660       1827                  1813 med    1772 med   1779 med   1801 med
step2    4620       2800                  2790(0985)  2729(0952) 2726(0943) 2781(0945)
step3    6671       3983                  3916(2111)  3850(2066) 3832(2064) 3907(2076)
step4    8576       5085                  5009(3203)  4934(3160) 4922(3153) 5005(3165)
               ref        ref       ref        ref         ref        ref        ref
step5   10050(1485) 6633(1470) 5873( 738) 5765( 734)  5755( 768) 5639( 719) 5475( 484)
step6   11195(2600) 7769(2610) 6662(1552) 6625(1508)  6468(1493) 6409(1498) 6091(1040)

Reported by iliasgiarimis on 2015-02-20 11:01:49

Beep6581 commented 9 years ago
Around 2.5-2.8X speed improvement for median passes with patch11 vs patch10!!

         no.patch    patch_04    patch_05    patch07     patch08     patch09     patch10     patch11
capacity 37.5Mp      44.5Mp                  44.5Mp      45.3Mp      53.3Mp      53.3Mp      53.3Mp

step0    2660        1767                    1722        1714        1755        1802        1766
step1    2660        1827                    1813 med    1772 med    1779 med    1801 med    1781 med
step2    4620        2800                    2790(0985)  2729(0952)  2726(0943)  2781(0945)  2158( 384)
step3    6671        3983                    3916(2111)  3850(2066)  3832(2064)  3907(2076)  2590( 763)
step4    8576        5085                    5009(3203)  4934(3160)  4922(3153)  5005(3165)  2935(1156)
         ref         ref         ref         ref         ref         ref         ref         ref
step5    10050(1485) 6633(1470)  5873( 738)  5765( 734)  5755( 768)  5639( 719)  5475( 484)  3426( 484)
step6    11195(2600) 7769(2610)  6662(1552)  6625(1508)  6468(1493)  6409(1498)  6091(1040)  3989(1060)

Reported by iliasgiarimis on 2015-02-20 15:21:39

Beep6581 commented 9 years ago
Good work! :)
Forgot to mention: I have Linux Mint x64 on both my machines. I am also running 1 to
6 steps and comparing, for the sake of it, but I am too lazy to fill in all the columns
of times :).
      5820k           Phenom2 955
      2st.    6st.    2st.     6st.
Org   1250            3560
P0    1010            3380
P1     980            3280
P2     620            2080
P6     560    1135    1925     3580
P7     590    1170    1900     3515
P8     590    1190    1920     3540
P9     550    1150    1880     3500
P10    540     970    1835     2960
P11    460     625    1680     2510

/Reine

Reported by reine.edvardsson on 2015-02-20 18:33:22

Beep6581 commented 9 years ago
Ilias, Reine, thanks for testing. I added SSE code for another loop.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-20 20:46:24


Beep6581 commented 9 years ago
Around 1.10X speed improvement for basic LMMSE with patch12 vs patch11 :) and a faster
2nd pass of Lee refinement: now 2 passes take 2X the time of 1 pass, while it used to be
>2X with patches 10-11.

         no.patch    patch_04    patch_05    patch07     patch08     patch09     patch10     patch11     patch12
capacity 37.5Mp      44.5Mp                  44.5Mp      45.3Mp      53.3Mp      53.3Mp      53.3Mp      53.3Mp

step0    2660        1767                    1722        1714        1755        1802        1766        1581
step1    2660        1827                    1813 med    1772 med    1779 med    1801 med    1781 med    1649 med
step2    4620        2800                    2790(0985)  2729(0952)  2726(0943)  2781(0945)  2158( 384)  1998( 384)
step3    6671        3983                    3916(2111)  3850(2066)  3832(2064)  3907(2076)  2590( 763)  2363( 765)
step4    8576        5085                    5009(3203)  4934(3160)  4922(3153)  5005(3165)  2935(1156)  2765(1148)
         ref         ref         ref         ref         ref         ref         ref         ref         ref
step5    10050(1485) 6633(1470)  5873( 738)  5765( 734)  5755( 768)  5639( 719)  5475( 484)  3426( 484)  3269( 484)
step6    11195(2600) 7769(2610)  6662(1552)  6625(1508)  6468(1493)  6409(1498)  6091(1040)  3989(1060)  3720( 973)

Reported by iliasgiarimis on 2015-02-21 11:11:02

Beep6581 commented 9 years ago
Ilias, thanks for testing.

Here's the patch I would like to commit (after removing the Stopwatches). I would also
like to close the issue with this patch.

I cleaned up the code, made another very, very small speedup and introduced two SSE4.1
intrinsics for users of native x64 builds (in case the CPU supports SSE4.1). One
of the SSE4.1 changes also has some influence on the speed of other parts of RT, but I
didn't benchmark those cases.
Thanks to Reine for helping me find the (hopefully) correct way to include the SSE
header files on Linux.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-21 21:12:58
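
A common pattern for SSE4.1 code that must still build everywhere is to include `<smmintrin.h>` (the SSE4.1 intrinsics header) only when the compiler defines `__SSE4_1__`, as GCC and Clang do with `-msse4.1` or with `-march=native` on a CPU that has it, and to keep a scalar fallback otherwise. This is a hedged sketch: `floor4` is a hypothetical function and `_mm_floor_ps` is just one SSE4.1 intrinsic chosen for illustration; the comment does not say which two intrinsics the patch introduced.

```cpp
#include <cassert>
#include <cmath>
#if defined(__SSE4_1__)
#include <smmintrin.h>   // SSE4.1 intrinsics; also pulls in the SSE/SSE2 headers
#endif

// Floors four floats in place, using SSE4.1 when the build targets it.
inline void floor4(float* v) {
#if defined(__SSE4_1__)
    const __m128 x = _mm_loadu_ps(v);
    _mm_storeu_ps(v, _mm_floor_ps(x));
#else
    for (int i = 0; i < 4; ++i) v[i] = std::floor(v[i]);   // scalar fallback
#endif
}
```

Both paths give identical results, so a generic build and a native build can share the same call sites.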


Beep6581 commented 9 years ago
Linux Mint x64 on both machines, tested with 36MP D800 image.
      5820k           Phenom2 955
      2st.    6st.    2st.     6st.
Org   1250    2860    3560     8050
P0    1010            3380
P1     980            3280
P2     620            2080
P6     560    1135    1925     3580
P7     590    1170    1900     3515
P8     590    1190    1920     3540
P9     550    1150    1880     3500
P10    540     970    1835     2960
P11    460     645    1680     2510
P12    320     600    1550     2430
P13    300     580    1470     2300

There are differences in the image for p12 and p13 (for p13, the difference from p12
shows up only on my Intel, so it's related to AVX optimizations, I guess), but only scattered
pixels according to ImageMagick compare. I tried to actually see a difference by looking at the
two images, but there wasn't anything I could see... Most likely the differences
are one or two steps in the 16-bit TIFF, and my guess is rounding differences
(just a guess though :) ).
Speedup on the Intel machine: 4-5 times
Speedup on the AMD machine: 2-4 times

Fantastic work Ingo!

/Reine

Reported by reine.edvardsson on 2015-02-21 22:25:45

Beep6581 commented 9 years ago
Reine, thanks for testing and for your help with the SSE includes on Linux. I forgot to
mention that with patch 13 I also enabled FMA (not AVX) in one part of the sleef library
for machines with the FMA feature. That can lead to very small differences between P12
and P13, because FMA has slightly higher precision (one rounding step fewer) for these
d = a+b*c operations.
The difference between P11 and P12 is caused by changing some a = b - a to a -= b, which
normally isn't correct, but in this case the results go into a SQR, so it doesn't really
matter.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-21 22:49:26
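
The precision effect of FMA can be shown with the standard library's `std::fma`, which computes `x*y + z` with a single rounding. Here it stands in for the hardware FMA path in sleef, and the helper names are hypothetical. With b = 1 + 2^-27, the exact product b*b = 1 + 2^-26 + 2^-54 loses its last term when rounded to double, so the two-rounding version returns 0 while the fused version keeps 2^-54:

```cpp
#include <cassert>
#include <cmath>

// d = a + b*c with two roundings. The volatile temporary forces the product
// to be rounded to double first and stops the compiler from fusing it itself.
inline double mulAddTwoRoundings(double a, double b, double c) {
    volatile double product = b * c;   // rounding step 1
    return a + product;                // rounding step 2
}

// d = a + b*c with a single rounding of the exact result.
inline double mulAddFused(double a, double b, double c) {
    return std::fma(b, c, a);
}
```

Differences of this size, showing up only on machines with FMA, match the scattered one-step pixel differences Reine saw on the Intel box but not on the Phenom.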

Beep6581 commented 9 years ago
FMA explains it, as the old AMD does not support that :)
Thanks for the info!
/Reine

Reported by reine.edvardsson on 2015-02-21 23:13:59

Beep6581 commented 9 years ago
HDR DNG (*that* one), i7 CPU Q 820 @ 1.73GHz, GCC-4.9.2

Patch 00:
lmmse_interpolate_omp took 648 ms
lmmse_interpolate_omp took 578 ms
lmmse_interpolate_omp took 565 ms
lmmse_interpolate_omp took 538 ms

Patch 13:
median pass took 56 ms
lmmse_interpolate_omp took 380 ms
median pass took 56 ms
lmmse_interpolate_omp took 377 ms
median pass took 57 ms
lmmse_interpolate_omp took 399 ms

No differences in output. Green light for commit and thank you :)

Reported by entertheyoni on 2015-02-22 01:11:56

Beep6581 commented 9 years ago
Ingo, thanks

No objection to committing v13, although 32-bit Windows machines still crash on large files
(>53 Mp). And it looks like it is not exactly a lack of memory, because RT crashes on the
same 54 Mpixel files both in edit and in queue mode (around 700 MB less memory consumption
in queue mode).

I will test V13 tomorrow.

Reported by iliasgiarimis on 2015-02-22 01:15:30

Beep6581 commented 9 years ago
DrSLony, thanks for testing!

Ilias, though I don't expect 'no crash' with large files on Win32, I'll hold off on the commit
until you've tested V13 ;-)

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-22 01:27:58

Beep6581 commented 9 years ago
Ilias, we should think about the 'crashes'. Avoiding the crashes is no problem, but
I don't know how to communicate the 'avoided crash (out of memory)' to the user...

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-22 01:39:38

Beep6581 commented 9 years ago
No speed changes with patch13 vs patch12. The "no-crash" limit also remains the same,
at 53.3 Mp.

Reported by iliasgiarimis on 2015-02-22 11:05:49

Beep6581 commented 9 years ago
Ilias, it should be possible to reduce peak memory usage by another width*height*8 bytes.
I'll try that before commit.

Ingo

Reported by heckflosse@i-weyrich.de on 2015-02-22 11:59:50

Beep6581 commented 9 years ago
Ingo, what I don't understand is why there is no difference in the "no-crash
limit" between
- edit mode, where RT uses 1.15 GB steadily and climbs to 2.22 GB with LMMSE, and
- queue mode, where RT uses 35 MB and climbs to 1.55 GB with LMMSE.

Reported by iliasgiarimis on 2015-02-22 12:29:42