Closed Beep6581 closed 9 years ago
Here's a first patch. Processing time for the LMMSE demosaic of a D800 file (Cityhall) on
my 8-core machine was 3426 ms before the patch.
The patch adds 4 omp pragmas, which reduce the processing time to 3107 ms on my system.
Not much (only 9%), but the key to optimizing LMMSE seems to be changing the layout of data in
memory. I'll continue tomorrow...
No changes to output with this patch.
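For readers who have not used OpenMP, this is roughly what adding an omp pragma to a row loop looks like. It is only a minimal sketch: `scaleRows` and its parameters are hypothetical and not taken from the patch. When the compiler is run without OpenMP support the pragma is skipped and the loop runs serially.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-row operation standing in for one loop of the demosaic.
// The rows do not depend on each other, so the outer loop can be split
// between threads.
void scaleRows(std::vector<float>& plane, int width, int height, float factor)
{
#ifdef _OPENMP
    #pragma omp parallel for
#endif
    for (int row = 0; row < height; ++row) {
        float* line = plane.data() + static_cast<std::size_t>(row) * width;
        for (int col = 0; col < width; ++col) {
            line[col] *= factor;
        }
    }
}
```

The result is the same with and without threads because every row is written by exactly one iteration.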
Ingo
Reported by heckflosse@i-weyrich.de
on 2015-02-11 22:54:01
Next one. Processing time now is about 2800 ms. Memory usage reduced by 8*width*height
bytes.
Ingo
Reported by heckflosse@i-weyrich.de
on 2015-02-12 18:17:39
Next one. A quick-and-dirty change to the layout of data in memory. Processing time is now about
1650 ms. No differences in output.
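To illustrate what a change of memory layout can mean, here is a sketch that converts an interleaved buffer (6 floats per pixel) into 6 contiguous planes. That the patch does exactly this is an assumption; the thread only says that the layout changed and that the buffer holds 6 float values per pixel. All function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Interleaved layout: the 6 values of one pixel sit next to each other.
inline std::size_t interleavedIndex(std::size_t pixel, std::size_t channel)
{
    return pixel * 6 + channel;
}

// Planar layout: each of the 6 channels is one contiguous plane.
inline std::size_t planarIndex(std::size_t pixel, std::size_t channel, std::size_t pixelCount)
{
    return channel * pixelCount + pixel;
}

// Copy an interleaved buffer into a planar one of the same size.
std::vector<float> toPlanar(const std::vector<float>& interleaved)
{
    const std::size_t pixelCount = interleaved.size() / 6;
    std::vector<float> planar(interleaved.size());
    for (std::size_t p = 0; p < pixelCount; ++p) {
        for (std::size_t c = 0; c < 6; ++c) {
            planar[planarIndex(p, c, pixelCount)] = interleaved[interleavedIndex(p, c)];
        }
    }
    return planar;
}
```

A loop that touches only one channel then reads consecutive addresses in the planar layout, which is friendlier to the cache and to vector instructions.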
Ingo
Reported by heckflosse@i-weyrich.de
on 2015-02-13 23:52:26
Results on my two machines:
      5820k (x6)   Phenom2-955 (x4)
Org   1250 ms      3560 ms
P0    1010 ms      3380 ms
P1     980 ms      3280 ms
P2     620 ms      2080 ms
No differences in output here either; that quick and dirty change in patch 2 made quite a difference...
:)
/Reine
Really nice work!
Reported by reine.edvardsson
on 2015-02-14 09:55:40
On my C2D 3.0GHz, 4GB (3GB for programs), Win Vista 32-bit; medians of 6 measurements on a 24Mp file.
Times in ms; the time for "Lee refinement" is in brackets.
enh.steps  no.patch       patch_02
0          2660           1562
1          2660           1685
2          4620           2790
3          6671           4084
4          8576           5304
5          10050 (1485)   6915 (1510)
6          11195 (2600)   7947 (2640)
Memory consumption is reduced (peak-totalRT 1452MB vs 1261MB, peak-totalSystem 2300MB
vs 2100MB), but my machine still crashes immediately on big files. The largest I
could render was 28Mp, while 36Mp crashes. I will try some intermediate sizes to find out
whether there is any improvement. These totals are for the editor, so count around 500MB less
for the queue.
I don't think I am out of memory: RT also crashes in queue mode, although the peak system
memory consumption is lower than 3.0GB.
Reported by iliasgiarimis
on 2015-02-14 13:32:50
Reine, Ilias, thanks for testing :-)
Ilias, actually lmmse allocates two buffers. One of them is really big: (width+20)*(height+20)*6*4
bytes. Next patch will include a change that allocates 6 buffers of (width+20)*(height+20)*4
bytes in case the allocation of the big buffer fails.
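The fallback described above could look roughly like the following sketch. It is not the code from the patch: `PlaneSet`, `allocatePlanes` and `freePlanes` are hypothetical names, and the use of `new (std::nothrow)` is an assumption. Only the sizes, (width+20)*(height+20) floats per plane and 6 planes, come from the description.

```cpp
#include <cstddef>
#include <cstdio>
#include <new>

struct PlaneSet {
    float* plane[6] = {};     // start address of each of the 6 planes
    bool   oneBlock = false;  // true if all planes live in one allocation
};

bool allocatePlanes(PlaneSet& set, std::size_t width, std::size_t height)
{
    const std::size_t planeSize = (width + 20) * (height + 20);  // floats per plane

    // First try: one big block holding all 6 planes.
    float* big = new (std::nothrow) float[planeSize * 6];
    if (big) {
        for (int i = 0; i < 6; ++i) {
            set.plane[i] = big + i * planeSize;
        }
        set.oneBlock = true;
        return true;
    }

    // Fallback: 6 separate blocks, which need less contiguous address space.
    std::fprintf(stderr, "big block not available, trying 6 smaller blocks\n");
    set.oneBlock = false;
    for (int i = 0; i < 6; ++i) {
        set.plane[i] = new (std::nothrow) float[planeSize];
        if (!set.plane[i]) {
            for (int j = 0; j < i; ++j) {
                delete[] set.plane[j];
                set.plane[j] = nullptr;
            }
            return false;  // caller can report "out of memory" instead of crashing
        }
    }
    return true;
}

void freePlanes(PlaneSet& set)
{
    if (set.oneBlock) {
        delete[] set.plane[0];
    } else {
        for (float* p : set.plane) {
            delete[] p;
        }
    }
    for (float*& p : set.plane) {
        p = nullptr;
    }
}
```

On a 32-bit system the second path can succeed where the first fails, because six smaller holes in the address space are easier to find than one large one.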
Ingo
Reported by heckflosse@i-weyrich.de
on 2015-02-14 14:34:43
Reported by heckflosse@i-weyrich.de
on 2015-02-14 14:34:50
PatchSubmitted
Here's the patch with the changes mentioned in #6. Also a bit faster than last one.
Ingo
Reported by heckflosse@i-weyrich.de
on 2015-02-14 15:09:06
Ingo, thanks
On my machine, the new v3 version compared to v2 is:
- a bit slower at the basic LMMSE (enh.step 0) by about 2%, although this could be statistical
error.
- a bit faster at enh.step 1 by around 2%; it looks like applying the gamma got a lot
faster, by 50% :)
- each median step got 10-20% faster, making enh.steps 2-3-4 4%-8% faster
- a small speed increase in Lee refinement, and
- the crash now happens at even lower sizes: I can no longer process 28Mp files, which were no problem
for the previous versions. It's a sudden crash, without any message, as soon as I choose LMMSE, at any
enh.step :(
Reported by iliasgiarimis
on 2015-02-14 19:44:39
EDIT .. I was wrong about the crashes with v2 .. I have the same problem at the same
Mp limits as with v3 ..
Unpatched LMMSE works fine with 28Mp files.
Reported by iliasgiarimis
on 2015-02-14 20:00:59
Ilias, please post a link to a file which crashes. Perhaps there's an error in the code...
And please have a look at the console output. If the big block couldn't be allocated, there
should be a message. The number of enh.steps doesn't have an influence on the peak memory
usage of lmmse, btw.
The speedup in the last patch was only in the median steps.
Ingo
Reported by heckflosse@i-weyrich.de
on 2015-02-14 20:28:07
Ilias, there's a bug in my tricky memory addressing. I'll post a new patch once I have fixed
it.
Ingo
Reported by heckflosse@i-weyrich.de
on 2015-02-14 21:26:09
No message in the console; see the copy below:
Samsung NX1.badpixels not found
Preprocessing: 2200676 usec
Demosaic Bayer image using method: fast
Demosaicing Bayer data: fast - 691236 usec
Applying white balance, color correction & sRBG conversion...
setscale before lock
setscale starts (649, 433)
setscale ends
setscale ends2
ImProcCoordinator / Auto CT: indi=1 satH=0 satPR=0
setcropsizes before lock
setsizes starts (649, 433, -1, -1, 649, 433)
setsizes ends
setcropsizes before lock
setsizes starts (1082, 722, 649, 433, 1082, 722)
setsizes ends
Samsung NX1.badpixels not found
Preprocessing: 2056553 usec
Demosaic Bayer image using method: lmmse
Nothing more.
At the raw links here you can find two of the largest RAW samples (64Mp, 80Mp) to try (you can find more,
like the Nikon D810, in the model list instead of the Phase One):
http://www.dpreview.com/reviews/image-comparison/fullscreen?attr18=daylight&attr13_0=oly_em5ii&attr13_1=phaseone_iq180&attr15_0=raw&attr15_1=raw&attr16_0=200&attr16_1=35&attr126_0=highres&normalization=full&widget=194&x=0.00147034251&y=-0.00405094028
I am now starting to look for the no-crash limit by changing the raw crop for the Olympus E-M5MarkII.
camconst data:
{ // Quality X, experimental, new model with 16Mp and 64Mp raw frames
    "make_model": "OLYMPUS E-M5MarkII",
    "dcraw_matrix": [ 8461,-2320,-573,-3319,10974,2699,-1259,2049,5838 ], // D65, built on Dpreview P2050161a.DNG studio shot with x-rite's ColorChecker Passport utility
    // "dcraw_matrix": [ 8380,-2630,-639,-2887,10725,2496,-627,1427,5438 ], // Copy from E-M5 D65
    "raw_crop": [ 0, 0, -8, -8 ], // largest valid, full 64Mp 9280x6938, official crop 0 0 9216 6912
    "ranges": {
        "white": [
            { "iso": [ 100, 200 ], "levels": 3956 }, // normal 4080-4095, HR Dpreview 4047, IR 3956
            { "iso": [ 400, 800, 1600, 3200 ], "levels": 4070 }, // 4070-4095
            { "iso": [ 6400, 12800, 25600 ], "levels": 4040 } // 4000-4095
        ]
    }
},
change to "raw_crop": [ 0, 0, 6000, 5000 ], for 30Mp or any size you like up to 9280X6938
Reported by iliasgiarimis
on 2015-02-14 21:38:45
Ilias, please hold off on your tests until I have posted a new patch. The crashes are caused by accessing
a buffer out of bounds (independent of image size).
Ingo
Reported by heckflosse@i-weyrich.de
on 2015-02-14 21:48:17
This patch should fix the crashes caused by my buggy calculation of start address of
two buffers (image[1] and image[2]). At least valgrind doesn't report invalid reads
and writes anymore, where it did with patch 2 and patch 3.
Next patch will include a small speedup for the 'Lee refinement'
Ingo
Reported by heckflosse@i-weyrich.de
on 2015-02-14 22:47:47
Oh, I had closed Mozilla to run the tests ..
So far I have only tested with the unpatched RT :)
I can go up to 37.5Mp (7500x5000) with the Olympus file, but with the D810's 36Mp files RT crashed.
So I also suspected a problem with the bounds, I mean the sensor's bounds: I decreased
the D810's frame a little and could then use LMMSE.
So the problem existed before the recent patches, and the crash is as harsh as
when we had problems with the frame in camconst.json being larger than in Dcraw.cc.
Is LMMSE somehow attempting to add borders?
As for the crash at >37.5Mp: the peak total memory was around 2.8GB.
I will continue testing with the new patch, thanks :)
Reported by iliasgiarimis
on 2015-02-14 23:22:34
Ilias, I'm absolutely sure that this Issue will result in a good speedup and less memory
usage for lmmse, even though I introduce bugs with some patches ;-)
Ingo
Reported by heckflosse@i-weyrich.de
on 2015-02-14 23:38:10
Just to give an example of how error-prone patches can be:
Patch 02 had this wrong calculation:
image[1] = imageBuffer + ((height+1)/2)*(width+1)/2;
Patch 04 has this correct calculation:
image[1] = imageBuffer + ((height+1)/2)*((width+1)/2);
:-)
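The difference between the two lines is integer arithmetic and operator precedence: without the extra parentheses the final division is applied to the whole product. A small sketch, using `std::size_t` and hypothetical function names (the types in the actual code may differ):

```cpp
#include <cstddef>

// Offset as written in patch 02: evaluated left to right, so the last
// division by 2 applies to the product ((height+1)/2) * (width+1).
constexpr std::size_t offsetPatch02(std::size_t width, std::size_t height)
{
    return ((height + 1) / 2) * (width + 1) / 2;
}

// Offset as written in patch 04: both halves are rounded down first,
// then multiplied.
constexpr std::size_t offsetPatch04(std::size_t width, std::size_t height)
{
    return ((height + 1) / 2) * ((width + 1) / 2);
}

// Even width: the two expressions disagree.
static_assert(offsetPatch02(4, 6) == 7, "3 * 5 / 2 == 7");
static_assert(offsetPatch04(4, 6) == 6, "3 * (5 / 2) == 6");

// Odd width: (width + 1) is even, so both give the same result.
static_assert(offsetPatch02(5, 6) == offsetPatch04(5, 6), "3 * 6 / 2 == 3 * 3");
```

For even widths the patch 02 expression yields the larger value, so a buffer that starts at that offset begins later than intended.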
Ingo
Reported by heckflosse@i-weyrich.de
on 2015-02-14 23:58:00
I would bet that there is no error there :)
The speedup is already significant!! And the capacity is a lot higher: now I can demosaic
up to 45Mp.
I tested patch 4 on a 24Mp file; results are medians of six measurements.
enh.steps  no.patch       patch_04
0          2660           1767
1          2660           1827
2          4620           2800
3          6671           3983
4          8576           5085
5          10050 (1485)   6633 (1470)
6          11195 (2600)   7769 (2610)
A bit slower than patch_3. Total system memory consumption with 44Mp reaches 2.6GB
(in queue mode only 2.1GB), but RT still crashes on 45Mp :(
You will have a look, I know :) ..
Reported by iliasgiarimis
on 2015-02-15 01:57:42
This patch includes the small speedup for "Lee refinement" mentioned in #15. Also needs
width*height*12 bytes less memory in "Lee refinement", but that should have no influence
on peak memory usage of lmmse.
Ingo
Reported by heckflosse@i-weyrich.de
on 2015-02-15 15:12:00
If 1.7-2X !! faster is "small speedup" then what should we expect in the end ?? :)
enh.steps  no.patch       patch_04       patch_05
0          2660           1767           -
1          2660           1827           -
2          4620           2800           -
3          6671           3983           -
4          8576           5085           -
5          10050 (1485)   6633 (1470)    5873 (738)
6          11195 (2600)   7769 (2610)    6662 (1552)
Reported by iliasgiarimis
on 2015-02-15 18:30:37
Ilias, I don't know what to expect at the end of this issue. I just started ;-)
Reported by heckflosse@i-weyrich.de
on 2015-02-15 19:10:41
I would like to commit patch 05 before I continue optimizing lmmse. Any objections?
Ingo
Reported by heckflosse@i-weyrich.de
on 2015-02-16 11:22:45
No objections from me.
In fact it is safer to follow up the already committed issue2647 with the updated LMMSE,
as the default now is to use the pp3's demosaicing at opening. As LMMSE5 is faster
and has a larger capacity for big files, it must be committed.
There still remains the slight possibility of unexpected crashes when one transfers big
raws (>44Mb) from 64bit to 32bit machines ..
Reported by iliasgiarimis
on 2015-02-16 12:19:54
Committed to revision 3c75597e2f9d
Issue stays open for further improvements.
Reported by heckflosse@i-weyrich.de
on 2015-02-16 13:10:43
Today there's only a small lmmse speedup, less than 10% faster than the previous one. But
in #0 I already said I'll make small steps, so here's one of these small steps today
;-) Still wip...
Ingo
Reported by heckflosse@i-weyrich.de
on 2015-02-16 23:12:29
Next small step. Additional speedup for the median pass.
Ingo
Reported by heckflosse@i-weyrich.de
on 2015-02-17 13:12:28
No noticeable speedup this time, but peak memory usage is reduced by (width+20)*(height+20)*4
bytes. Still wip...
Ingo
Reported by heckflosse@i-weyrich.de
on 2015-02-17 22:40:36
Small speedup (about 5% to 10% faster than issue2665_08.patch) and another reduction
of peak memory usage by width*height*4 bytes.
Ingo
Reported by heckflosse@i-weyrich.de
on 2015-02-18 14:41:20
Looks good
        5820k           Phenom2 955
        2st.    6st.    2st.    6st.
Org     1250    -       3560    -
P0      1010    -       3380    -
P1       980    -       3280    -
P2       620    -       2080    -
P6       560    1135    1925    3580
P7       590    1170    1900    3515
P8       590    1190    1920    3540
P9       550    1150    1880    3500
One interesting thing is that p7 and p8 were a tad slower on my Intel, but on the
AMD p7 was faster than p6... interesting differences :). But they are quite small, so
nothing to write home about.
I checked the output between all runs (took some time as I checked all steps from 1
to 6 on all patches...). Looks good!
/Reine
Reported by reine.edvardsson
on 2015-02-18 21:26:07
Reine, thanks for testing :-)
The main target of the last patches was to reduce peak memory usage for systems which are
low on memory, though we also got a speedup by doing this.
Further speedups need tiled processing (as an alternative to tiled processing, striped
processing is also possible in some parts of the code) and SSE code (in that order).
The problem with tiled processing is the large border (10 pix on each side) that lmmse
currently uses. In tiled processing that leads to a big overhead from overlapping tiles
when we choose a small tile size. Though the border can be reduced by at least 2 pix
without influence on the output, it will still have an impact on the processing time of tiled
mode.
The refinement step can also be sped up by overlapping striped processing (that's
the first thing I'll try).
The problem with SSE code is that many loops increment by two columns, which is not
optimal for SSE.
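To put a number on the tile overhead, here is a small sketch. It assumes square tiles and that every tile has to be processed with the full border on all four sides; `tileOverhead` is a hypothetical helper, not part of RT.

```cpp
// Ratio of pixels processed to pixels kept for a square tile of
// tileSize x tileSize useful pixels with `border` extra pixels per side.
double tileOverhead(int tileSize, int border)
{
    const double edge = static_cast<double>(tileSize + 2 * border);
    const double processed = edge * edge;
    const double kept = static_cast<double>(tileSize) * tileSize;
    return processed / kept;
}
```

With a 64 pixel tile and a border of 10 this gives 84*84 / (64*64) = 1.72265625, so about 72% extra work. Reducing the border to 8 gives 80*80 / (64*64) = 1.5625. Larger tiles shrink the ratio but need more memory per thread.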
Ingo
Reported by heckflosse@i-weyrich.de
on 2015-02-18 22:59:27
I added SSE-code for Lee refinement.
Reported by heckflosse@i-weyrich.de
on 2015-02-19 15:50:05
I added SSE code for one part of the median step. We can also use this vectorized median
in other places where med3x3 is used.
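A 3x3 median can be written with nothing but min and max operations, which is what makes it easy to vectorize. The sketch below is a scalar illustration of that idea and not RT's med3x3 or the code from the patch; `med3` and `med9` are hypothetical names. In an SSE version each `std::min`/`std::max` would become `_mm_min_ps`/`_mm_max_ps` working on 4 pixels at once.

```cpp
#include <algorithm>

// Median of three values using only min/max (no branches).
inline float med3(float a, float b, float c)
{
    return std::max(std::min(a, b), std::min(std::max(a, b), c));
}

// Median of nine values (a 3x3 neighbourhood stored row by row).
// Each row is reduced to (lowest, middle, highest); the median of all nine
// is then the median of: the largest of the lows, the median of the
// middles and the smallest of the highs.
inline float med9(const float v[9])
{
    float lo[3], mid[3], hi[3];
    for (int r = 0; r < 3; ++r) {
        const float a = v[3 * r], b = v[3 * r + 1], c = v[3 * r + 2];
        lo[r]  = std::min(a, std::min(b, c));
        hi[r]  = std::max(a, std::max(b, c));
        mid[r] = med3(a, b, c);
    }
    const float maxOfLo  = std::max(lo[0], std::max(lo[1], lo[2]));
    const float minOfHi  = std::min(hi[0], std::min(hi[1], hi[2]));
    const float medOfMid = med3(mid[0], mid[1], mid[2]);
    return med3(maxOfLo, medOfMid, minOfHi);
}
```

Because there is no data-dependent branching, the same sequence of operations runs for every pixel, which is the property a vectorized version relies on.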
Ingo
Reported by heckflosse@i-weyrich.de
on 2015-02-20 00:20:31
Further SSE speedups will follow soon.
Ingo
Reported by heckflosse@i-weyrich.de
on 2015-02-20 00:52:36
On my Intel Core2Duo, 4GB, Win Vista 32-bit:
Very small speed improvements with patches 07-09 (around 1% for each patch), great
speed improvement for refinement with patch10!! Some results for patch10 are possibly
affected by something else running in the background .. :(
Very small improvement in Mpixel "capacity" with patch08, very significant improvement
with patch09, although both decrease the memory by almost the same amount, about width*height*4 bytes.
RT still crashes with no message when the Mpixels are over the limit. The no-crash limit
is not affected by using the queue (500MB less memory consumption by RT vs edit mode)!!
          no.patch     patch_04     patch_05     patch07      patch08      patch09      patch10
capacity  37.5Mp       44.5Mp       -            44.5Mp       45.3Mp       53.3Mp       53.3Mp
step0     2660         1767         -            1722         1714         1755         1802
step1     2660         1827         -            1813 med     1772 med     1779 med     1801 med
step2     4620         2800         -            2790(0985)   2729(0952)   2726(0943)   2781(0945)
step3     6671         3983         -            3916(2111)   3850(2066)   3832(2064)   3907(2076)
step4     8576         5085         -            5009(3203)   4934(3160)   4922(3153)   5005(3165)
          ref          ref          ref          ref          ref          ref          ref
step5     10050(1485)  6633(1470)   5873( 738)   5765( 734)   5755( 768)   5639( 719)   5475( 484)
step6     11195(2600)  7769(2610)   6662(1552)   6625(1508)   6468(1493)   6409(1498)   6091(1040)
Reported by iliasgiarimis
on 2015-02-20 11:01:49
Around 2.5-2.8X speed improvement for median passes with patch11 vs patch10 !!.
          no.patch     patch_04     patch_05     patch07      patch08      patch09      patch10      patch11
capacity  37.5Mp       44.5Mp       -            44.5Mp       45.3Mp       53.3Mp       53.3Mp       53.3Mp
step0     2660         1767         -            1722         1714         1755         1802         1766
step1     2660         1827         -            1813 med     1772 med     1779 med     1801 med     1781 med
step2     4620         2800         -            2790(0985)   2729(0952)   2726(0943)   2781(0945)   2158( 384)
step3     6671         3983         -            3916(2111)   3850(2066)   3832(2064)   3907(2076)   2590( 763)
step4     8576         5085         -            5009(3203)   4934(3160)   4922(3153)   5005(3165)   2935(1156)
          ref          ref          ref          ref          ref          ref          ref          ref
step5     10050(1485)  6633(1470)   5873( 738)   5765( 734)   5755( 768)   5639( 719)   5475( 484)   3426( 484)
step6     11195(2600)  7769(2610)   6662(1552)   6625(1508)   6468(1493)   6409(1498)   6091(1040)   3989(1060)
Reported by iliasgiarimis
on 2015-02-20 15:21:39
Good work! :)
Forgot to mention: I have Linux Mint x64 on both my machines. I am also running steps 1 to
6 and comparing for the sake of it, but I am too lazy to fill in all the columns
of times :).
        5820k           Phenom2 955
        2st.    6st.    2st.    6st.
Org     1250    -       3560    -
P0      1010    -       3380    -
P1       980    -       3280    -
P2       620    -       2080    -
P6       560    1135    1925    3580
P7       590    1170    1900    3515
P8       590    1190    1920    3540
P9       550    1150    1880    3500
P10      540     970    1835    2960
P11      460     625    1680    2510
/Reine
Reported by reine.edvardsson
on 2015-02-20 18:33:22
Ilias, Reine, thanks for testing. I added SSE code for another loop.
Ingo
Reported by heckflosse@i-weyrich.de
on 2015-02-20 20:46:24
Around 1.10X speed improvement for basic LMMSE with patch12 vs patch11 :) .. and a faster
2nd pass of Lee refinement: 2 passes now take 2X the time of 1 pass, while it used to be
>2X with patches 10-11.
          no.patch     patch_04     patch_05     patch07      patch08      patch09      patch10      patch11      patch12
capacity  37.5Mp       44.5Mp       -            44.5Mp       45.3Mp       53.3Mp       53.3Mp       53.3Mp       53.3Mp
step0     2660         1767         -            1722         1714         1755         1802         1766         1581
step1     2660         1827         -            1813 med     1772 med     1779 med     1801 med     1781 med     1649 med
step2     4620         2800         -            2790(0985)   2729(0952)   2726(0943)   2781(0945)   2158( 384)   1998( 384)
step3     6671         3983         -            3916(2111)   3850(2066)   3832(2064)   3907(2076)   2590( 763)   2363( 765)
step4     8576         5085         -            5009(3203)   4934(3160)   4922(3153)   5005(3165)   2935(1156)   2765(1148)
          ref          ref          ref          ref          ref          ref          ref          ref          ref
step5     10050(1485)  6633(1470)   5873( 738)   5765( 734)   5755( 768)   5639( 719)   5475( 484)   3426( 484)   3269( 484)
step6     11195(2600)  7769(2610)   6662(1552)   6625(1508)   6468(1493)   6409(1498)   6091(1040)   3989(1060)   3720( 973)
Reported by iliasgiarimis
on 2015-02-21 11:11:02
Ilias, thanks for testing.
Here's the patch I would like to commit (after removing the Stopwatches). I also would
like to close the Issue with this patch.
I cleaned up the code, made another very, very small speedup and introduced two SSE4.1
intrinsics for the users of native x64 builds (in case the CPU supports SSE4.1). One
of the SSE4.1 changes also has some influence on the speed of other parts of RT, but I
didn't benchmark these cases.
Thanks to Reine for helping me to find the (hopefully) correct way to include the SSE
header files for Linux.
Ingo
Reported by heckflosse@i-weyrich.de
on 2015-02-21 21:12:58
Linux Mint x64 on both machines, tested with 36MP D800 image.
        5820k           Phenom2 955
        2st.    6st.    2st.    6st.
Org     1250    2860    3560    8050
P0      1010    -       3380    -
P1       980    -       3280    -
P2       620    -       2080    -
P6       560    1135    1925    3580
P7       590    1170    1900    3515
P8       590    1190    1920    3540
P9       550    1150    1880    3500
P10      540     970    1835    2960
P11      460     645    1680    2510
P12      320     600    1550    2430
P13      300     580    1470    2300
There are differences in the image for p12 and p13 (for p13, the difference from p12
is only on my Intel, so related to AVX optimizations I guess), but only scattered pixels
according to ImageMagick compare. I tried to actually see a difference by looking at
the two images, but there wasn't anything that I could see... Most likely the differences
are one or two steps in the 16-bit TIFF, and my guess is rounding differences
(just a guess though :) ).
Speedup on the Intel machine: 4-5 times
Speedup on the AMD machine: 2-4 times
Fantastic work Ingo!
/Reine
Reported by reine.edvardsson
on 2015-02-21 22:25:45
Reine, thanks for testing and for your help with SSE includes on Linux. I forgot to
mention that with patch 13 I also enabled FMA (not AVX) in one part of the sleef library
for machines with the FMA feature. That can lead to very small differences between P12
and P13, because FMA has a bit higher precision (one less rounding step) for these
d = a+b*c operations.
The difference between P11 and P12 is caused by changing some a = b - a to a -= b, which
normally isn't correct, but in this case the results go into a SQR, so it doesn't really
matter.
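Both points can be shown with a few lines. This is only a sketch in double precision with hypothetical helper names (the sleef code in RT is not reproduced here): a fused multiply-add rounds once, a separate multiply and add round twice, and swapping the operands of a subtraction only flips the sign, which a square removes.

```cpp
#include <cmath>

inline double sqr(double x) { return x * x; }

// a + b*c with the product rounded before the addition (two roundings).
// The volatile store keeps the compiler from fusing the two operations.
inline double mulThenAdd(double a, double b, double c)
{
    volatile double product = b * c;
    return a + product;
}

// a + b*c computed as one operation with a single rounding at the end.
inline double fusedMulAdd(double a, double b, double c)
{
    return std::fma(b, c, a);
}
```

For example, with b = c = 1 + 2^-30 and a = -(1 + 2^-29), the exact product is 1 + 2^-29 + 2^-60. The separately rounded product loses the 2^-60 term, so `mulThenAdd` returns 0, while `fusedMulAdd` returns 2^-60. And `sqr(b - a)` equals `sqr(a - b)` for any a and b, because a - b is exactly the negation of b - a.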
Ingo
Reported by heckflosse@i-weyrich.de
on 2015-02-21 22:49:26
FMA explains it, as the old AMD does not support that :)
Thanks for the info!
/Reine
Reported by reine.edvardsson
on 2015-02-21 23:13:59
HDR DNG (*that* one), i7 CPU Q 820 @ 1.73GHz, GCC-4.9.2
Patch 00:
lmmse_interpolate_omp took 648 ms
lmmse_interpolate_omp took 578 ms
lmmse_interpolate_omp took 565 ms
lmmse_interpolate_omp took 538 ms
Patch 13:
median pass took 56 ms
lmmse_interpolate_omp took 380 ms
median pass took 56 ms
lmmse_interpolate_omp took 377 ms
median pass took 57 ms
lmmse_interpolate_omp took 399 ms
No differences in output. Green light for commit and thank you :)
Reported by entertheyoni
on 2015-02-22 01:11:56
Ingo, thanks
no objection to committing v13, although 32-bit Windows machines still crash on large files
(>53Mp). And it looks like it is not exactly a lack of memory, because RT crashes on the
same 54Mpixel files both in edit and in queue mode (around 700MB less memory consumption
with the queue).
I will test V13 tomorrow ..
Reported by iliasgiarimis
on 2015-02-22 01:15:30
DrSLony, thanks for testing!
Ilias, though I don't expect 'no crash' with large files on Win32, I'll wait with the commit
until you have tested V13 ;-)
Ingo
Reported by heckflosse@i-weyrich.de
on 2015-02-22 01:27:58
Ilias, we should think about the 'crashes'. It's no problem to avoid the crashes. But
right now I don't know how to communicate the 'avoided crash (out of memory)' to the user...
Ingo
Reported by heckflosse@i-weyrich.de
on 2015-02-22 01:39:38
No speed changes with patch13 vs patch12 .. The "nocrash" limit remains also the same
.. at 53.3Mp (
Reported by iliasgiarimis
on 2015-02-22 11:05:49
Ilias, it should be possible to reduce peak memory usage by another width*height*8 bytes.
I'll try that before commit.
Ingo
Reported by heckflosse@i-weyrich.de
on 2015-02-22 11:59:50
Ingo, what I don't understand is why there is no difference in the "no-crash
limit" between
- edit mode, where RT uses 1.15GB steadily and climbs to 2.22GB with LMMSE
- queue, where RT uses 35MB and climbs to 1.55GB with LMMSE
Reported by iliasgiarimis
on 2015-02-22 12:29:42
Originally reported on Google Code with ID 2665
Reported by heckflosse@i-weyrich.de
on 2015-02-11 21:06:05