Closed bobobo1618 closed 6 years ago
@bobobo1618 you're welcome to send pull requests :-) I will test them on Windows and Linux then
Ingo
I just got a bilinear debayer (I'm new to this and wanted something simple) running at 640MP/s on my laptop so I have high hopes for it :)
I'm looking into other demosaicing algorithms though, as some of the latest academic work has suggested that algorithms like DFPD are pretty promising. I'm curious how they compare to AMAZE, though, and couldn't see much literature mentioning it despite it being around for about 7 years. Is anyone aware of any comparisons?
Hmm, to debayer 640 MP in one second (independent of the method used) you need to transfer 640,000,000 floats = 2,560,000,000 bytes over the PCIe bus to the graphics card and 1,920,000,000 floats = 7,680,000,000 bytes back from it. That's 10,240,000,000 bytes ≈ 9,766 MB per second, which means you need PCIe 3.0 x16 or PCIe 2.0 x32. And that doesn't include the time to debayer the data.
Edit: I agree that for a debayer method like Amaze (which needs a lot more calculations than bilinear) a GPU version can be faster than fast multicore CPUs, but you always have to take transfer time into account, and Amaze is a lot harder to vectorize than bilinear.
Ingo
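Ingo's bandwidth arithmetic above can be written out as a quick sanity check (a sketch assuming one input float per pixel and three output floats per pixel; `pcieBytesPerSecond` is a made-up helper, not code from either project):

```cpp
#include <cstdint>

// Bytes that must cross the PCIe bus per second to debayer `megapixels`
// MP/s on a GPU: one float per pixel uploaded (the CFA data) and three
// floats per pixel downloaded (the demosaiced R, G and B values).
std::uint64_t pcieBytesPerSecond(std::uint64_t megapixels) {
    const std::uint64_t pixels   = megapixels * 1000000ULL;
    const std::uint64_t upload   = pixels * 1 * sizeof(float); // CFA in
    const std::uint64_t download = pixels * 3 * sizeof(float); // RGB out
    return upload + download;
}
```

At 640 MP/s this comes out to 10.24 GB/s, i.e. PCIe 3.0 x16 territory, before any compute time is counted.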
I suppose I should elaborate on that number a bit then (`dcraw -d`). I've run other tests as well though:
Those tests used 32-bit unsigned integers rather than 16-bit, so the element size was doubled.
I'd test on a proper GPU (I have a GTX 970 here) but it's in a Windows box and I can't be bothered dealing with the hell that is development on a non-UNIX system.
Ok, that fits better to what I get here. IIRC (didn't measure since a few months) bilinear debayer of float input and output on my CPU is about 360 MP/s (8-core AMD FX 4GHz). I measured using a D800 36MP file.
Edit: Amaze is at about 62 MP/s on my 8-core AMD (float in and float out)
Ingo
I get about the same when I run on 32-bit floats. I'm only running a dual core i7-5557U (~half as powerful) though. That's on a 24MP Sony A7 II's ARW.
Unfortunately (about the 'half as powerful'):

1) The AMD 8-core FX 8350 falls back to below-SSE performance when using AVX (4 AVX threads but 8 SSE threads).
2) About three years ago (when I bought it) I didn't take care to buy fast memory, so for algorithms like bilinear the performance is clearly limited by memory bandwidth :( For algorithms with a higher computational load it's a bit better though.
No, there haven't been any new quality evaluations of the Amaze demosaic for a long time. Amaze was (and is) the standard demosaic algorithm in RT (at least for low-ISO shots) since Emil (http://hamilton.uchicago.edu/~ejm/) implemented it. I made a lot of speedups (reorganizing memory, SSE instructions and so on) but never evaluated it against newer approaches. I'll take a look at the links you posted this week.
Edit: I really appreciate your approach to speeding up RT!!!
Ingo
There are also a couple of surveys (1, 2) that have a less targeted view. The interesting thing to me though is that the "Low Complexity Color Demosaicing Algorithm Based on Integrated Gradient" paper's algorithm competes with the latest and best performing algorithms despite using very little in the way of CPU time.
The papers are pretty easy to read as well so I'll probably end up implementing a couple of them just to help me learn. Writing Halide is fun :)
While you are at it, could you look at parallel computing using Halide? I mean that RT can process an image as a background task while the user edits another image that would also require Halide. I don't know how the resource (the GPU) can be shared across tasks (maybe a mutex would be required). I just wanted to point that out before you implement anything.
@bobobo1618 Jacques Desmis (https://github.com/Desmis) has some comparisons .. http://jacques.desmis.perso.neuf.fr/geraud/interpolation.php
@Hombre57 I'm not really the person you want looking into threading and mutexes. I come from Python, where the GIL prevents me from utilizing threads at all. I don't believe the GPU (or at least Halide) has the kind of scheduling you're after though. Once something's running (you call `Func.realize()`), there's no way I'm aware of to stop it. The only thing I can think of for you to do is to stop a new background task from running while a foreground task is already in progress. Not sure how much that'd help.
@iliasg that's helpful but those are more subjective comparisons. I'm looking more for objective PSNR/SSIM/S-CIELab comparisons. They may not be as accurate but they're easier to compare to the existing studies and easier to throw lots of images at. I might attempt to test myself, if I can figure out enough of the RT code to jam it into a test harness.
I am afraid there is a strong possibility that PSNR/SSIM etc. metrics can steer you in a wrong direction regarding demosaic quality. Also, the reference samples (Kodak set etc.) are not that good for digital photo evaluation (they are processed film scans with grain and halos etc.).
Hmm, that might be true. I'll try implementing the algorithms I'm interested in and run them on Jacques' test images then.
A very interesting frame is DPReview's studio still life shot. There we find some difficult areas.
Just to give an update, I have good news and bad news.
Good news:
Bad news:
This happens:
I likely need to have a look into the stuff that happens around edges. I suspect it's an overflow of some kind as it only seems to happen around extremes. Less extreme images seem to be fine (and pleasantly lacking in moire):
The code is here if anyone has a couple of minutes to look for glaring mistakes but I understand that Halide code is a little odd.
I am the only (?) one who cannot read code here .. but
You need to clip out the values outside the white level - black level range and scale the data using WB multipliers, i.e. calculate the average for each of R, (G1+G2)/2, B, or read the WB multipliers from the Exif, and then demosaic. Not sure about the exact order of BL/WL/WB .. The artifact you see could be from not applying WB before demosaic ;)
After demosaic, apply the colour matrix, or if that is difficult just provide a TIFF with raw colours and we can build an ICC profile which, applied as a custom input profile in RT, will give a correct rendering. We just need a ColorChecker in the shot or knowledge of the exact camera model used :)
Turned out just clamping the values before casting was all I needed to do so that's done.
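The clamp-before-cast fix amounts to something like this (a sketch; `toU16` and the 16-bit output range are assumptions for illustration, not the actual code):

```cpp
#include <algorithm>
#include <cstdint>

// Clamp a float sample into the representable range before casting;
// without the clamp, out-of-range values overflow the cast and show up
// as artifacts around extreme highlights and shadows.
std::uint16_t toU16(float v) {
    return static_cast<std::uint16_t>(std::clamp(v, 0.0f, 65535.0f));
}
```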
Full resolution sample tiff here
There are still some artifacts around the edges for me to look into. There were some things I omitted from the paper due to laziness (high/low pass filtering) that may help.
Sample of artifacts (zoom in):
Thanks for the info @iliasg, I'll look into it :)
Which camera is this raw sample from? Can you upload it?
I believe it's a Nikon D3x. It's `D3x_100.NEF` from Jacques' comparisons. There's a link to download it on his page.
OK, then you have no problems with black and white levels as they extend at the limits of 12bit (0-4095). The WB multipliers are R=2.03125 B=1.30078 G1=G2=1.00000
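Put together, these numbers suggest a per-sample preparation step roughly like the following (a sketch only; `prepareSample` is a made-up name and the exact order RT uses for black level, white level and WB may differ):

```cpp
#include <algorithm>

// Prepare one CFA sample before demosaic: subtract the black level,
// normalize by (white - black), clip, then apply the per-channel WB
// multiplier. For the D3x sample: black = 0, white = 4095,
// R = 2.03125, B = 1.30078, G1 = G2 = 1.0.
float prepareSample(float raw, float black, float white, float wbMult) {
    float v = (raw - black) / (white - black); // scale to [0, 1]
    v = std::clamp(v, 0.0f, 1.0f);             // clip at white level
    return v * wbMult;                         // white balance
}
```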
Fixed the WB a little somehow. Still some minor artifacting though.
I have a feeling this is related to the high/low pass filter the paper discussed.
Looks like it might be outperformed by Amaze though. I managed to run it on another of Jacques' samples and the moire example didn't seem to work too well, leaving things a little discoloured:
So it turns out my code had a lot of bugs. I've worked through most of them now and I can now produce images without artifacts and in the case of some cameras, images that are black levelled, white balanced, demosaiced, color space converted and gamma curved.
An example from the Moire image (there but minor):
And the chart (couldn't find a single artifact?):
I had a look at some other images and I found that it performs roughly equivalent to AMAZE without the false color suppression steps.
Working on performance now. It takes ~10 s for a 16 MP image right now (although that includes white balancing and whatnot too).
Promising result :)
although the pullover has the wrong colour; on the Nikon D70 red and blue are in inverse order ..

```
B G2
G1 R
```

instead of the usual

```
R G1
G2 B
```
Ohhh, that's what that was. Fixed.
I should read the CFA pattern from Exif I guess.
I guess that until you can use RT's structure for raw decoding you should port dcraw's decoding to Halide, but as there are too many raw formats .. a good start would be to port just the DNG-related code and convert any test file you like to DNG :)
BTW, how do you plan RT's cooperation with the Halide code?
I'm already using libraw (dcraw) to handle the file loading; the problem is that there's no standard when it comes to CFA layout :)
Not sure yet, I'll read through it a bit later and see if I can jam this into RT right now.
In dcraw you can check the CFA layout with the macro FC(row,col). It returns the CFA colour of the corresponding pixel.
Looks like that's the same in the RT code. There are only 4 possible layouts (RGGB, GRBG, GBRG, BGGR) to work with, so I'll just test the first two pixels to figure out which it is.
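The two-pixel test could look something like this (a sketch; `fc` stands in for dcraw/RT's `FC(row,col)` macro, assumed here to return 0 = R, 1 = G, 2 = B):

```cpp
#include <string>

// Identify which of the four 2x2 Bayer layouts a sensor uses from the
// colours of the first two pixels of the first row. Each layout has a
// unique (colour at (0,0), colour at (0,1)) pair.
template <typename FC>
std::string bayerLayout(FC fc) {
    const int a = fc(0, 0), b = fc(0, 1);
    if (a == 0 && b == 1) return "RGGB";
    if (a == 1 && b == 0) return "GRBG";
    if (a == 1 && b == 2) return "GBRG";
    if (a == 2 && b == 1) return "BGGR";
    return "unknown";
}
```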
I'm going to work on performance a bit and then work on getting it into RT. Looking at `fast_demo.cc` for an example, it looks like it should be pretty straightforward as far as code is concerned. I think the main problem is mapping RT's `rawData` into Halide's `Image`. I think, if `array2D` is laid out rows first, I can just wrap a `buffer_t` around the `data` pointer and construct an `Image`. If I can do that, I can pass it straight through to Halide and be done.

Can anyone more familiar with RT comment on how `array2D` and particularly `rawData` are laid out in memory?

And is there a way I can constrain the Bayer patterns that my demosaic will work with, or can I assume they're all 2x2 squares with B and R, G1 and G2 diagonally opposite?
array2D (and so rawData too) is a contiguous block of memory, and the layout is rows first. In rtengine/rawimage.h there are functions to check the sensor layout. You should check for isBayer().
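A rows-first contiguous block maps to coordinates like this (an illustration of the layout only, not RT's actual `array2D` internals; both helper names are made up):

```cpp
#include <cstddef>
#include <vector>

// In a rows-first (row-major) W x H block, sample (row, col) lives at
// offset row * W + col.
inline std::size_t rowMajorIndex(std::size_t row, std::size_t col,
                                 std::size_t W) {
    return row * W + col;
}

// A table of row pointers into the contiguous block -- the shape an
// array2D-style wrapper (or a buffer_t with a row stride of W) needs.
std::vector<float*> rowPointers(float* data, std::size_t W, std::size_t H) {
    std::vector<float*> rows(H);
    for (std::size_t y = 0; y < H; ++y)
        rows[y] = data + y * W;
    return rows;
}
```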
How should I respond if the sensor isn't Bayer? Looking at the other demosaics (like AMAZE), it looks like they simply continue anyway?
Also I got the consent of the authors of this paper to contribute my implementation of their algorithm (I felt this was important).
I'm thinking of porting over the nyquist filtering from AMAZE as well though, since the main difficulty my implementation is having is on textiles.
That's checked in rtengine/rawimagesource.cc line 1785..
On the subject of performance, I've got my implementation running at ~7MP/s right now while AMAZE seems to run at ~3.2MP/s on my laptop. Do those numbers sound reasonable? Too low? Too high? I'm far from done when it comes to tuning but I'm wondering what I should aim for for this to be considered worthwhile.
@bobobo1618 On my 8-core AMD AMAZE needs ~ 600 ms for a D800 (36 MP) file, which is ~ 60 MP/s.
Wow, I must be doing something very wrong with RT then. How are you getting that number? I just switched to another demosaic in RT, zoomed out all the way, then switched back and timed how long it took to load again. Is there a better way to do it?
Oh, don't worry. I built RT in debug mode. That was silly of me. Back to performance I go...
Which revision of RT do you use? If you want to measure in RT just use the StopWatch, i.e. to measure Amaze insert this line: StopWatch measure("Amaze"); at line 40 of amaze_demosaic_RT.cc.
You have to add the corresponding `#include` too.
I always measure in the queue (put the image into the queue 7 times, start the queue) and take the median of these 7 values. Measuring in the queue is the best method because it avoids the influence of progress-bar updates. Your 3.2 MP/s seems really very low. My old laptop (Pentium Dual-Core T4500 @ 2.3 GHz) needs ~1.7 seconds for a D700 file (12 MP), which is ~7 MP/s.
Ingo
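The median-of-7 idea is just the following (a trivial sketch, not RT code):

```cpp
#include <algorithm>
#include <vector>

// Median of an odd number of timing samples (e.g. the 7 queue runs);
// unlike the mean, it shrugs off a one-off outlier such as a run
// disturbed by a progress-bar update.
double medianMs(std::vector<double> samples) {
    std::nth_element(samples.begin(),
                     samples.begin() + samples.size() / 2,
                     samples.end());
    return samples[samples.size() / 2];
}
```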
:)
I'm building from Git.
Seems I have a long way to go on performance. I'm at ~15MP/s now but I have a feeling I'll be able to push it past 60MP/s with a bit of work. Just need to figure out how CPUs work...
@bobobo1618 I mostly got the biggest performance gains by reducing memory transfers which often is accomplished by changing the layout of data in memory.
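A minimal illustration of why layout matters: both loops below compute the same sum over a row-major image, but the first streams through memory in storage order while the second strides by W floats per access and defeats the cache (the helpers are illustrative only):

```cpp
#include <cstddef>
#include <vector>

// Traverse a row-major W x H image in storage order: one linear sweep
// through memory, cache- and prefetcher-friendly.
float sumRowOrder(const std::vector<float>& img, std::size_t W, std::size_t H) {
    float s = 0.0f;
    for (std::size_t y = 0; y < H; ++y)
        for (std::size_t x = 0; x < W; ++x)
            s += img[y * W + x];
    return s;
}

// Same result, but column-first: every access jumps W floats ahead,
// so each cache line is fetched many times for wide images.
float sumColumnOrder(const std::vector<float>& img, std::size_t W, std::size_t H) {
    float s = 0.0f;
    for (std::size_t x = 0; x < W; ++x)
        for (std::size_t y = 0; y < H; ++y)
            s += img[y * W + x];
    return s;
}
```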
Okay, I haven't made many gains on performance but I've got the code into RawTherapee. The problem is that Halide won't compile when embedded in RT. When I take the exact same code (as in, compile the exact same `.cpp` inside RT's source tree) into another application, calling it with the same arguments, it runs fine. When I compile it into RT with the same compiler, same Halide library etc., it fails as per the Halide bug I filed.
I'm not intimately familiar with the internals of either of these projects but I've filed a bug with Halide and it'd be great if any of you have any clue what's going on, since the issue seems unique to RT.
(I have an example of the exact same code working without problems outside RT here)
Oh, and I should be clear that the compilation failure is in the JIT compilation of the Halide code, not the compilation of the RawTherapee binary (the failure happens at runtime). Everything builds and links fine at the C++ level.
Okay, from what the Halide guys are saying it seems it's likely due to Halide using a deep recursive stack during compilation.
I've AOT compiled the code and integrated it with RT, but now the issue is extracting the data from Halide's buffer and getting it into RT's RGB planes. As I understand it, I need to fill the `red`, `green` and `blue` `array2D<float>` objects. Halide's memory is laid out in row-major planes of each colour in RGB order, so what I've been doing is grabbing pointers to the start of each plane (the address of the first pixel) and trying to use that to build an array2D:
```cpp
// output_image.address_of(x, y, c) returns a void *pointer.
float *redaddr = (float *)output_image.address_of(0, 0, 0);
float *greenaddr = (float *)output_image.address_of(0, 0, 1);
float *blueaddr = (float *)output_image.address_of(0, 0, 2);
red = *(new array2D<float>(W, H, (float **) &redaddr, 0));
green = *(new array2D<float>(W, H, (float **) &greenaddr, 0));
blue = *(new array2D<float>(W, H, (float **) &blueaddr, 0));
```
This currently returns `EXC_BAD_ACCESS` (a Mac segfault, I believe) when the array2D constructor attempts to copy the contents (at the first instance).
I'm new to C so are there any obvious mistakes in the above?
Maybe this works?

```cpp
float *redaddr[H];
float *greenaddr[H];
float *blueaddr[H];
for (int i = 0; i < H; i++) {
    redaddr[i] = (float *)output_image.address_of(i, 0, 0);
    greenaddr[i] = (float *)output_image.address_of(i, 0, 1);
    blueaddr[i] = (float *)output_image.address_of(i, 0, 2);
}
red = *(new array2D<float>(W, H, redaddr, 0));
green = *(new array2D<float>(W, H, greenaddr, 0));
blue = *(new array2D<float>(W, H, blueaddr, 0));
```
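Without Halide at hand, the per-row-pointer idea can be sketched against a plain planar row-major buffer (the layout described earlier: three W×H float planes in R, G, B order). Here `planeRow` merely plays the role of `output_image.address_of`; everything below is an assumption-laden illustration, not RT or Halide code:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// In a planar, row-major RGB buffer, plane c starts at offset
// c * W * H, and row y of that plane at c * W * H + y * W.
float* planeRow(float* data, std::size_t W, std::size_t H,
                std::size_t y, std::size_t c) {
    return data + c * W * H + y * W;
}

// Copy one colour plane out into its own W*H block, row by row --
// the same shape as filling an array2D from a row-pointer table.
std::vector<float> extractPlane(float* data, std::size_t W, std::size_t H,
                                std::size_t c) {
    std::vector<float> plane(W * H);
    for (std::size_t y = 0; y < H; ++y) {
        float* src = planeRow(data, W, H, y, c);
        std::copy(src, src + W, plane.begin() + y * W);
    }
    return plane;
}
```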
It turned out that I wasn't initializing the output buffer object for Halide correctly. Now that I've fixed that, I'm having issues with Halide. Working with them to get over that now...
So I got the Halide code working and the buffer filling working (Halide code is now running in RT)! Halide's example code (rather than the docs and tutorials) fixed the buffer issues I was running into, and @heckflosse's suggestion got the buffers filling properly.
So my implementation of the new algorithm, while slow, is now integrated with RT.
Next steps are:
Once those are done I'll send a PR.
After that I'll look into more fun things like using the authors' followup enhancements (which they've suggested to me may help a bit with moire) and jamming Halide into more parts of the codebase.
Fine!! Can you convert a Nikon D810 raw (a sample from a sensor with no AA filter ..) to 16-bit TIFF using RT's Neutral profile and upload it? http://www.dpreview.com/reviews/image-comparison/download-image?s3Key=68bbe934a667407ebbda0b85267c8418.nef
Done!
Not sure if you guys are aware of this but Halide seems pretty cool. It's open source, MIT licensed (GPLv3 compatible? I'm not sure) and through a single API can generate code for "x86/SSE, ARM v7/NEON, CUDA, Native Client, and OpenCL".
Given that it enables simpler SSE implementation and easier use of CUDA/OpenCL, it seems like something that could be a good idea for performance.
If I get time to send pull requests implementing/porting image processing to Halide, are they likely to be approved?