bobobo1618 commented 9 years ago

Not sure if you guys are aware of this but Halide seems pretty cool. It's open source, MIT licensed (GPLv3 compatible? I'm not sure) and through a single API can generate code for "x86/SSE, ARM v7/NEON, CUDA, Native Client, and OpenCL".

Given that it enables simpler SSE implementation and easier use of CUDA/OpenCL, it seems like something that could be a good idea for performance.

If I get time to send pull requests implementing/porting image processing to Halide, are they likely to be approved?

heckflosse commented 9 years ago

@bobobo1618 you're welcome to send pull requests :-) I will test them on Windows and Linux then

Ingo

bobobo1618 commented 9 years ago

I just got a bilinear debayer (I'm new to this and wanted something simple) running at 640MP/s on my laptop so I have high hopes for it :)

I'm looking into other demosaicing algorithms though, as some of the latest academic work has suggested that algorithms like DFPD are pretty promising. I'm curious how they compare to AMAZE though and couldn't find much literature mentioning it, despite it having been around for about 7 years. Is anyone aware of any comparisons?

heckflosse commented 9 years ago

Hmm, to debayer 640 MP (independent of the method used) you need to transfer 640000000 floats = 2560000000 bytes over the PCIe bus to the graphics card and 1920000000 floats = 7680000000 bytes back from the graphics card. To do that in one second you have to transfer 10240000000 bytes = 9766 MBytes per second, which means you need PCIe 3.0 x16 or PCIe 2.0 x32. And that doesn't include the time to debayer the data.
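The transfer estimate above can be reproduced with a few lines of C++. This is only a sketch of the arithmetic: it assumes one 4-byte float per pixel going to the card and three 4-byte floats per pixel coming back, and the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Bytes moved across the bus to demosaic `pixels` Bayer samples on a
// discrete GPU: one float per pixel uploaded, three floats (R, G, B) per
// pixel downloaded. Float size of 4 bytes is assumed.
std::uint64_t transfer_bytes(std::uint64_t pixels) {
    const std::uint64_t float_bytes = 4;
    const std::uint64_t upload = pixels * float_bytes;        // raw mosaic in
    const std::uint64_t download = pixels * 3 * float_bytes;  // RGB planes out
    return upload + download;
}

// Convert bytes to binary megabytes (1 MByte = 1024 * 1024 bytes).
double to_mbytes(std::uint64_t bytes) {
    assert(bytes > 0);
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}
```

Calling `to_mbytes(transfer_bytes(640000000))` gives the per-second figure for a 640 MP/s rate on a discrete card.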

Edit: I agree that for a debayer method like Amaze (which needs a lot more calculations than bilinear) a GPU version can be faster than fast multicore cpus, but you always have to take transfer time into account and Amaze is a lot harder to vectorize than bilinear.

Ingo

bobobo1618 commented 9 years ago

I suppose I should elaborate on that number a bit then:

Debayer was run entirely on CPU. GPU wasn't touched.
Debayer was pre-compiled (didn't compile it 100 times).
Debayer was run 100 times, the figure I gave is the average over that time.
Data was 8-bit (dcraw -d).
I was operating on 16-bit unsigned integers (large enough to contain the sum of two 8-bit integers), not floats.
Data was in memory prior to running the benchmark.
Data was 16MP, the 640MP/s I mentioned was the rate (16/(time_to_compute)). I don't actually have a 640MP image to test.
Halide was using AVX, AVX2 and SSE4.1 instructions, as well as parallelisation.
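A benchmark of the kind described in this list could look roughly like the sketch below. It is not the harness actually used here; `mean_seconds` and `megapixels_per_second` are hypothetical helpers, and the kernel is whatever pre-compiled callable you want to time.

```cpp
#include <cassert>
#include <chrono>

// Throughput in megapixels per second for one run over `pixels` pixels.
double megapixels_per_second(double pixels, double seconds) {
    assert(seconds > 0.0);
    return pixels / 1e6 / seconds;
}

// Call `kernel` `runs` times and return the mean wall-clock seconds per run.
// The kernel should already be compiled and its input already in memory,
// so that only the processing itself is timed.
template <typename Kernel>
double mean_seconds(Kernel&& kernel, int runs) {
    assert(runs > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i) {
        kernel();
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / runs;
}
```

For a 16 MP image, a mean of 0.025 s per run corresponds to the 640 MP/s quoted above.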

I've run other tests as well though:

8-bit + GPU (Intel Iris 6100, Metal API) = 800MP/s.
16-bit (DNG) + CPU = 342MP/s.
16-bit (DNG) + GPU (Intel Iris 6100, Metal API) = 406MP/s (this includes the cost of copying from RAM to GPU and back, although since this is an Iris GPU, I believe it shares system RAM).
16-bit (DNG) + GPU (Intel Iris 6100, OpenCL API) = 404MP/s.

Those tests used 32-bit unsigned integers rather than 16-bit since the element size was doubled.

I'd test on a proper GPU (I have a GTX 970 here) but it's in a Windows box and I can't be bothered dealing with the hell that is development on a non-UNIX system.

heckflosse commented 9 years ago

Ok, that fits better with what I get here. IIRC (I haven't measured for a few months) bilinear debayer of float input and output on my CPU is about 360 MP/s (8-core AMD FX 4GHz). I measured using a D800 36MP file.

Edit: Amaze is at about 62 MP/s on my 8-core AMD (float in and float out)

Ingo

bobobo1618 commented 9 years ago

I get about the same when I run on 32-bit floats. I'm only running a dual core i7-5557U (~half as powerful) though. That's on a 24MP Sony A7 II's ARW.

heckflosse commented 9 years ago

Unfortunately (about the 'half as powerful'):

1) The 8-core AMD FX 8350 falls back to below SSE performance when using AVX (4 AVX threads but 8 SSE threads).
2) About three years ago (when I bought it) I didn't take care to buy fast memory, so for algorithms like bilinear the performance is clearly limited by memory bandwidth :( For algorithms with a higher computational load it's a bit better though.

bobobo1618 commented 9 years ago

Fair enough, I was just looking at the number for a vague idea of how it compared.

What about the debayer algorithm quality performance though? I'm thinking of this or DFPD but I don't know if AMAZE already surpasses it or not. Have there been any quality evaluations of AMAZE?

heckflosse commented 9 years ago

No, there have not been any new quality evaluations of Amaze demosaic for a long time. Amaze was (and is) the standard demosaic algorithm in RT (at least for low ISO shots) since Emil http://hamilton.uchicago.edu/~ejm/ implemented it. I made a lot of speedups (reorganizing memory, SSE instructions and so on) but never evaluated it against newer approaches. I'll take a look at the links you posted this week.

Edit: I really appreciate your approach to speed up rt!!!

Ingo

bobobo1618 commented 9 years ago

There are also a couple of surveys (1, 2) that have a less targeted view. The interesting thing to me though is that the "Low Complexity Color Demosaicing Algorithm Based on Integrated Gradient" paper's algorithm competes with the latest and best performing algorithms despite using very little in the way of CPU time.

The papers are pretty easy to read as well so I'll probably end up implementing a couple of them just to help me learn. Writing Halide is fun :)

Hombre57 commented 9 years ago

While you are at it, could you look at parallel computing using Halide? I mean that RT can process an image as a background task while the user edits another image, which would require Halide too. I don't know how the resource (the GPU) can be shared across tasks (maybe a mutex would be required). I just wanted to point that out before anything gets implemented.

iliasg commented 9 years ago

@bobobo1618 Jacques Desmis (https://github.com/Desmis) has some comparisons .. http://jacques.desmis.perso.neuf.fr/geraud/interpolation.php

bobobo1618 commented 9 years ago

@Hombre57 I'm not really the person you want looking into threading and mutexes. I come from Python, where the GIL prevents me from utilizing threads at all. I don't believe the GPU (or at least Halide) has the kind of scheduling you're after though. Once something's running (you call Func.realize), there's no way I'm aware of to stop it. The only thing I can think of for you to do is to stop a new background task running while a foreground task is already in progress. Not sure how much that'd help.
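The "don't start a background task while a foreground task is in progress" idea can be sketched with a small gate object. This is purely illustrative: `ForegroundGate` is a hypothetical class, not something that exists in RT or Halide, and it assumes the caller is responsible for bracketing foreground work with `begin_foreground()` / `end_foreground()`.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical gate: a background job may only start while no foreground
// job is running. It does not interrupt work that has already started.
class ForegroundGate {
public:
    void begin_foreground() {
        std::lock_guard<std::mutex> lock(mutex_);
        ++foreground_jobs_;
    }

    void end_foreground() {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(foreground_jobs_ > 0);
        --foreground_jobs_;
    }

    // True if a background job is allowed to start right now.
    bool may_start_background() {
        std::lock_guard<std::mutex> lock(mutex_);
        return foreground_jobs_ == 0;
    }

private:
    std::mutex mutex_;
    int foreground_jobs_ = 0;
};
```

Note the limitation already mentioned above: a foreground job that begins just after the check still has to wait for the running background job, since nothing here can stop work in flight.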

@iliasg that's helpful but those are more subjective comparisons. I'm looking more for objective PSNR/SSIM/S-CIELab comparisons. They may not be as accurate but they're easier to compare to the existing studies and easier to throw lots of images at. I might attempt to test myself, if I can figure out enough of the RT code to jam it into a test harness.
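For reference, PSNR is the simplest of these metrics to compute. The sketch below uses the standard definition, PSNR = 10 * log10(peak^2 / MSE), over flat sample arrays; the function name and the choice of `double` samples are assumptions for illustration, and it says nothing about how RT output would be fed into it.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Peak signal-to-noise ratio in dB between a reference image and a test
// image, both given as flat sample arrays of equal length. `peak` is the
// maximum possible sample value (e.g. 255 for 8-bit data).
double psnr(const std::vector<double>& reference,
            const std::vector<double>& test,
            double peak) {
    assert(!reference.empty() && reference.size() == test.size());
    double squared_error = 0.0;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        const double diff = reference[i] - test[i];
        squared_error += diff * diff;
    }
    const double mse = squared_error / static_cast<double>(reference.size());
    if (mse == 0.0) {
        // Identical images: the ratio is unbounded.
        return std::numeric_limits<double>::infinity();
    }
    return 10.0 * std::log10(peak * peak / mse);
}
```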

iliasg commented 9 years ago

I am afraid there is a strong possibility that PSNR/SSIM etc. metrics can point you in the wrong direction regarding demosaic quality. Also, the reference samples (Kodak set etc.) are not that good for digital photo evaluation (they are processed film scans with grain, halos etc.).

bobobo1618 commented 9 years ago

Hmm, that might be true. I'll try implementing the algorithms I'm interested in and run them on Jacques' test images then.

iliasg commented 9 years ago

A very interesting frame is Dpreview's studio still life shot. There we find some difficult areas:

text at the center of the frame which is prone to false color, especially with sensors with no AA filter (LR fails miserably there, Amaze is much better but still gives some artifacts, LMMSE and IGV are fine)
Color waveplates (concentric rings) where Amaze excels with its anti-zipper feature, LMMSE/IGV fail miserably
the painting which is very prone to aliasing/moire
many high detail areas both in high and low contrast ..

bobobo1618 commented 9 years ago

Just to give an update, I have good news and bad news.

Good news:

I implemented the algorithm from the paper in Halide and it mostly works (hasn't been hooked into RT yet of course)

Bad news:

This happens:

I likely need to have a look into the stuff that happens around edges. I suspect it's an overflow of some kind as it only seems to happen around extremes. Less extreme images seem to be fine (and pleasantly lacking in moire):
I haven't optimised it at all yet and it's slow.
My testing infrastructure lacks other basics of RAW processing like black/white levelling, colour matrices, white balance and curves for now so samples are going to be pure debayered data.

The code is here if anyone has a couple of minutes to look for glaring mistakes but I understand that Halide code is a little odd.

iliasg commented 9 years ago

I am the only (?) one who cannot read code here .. but

You need to clip out the values outside the white level - black level range and scale the data using WB multipliers (i.e. calculate the average for each of R, (G1+G2)/2 and B, or read the WB multipliers from the Exif) and then demosaic. Not sure about the exact order BL/WL/WB .. The artifact you see could be from not applying WB before demosaic ;)

After demosaic, apply the color matrix or, if that is difficult, just provide a tiff with raw colors and we can build an ICC profile which, applied as a custom input profile in RT, will give a correct rendering. We just need a ColorChecker in the shot or just the knowledge of the exact camera model used :)

bobobo1618 commented 9 years ago

Turned out just clamping the values before casting was all I needed to do so that's done.
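In plain C++ terms, "clamping before casting" amounts to something like the sketch below. It is not the Halide code from this thread; the 16-bit output range and the function name are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>

// Convert a demosaiced sample to 16 bits, saturating at the ends of the
// range instead of wrapping around. Interpolated values can overshoot
// near strong edges, which is where a bare cast would overflow.
std::uint16_t to_u16_saturated(float value) {
    // The comparisons are written so that NaN falls through to 0.
    const float lower_bounded = (value > 0.0f) ? value : 0.0f;
    const float clamped = (lower_bounded < 65535.0f) ? lower_bounded : 65535.0f;
    return static_cast<std::uint16_t>(clamped);
}
```

Without the clamp, a value such as 70000 does not fit into 16 bits, which would explain artifacts that only show up around extremes.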

Full resolution sample tiff here

There are still some artifacts around the edges for me to look into. There were some things I omitted from the paper due to laziness (high/low pass filtering) that may help.

Sample of artifacts (zoom in): test

Thanks for the info @iliasg, I'll look into it :)

iliasg commented 9 years ago

Which camera is this raw sample from? Can you upload it?

bobobo1618 commented 9 years ago

I believe it's a Nikon D3x. It's D3x_100.NEF from Jacques' comparisons. There's a link to download it on his page.

iliasg commented 9 years ago

OK, then you have no problems with black and white levels as they extend at the limits of 12bit (0-4095). The WB multipliers are R=2.03125 B=1.30078 G1=G2=1.00000
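With those numbers, the pre-demosaic step (black level, white level, WB multipliers) can be sketched as follows. The order used here (subtract black, clip, normalise, then multiply) is one plausible choice rather than a confirmed one, and `Levels`, `Cfa` and `prepare_sample` are hypothetical names for the example.

```cpp
#include <cassert>

enum class Cfa { Red, Green, Blue };

struct Levels {
    float black;
    float white;
    float wb_red;
    float wb_green;
    float wb_blue;
};

// Values quoted in this thread for the D3x sample (12-bit, 0..4095).
Levels d3x_levels() { return {0.0f, 4095.0f, 2.03125f, 1.0f, 1.30078f}; }

// Subtract black, clip to the valid range, normalise to 0..1, then apply
// the white balance multiplier of the sample's CFA colour.
float prepare_sample(float raw, Cfa colour, const Levels& lv) {
    assert(lv.white > lv.black);
    float v = (raw - lv.black) / (lv.white - lv.black);
    if (v < 0.0f) v = 0.0f;
    if (v > 1.0f) v = 1.0f;
    switch (colour) {
        case Cfa::Red:   return v * lv.wb_red;
        case Cfa::Green: return v * lv.wb_green;
        case Cfa::Blue:  return v * lv.wb_blue;
    }
    return v;
}
```

After this step red and blue can exceed 1.0, so any later conversion to integers still needs its own clamp.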

bobobo1618 commented 9 years ago

Fixed the WB a little somehow. Still some minor artifacting though.

test2

I have a feeling this is related to the high/low pass filter the paper discussed.

Looks like it might be outperformed by Amaze though. I managed to run it on another of Jacques' samples and the Moire example didn't seem to work too well, leaving things a little discoloured:

test3

bobobo1618 commented 9 years ago

So it turns out my code had a lot of bugs. I've worked through most of them now and I can now produce images without artifacts and in the case of some cameras, images that are black levelled, white balanced, demosaiced, color space converted and gamma curved.

An example from the Moire image (there but minor): moire

And the chart (couldn't find a single artifact?):

chartcrop

I had a look at some other images and I found that it performs roughly equivalent to AMAZE without the false color suppression steps.

Working on getting it performing now. It takes ~10s for a 16MP image right now (although that includes white balancing and whatnot too).

iliasg commented 9 years ago

Promising result :) although the pullover has the wrong color: on the Nikon D70 red and blue are in inverse order .. B G2 G1 R instead of the usual R G1 G2 B

bobobo1618 commented 9 years ago

Ohhh, that's what that was. Fixed.

test

I should read the CFA pattern from Exif I guess.

iliasg commented 9 years ago

I guess that until you can use RT's structure for raw decoding you should port dcraw's decoding to Halide, but as there are too many raw formats .. a good start would be to port just the DNG-related code and convert any test file you like to DNG :)

BTW, how do you plan for RT to cooperate with the Halide code?

bobobo1618 commented 9 years ago

I'm already using libraw (dcraw) to handle the file loading, the problem is that there's no standard when it comes to CFA layout :)

Not sure yet, I'll read through it a bit later and see if I can jam this into RT right now.

heckflosse commented 9 years ago

in dcraw you can check the cfa layout with the macro FC(row,col). It returns the CFA colour of the corresponding pixel

bobobo1618 commented 9 years ago

Looks like that's the same in the RT code. There are only 4 possible layouts (RGGB, GRBG, GBRG, BGGR) to work with so I'll just test two of the first pixels to figure out which it is.
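Testing the first two pixels could look like the sketch below. The `fc` function follows the bit layout of dcraw's `FC(row,col)` macro as I recall it, and the four `filters` constants in the usage note are the ones I remember from dcraw's source comments; treat both as assumptions and check them against the dcraw or RT source before relying on them.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Colour index at (row, col): 0 = R, 1 = G, 2 = B, 3 = second green in
// some decoders. Mirrors dcraw's FC macro as recalled (an assumption).
int fc(std::uint32_t filters, int row, int col) {
    return static_cast<int>(
        (filters >> ((((row << 1) & 14) + (col & 1)) << 1)) & 3u);
}

// Identify the 2x2 Bayer layout from the first two pixels of row 0.
std::string bayer_layout(std::uint32_t filters) {
    // Fold a possible "second green" index onto plain green.
    auto fold = [](int c) { return c == 3 ? 1 : c; };
    const int first = fold(fc(filters, 0, 0));
    const int second = fold(fc(filters, 0, 1));
    if (first == 0 && second == 1) return "RGGB";
    if (first == 1 && second == 0) return "GRBG";
    if (first == 1 && second == 2) return "GBRG";
    if (first == 2 && second == 1) return "BGGR";
    return "unknown";
}
```

With the constants as recalled, `0x94949494` maps to RGGB, `0x61616161` to GRBG, `0x49494949` to GBRG and `0x16161616` to BGGR. Two pixels are enough because the top-left pair is different in each of the four layouts.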

I'm going to work on performance a bit and then work on getting it into RT. Looking at fast_demo.cc for an example, it looks like it should be pretty straightforward as far as code is concerned. I think the main problem is mapping RT's rawData into Halide's Image. I think, if array2D is laid out rows first, I can just wrap a buffer_t around the data pointer and construct an Image. If I can do that, I can pass it straight through to Halide and be done.

Can anyone more familiar with RT comment on how array2D and particularly rawData is laid out in memory?

And is there a way I can constrain the Bayer patterns that my demosaic will work with or can I assume they're all 2x2 square with B and R, G1 and G2 diagonally opposite?

heckflosse commented 9 years ago

array2D (and so rawData too) is a contiguous block of memory and the layout is rows first. In rtengine/rawimage.h there are functions to check the sensor layout. You should check for isBayer()
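A contiguous, rows-first block means element (x, y) sits at offset `y * width + x`, so the memory can be described without copying it. The sketch below shows that idea with a hypothetical `RowMajorView` struct; it is neither Halide's `buffer_t` nor RT's `array2D`, and it assumes there is no padding between rows.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning view over a contiguous, rows-first float image.
// It only stores a pointer and the extents; the pixels stay where they are.
struct RowMajorView {
    float* data;
    int width;
    int height;

    // Element (x, y) lives at offset y * width + x (no row padding assumed).
    float& at(int x, int y) const {
        assert(x >= 0 && x < width && y >= 0 && y < height);
        return data[static_cast<std::size_t>(y) * width + x];
    }
};
```

Writing through the view changes the original buffer, which is the property needed to hand existing image data to another library without a copy.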

bobobo1618 commented 9 years ago

How should I respond if the sensor isn't Bayer? Looking at the other demosaics (like AMAZE), it looks like they simply continue anyway?

Also I got the consent of the authors of this paper to contribute my implementation of their algorithm (I felt this was important).

I'm thinking of porting over the nyquist filtering from AMAZE as well though, since the main difficulty my implementation is having is on textiles.

heckflosse commented 9 years ago

That's checked in rtengine/rawimagesource.cc line 1785..

bobobo1618 commented 9 years ago

On the subject of performance, I've got my implementation running at ~7MP/s right now while AMAZE seems to run at ~3.2MP/s on my laptop. Do those numbers sound reasonable? Too low? Too high? I'm far from done when it comes to tuning but I'm wondering what I should aim for for this to be considered worthwhile.

heckflosse commented 9 years ago

@bobobo1618 On my 8-core AMD AMAZE needs ~ 600 ms for a D800 (36 MP) file, which is ~ 60 MP/s.

bobobo1618 commented 9 years ago

Wow, I must be doing something very wrong with RT then. How are you getting that number? I just switched to another demosaic in RT, zoomed out all the way, then switched back and timed how long it took to load again. Is there a better way to do it?

bobobo1618 commented 9 years ago

Oh, don't worry. I built RT in debug mode. That was silly of me. Back to performance I go...

heckflosse commented 9 years ago

Which revision of RT do you use? If you want to measure in RT just use the StopWatch. I.e. to measure Amaze insert this line: StopWatch measure("Amaze"); in line 40 of amaze_demosaic_RT.cc

You have to #include "StopWatch.h" too.

I always measure in the queue (put the image into the queue 7 times, start the queue) and take the median of these 7 values. Measuring in the queue is the best method because it avoids the influence of progress bar updates. Your 3.2 MP/s seems really very low. My old laptop (Pentium Dual-Core T4500@2.3 GHz) needs ~1.7 seconds for a D700 file (12MP), which is ~7MP/s.
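Taking the median of a handful of timings and turning it into MP/s is a small calculation; a sketch is below. The helper names are made up, and the timings in the usage note are invented example values, not measurements.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of a non-empty list of timings (mean of the two middle values
// when the count is even). The median ignores a single slow outlier.
double median(std::vector<double> values) {
    assert(!values.empty());
    std::sort(values.begin(), values.end());
    const std::size_t mid = values.size() / 2;
    return values.size() % 2 == 1 ? values[mid]
                                  : (values[mid - 1] + values[mid]) / 2.0;
}

// Megapixels per second for an image of `megapixels` processed in `ms`.
double mp_per_second(double megapixels, double ms) {
    assert(ms > 0.0);
    return megapixels * 1000.0 / ms;
}
```

For example, seven hypothetical timings of 610, 590, 600, 1200, 595, 605 and 598 ms have a median of 600 ms, which for a 36 MP file is 60 MP/s; the 1200 ms outlier does not move the result.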

Ingo

heckflosse commented 9 years ago

:)

bobobo1618 commented 9 years ago

I'm building from Git.

Seems I have a long way to go on performance. I'm at ~15MP/s now but I have a feeling I'll be able to push it past 60MP/s with a bit of work. Just need to figure out how CPUs work...

heckflosse commented 9 years ago

@bobobo1618 I mostly got the biggest performance gains by reducing memory transfers which often is accomplished by changing the layout of data in memory.

bobobo1618 commented 9 years ago

Okay, I haven't made many gains on performance but I've got the code into RawTherapee. Problem is that Halide won't compile when embedded in RT. When I take the exact same code (as in compile the exact same .cpp, inside RT's source tree) into another application, calling it with the same arguments, it runs fine. When I compile it into RT with the same compiler, same Halide library etc., it fails as per the Halide bug I filed.

I'm not intimately familiar with the internals of either of these projects but I've filed a bug with Halide and it'd be great if any of you have any clue what's going on, since the issue seems unique to RT.

bobobo1618 commented 9 years ago

(I have an example of the exact same code working without problems outside RT here)

bobobo1618 commented 9 years ago

Oh and I should be clear that the compilation failure is the JIT compilation of the Halide code, not the compiling of the RawTherapee binary (the compilation failure is at runtime). Everything builds and links at the C++ level fine.

bobobo1618 commented 9 years ago

Okay, from what the Halide guys are saying it seems it's likely due to Halide using a deep recursive stack during compilation.

I've AOT compiled the code and integrated it with RT but now the issue is extracting the data from Halide's buffer and getting it into RT's RGB planes. As I understand it I need to fill the red, green and blue array2D<float> objects. Halide's memory is laid out in row major planes of each colour in RGB order so what I've been doing is grabbing pointers to the start of each plane (the address of the first pixel) and trying to use that to build an array2D:

// output_image.address_of(x, y, c) returns a void *pointer.
float *redaddr = (float *)output_image.address_of(0, 0, 0);
float *greenaddr = (float *)output_image.address_of(0, 0, 1);
float *blueaddr = (float *)output_image.address_of(0, 0, 2);

red = *(new array2D<float>(W, H, (float **) &redaddr, 0));
green = *(new array2D<float>(W, H, (float **) &greenaddr, 0));
blue = *(new array2D<float>(W, H, (float **) &blueaddr, 0));

This currently returns EXC_BAD_ACCESS (a Mac segfault I believe) when the array2D constructor attempts to copy the contents (at the first instance).

I'm new to C so are there any obvious mistakes in the above?

heckflosse commented 9 years ago

Maybe this works?

// One pointer per row for each colour plane.
float *redaddr[H];
float *greenaddr[H];
float *blueaddr[H];
for(int i=0;i<H;i++) {
    // address_of(x, y, c): row i of a plane starts at x = 0, y = i.
    redaddr[i] = (float *)output_image.address_of(0, i, 0);
    greenaddr[i] = (float *)output_image.address_of(0, i, 1);
    blueaddr[i] = (float *)output_image.address_of(0, i, 2);
}
red = *(new array2D<float>(W, H, redaddr, 0));
green = *(new array2D<float>(W, H, greenaddr, 0));
blue = *(new array2D<float>(W, H, blueaddr, 0));
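The underlying idea is that a `float **` argument wants one pointer per row, not a pointer to a single plane pointer. A standalone sketch of building such row tables from a planar, row-major buffer is below; it assumes densely packed planes (no row padding), uses a hypothetical helper `plane_rows`, and deliberately involves neither Halide nor `array2D`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Build a table of `height` row pointers into plane `channel` of a planar,
// row-major buffer in which each plane holds width * height floats.
std::vector<float*> plane_rows(float* base, int width, int height, int channel) {
    assert(base != nullptr && width > 0 && height > 0 && channel >= 0);
    const std::size_t plane_size = static_cast<std::size_t>(width) * height;
    float* plane = base + plane_size * channel;
    std::vector<float*> rows(static_cast<std::size_t>(height));
    for (int y = 0; y < height; ++y) {
        // Row y starts y * width floats into its plane.
        rows[y] = plane + static_cast<std::size_t>(y) * width;
    }
    return rows;
}
```

The `data()` of each returned vector is then a valid `float **` with `height` entries, for as long as both the vector and the underlying buffer stay alive.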

bobobo1618 commented 9 years ago

It turned out that I wasn't initializing the output buffer object for Halide correctly. Now that I've fixed that, I'm having issues with Halide. Working with them to get over that now...

bobobo1618 commented 9 years ago

So got the Halide code working and got the buffer filling working (Halide code is now running in RT)! Halide example code (rather than docs and tutorials) fixed the buffer issues I was running into and @heckflosse's suggestion got the buffers filling properly.

So my implementation of the new algorithm, while slow, is now integrated with RT.

Next steps are:

~~Making the debayer configuration generic instead of hard coded to Sony/Olympus layout~~ (done)
Making it fast
Integrating the Halide AOT compilation with CMake
~~Removing Halide headers/libs from runtime~~ (done, inclusion of Halide generated code in a build is optional as well, since building will require Halide installed)

Once those are done I'll send a PR.

After that I'll look into more fun things like using the authors' followup enhancements (which they've suggested to me may help a bit with moire) and jamming Halide into more parts of the codebase.

iliasg commented 9 years ago

Fine !! Can you convert a Nikon D810 raw (sample from a sensor with no AA filter ..) to 16-bit tiff using RT's neutral profile and upload it? http://www.dpreview.com/reviews/image-comparison/download-image?s3Key=68bbe934a667407ebbda0b85267c8418.nef

bobobo1618 commented 9 years ago

Done!

Beep6581 / RawTherapee

Halide #2934
