Beep6581 / RawTherapee

A powerful cross-platform raw photo processing program
https://rawtherapee.com
GNU General Public License v3.0

Halide #2934

Closed bobobo1618 closed 6 years ago

bobobo1618 commented 9 years ago

Not sure if you guys are aware of this but Halide seems pretty cool. It's open source, MIT licensed (GPLv3 compatible? I'm not sure) and through a single API can generate code for "x86/SSE, ARM v7/NEON, CUDA, Native Client, and OpenCL".

Given that it enables simpler SSE implementation and easier use of CUDA/OpenCL, it seems like something that could be a good idea for performance.

If I get time to send pull requests implementing/porting image processing to Halide, are they likely to be approved?

iliasg commented 9 years ago

Wow!! Very impressive elimination of false colors and resolution at black-white transitions. And the best readability on the BW text I have seen so far .. better than LMMSE, which was my reference until now.

Although the zipper effect and aliasing at the colored circles are much worse than with Amaze .. something like LMMSE but a bit better. Looks like better color upsampling is needed :)

bobobo1618 commented 9 years ago

Okay, speed still isn't fantastic but it seems to be on par with AMAZE as far as UX is concerned on my machine. Just sent the pull request.

@iliasg the newer improvements the authors developed should help with that. I'll work on bringing those in after this is done :)

bobobo1618 commented 9 years ago

Couple of updates:

Next I'll try to build for Windows so I can test IGD on a GPU and see how that goes. I think I'll try to bring in the algorithm quality updates after that. Then I'll focus more on performance (don't want to put in too much effort until the alg. is finalized).

heckflosse commented 9 years ago

@bobobo1618 Amaze uses all cores of your CPU.

bobobo1618 commented 9 years ago

@heckflosse are there any special compile options or tools it requires (does it require GCC for example)? I'm testing on an i7-4770S (4 physical, 8 logical cores) now and I haven't seen it touch more than a single core (and I've seen IGD eat them alive).

AMAZE doesn't seem to have touched more than 2 cores. It also consistently takes ~1.6s for a 16MP photo and ~2.4s for a 36MP photo. I have a feeling I'm missing some compiler optimization or something, based on your earlier numbers (I don't think my CPU is that slow compared to yours?). IGD took ~1.8s for the 36MP image and ate all 8 logical cores alive.

I also tested the OpenCL version of IGD on a GTX970 and it took an average of ~1.2s for the 36MP image. Works on the real GPU though!

Beep6581 commented 9 years ago

OpenMP is enabled by default. Did you specify PROC_TARGET_NUMBER?

cmake -DCMAKE_BUILD_TYPE="release" -DPROC_TARGET_NUMBER="2" -DBUILD_BUNDLE="ON" -DBINDIR="." -DDATADIR="." -DCACHE_NAME_SUFFIX=4 .. && make -j8 install

bobobo1618 commented 9 years ago

I figured it out. Turns out Apple's bundled compiler, despite being labelled "version 7.0.0", bears no relation to upstream LLVM 3.7 (the first version to fully support OpenMP). In addition, I'm unable to compile Halide with GCC 5 (or link Halide to RT with GCC 5).

So it doesn't seem I can benchmark RT at the moment. I'll work on getting a Linux dev environment running at some point, but for now I'll just get the improved algorithm in and work on speeding it up.

bobobo1618 commented 9 years ago

Okay, got IGD running pretty consistently below 550ms now. The Array2D stuff was taking 300ms so I fixed that. The rest is down to Halide scheduling.

It'd be good if someone with a working OpenMP compiler could compare IGD to AMAZE.

So much for the quality improvements coming first, that'll be next.

heckflosse commented 9 years ago

@bobobo1618 I'm using gcc 4.9.x (OpenMP enabled). But unfortunately I don't know how to install Halide on Win64.

Ingo

bobobo1618 commented 9 years ago

They have Windows releases here. The process should be something like:

I'm going to try getting Fedora running on my desktop machine so I can play with the GPU though so I'll try compiling properly there as well. I just ran 20 consecutive iterations on my Macbook and got an average of 399ms for the D810 RAW (90MP/s) so I really want to see what a properly optimised GPU schedule can do.

heckflosse commented 9 years ago

@bobobo1618 Ok, thanks for the link. I'll try it this week.

bobobo1618 commented 9 years ago

Well, I got Fedora up and running on my Macbook and managed to compile RT with a working OpenMP compiler (GCC 5.1.1) using this config:

cmake .. -DHALIDE_PATH=$HOME/Build/halide/build -DOPTION_HALIDE_OPENCL=OFF -DCMAKE_CXX_FLAGS="-std=c++11" -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER=g++ -DCMAKE_C_COMPILER=gcc -DCMAKE_INSTALL_PREFIX=(pwd)/out -DPROC_TARGET_NUMBER=2

It came out at 1255ms for AMAZE and 461ms for IGD on the 36MP D810 sample (rates of roughly 29MP/s and 78MP/s respectively).

It did fully utilize all logical CPU cores this time.

Is there anything else I'm likely to be missing when it comes to AMAZE performance on my laptop or is it just that my laptop's slower than @heckflosse's?

Working on quality now anywho.
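The MP/s figures in this thread follow directly from the reported timings; as a quick sanity check (plain arithmetic, not code from the thread):

```python
# Throughput implied by the timings reported for the 36MP D810 sample.
megapixels = 36
timings_s = {
    "AMAZE (CPU, OpenMP)": 1.255,
    "IGD (Halide, CPU)": 0.461,
}

for name, seconds in timings_s.items():
    print(f"{name}: {megapixels / seconds:.1f} MP/s")

# AMAZE comes out at ~28.7 MP/s and IGD at ~78.1 MP/s,
# matching the rough rates quoted in the comments.
```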

iliasg commented 9 years ago

My ancient E8400 (2 cores, no hyperthreading, 3.0GHz, win32) https://www.cpubenchmark.net/compare.php?cmp[]=1780&cmp[]=2502&cmp[]=955 needs 4.5sec for Amaze on 36MP files.

xorgy commented 8 years ago

@bobobo1618 cool stuff. Is this sitting somewhere public now?

bobobo1618 commented 8 years ago

@xorgy Yup but I wasn't confident in my ability to maintain it in the future (C++ is very new to me and I'm no expert in this stuff) so the PR wasn't merged.

eszdman commented 1 year ago

> Hmm, to debayer (independent of the method used) 640 MP you need to transfer 640000000 floats = 2560000000 bytes over the PCIe bus to the graphics card, and 1920000000 floats = 7680000000 bytes back from the graphics card. To do that transfer in one second you have to move 10240000000 bytes ≈ 9766 MBytes per second, which means you need PCIe 3.0 x16 or PCIe 2.0 x32. And that doesn't include the time to debayer the data.
>
> Edit: I agree that for a debayer method like Amaze (which needs a lot more calculations than bilinear) a GPU version can be faster than fast multicore CPUs, but you always have to take transfer time into account, and Amaze is a lot harder to vectorize than bilinear.
>
> Ingo

It is not necessary to constantly copy data between pipeline passes if the whole pipeline runs on the GPU.
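The transfer-size arithmetic above can be double-checked in a few lines (the 4-byte float size and the 1-channel-in / 3-channels-out counts are taken from the figures in the comment):

```python
# Sanity check of the PCIe transfer arithmetic above:
# 640 MP of 4-byte floats in (1 Bayer channel), 3 channels of floats out.
FLOAT_BYTES = 4
pixels = 640_000_000

to_gpu = pixels * 1 * FLOAT_BYTES    # raw Bayer data: 2,560,000,000 bytes
from_gpu = pixels * 3 * FLOAT_BYTES  # RGB result:     7,680,000,000 bytes
total = to_gpu + from_gpu            # 10,240,000,000 bytes

# Moving this in one second needs ~9766 MB/s of bus bandwidth,
# which is roughly PCIe 3.0 x16 (or PCIe 2.0 x32) territory.
print(f"{total / 2**20:.1f} MB per second required")
```

This is only the bus-transfer cost for a single upload/download round trip; as the reply notes, keeping intermediate passes resident on the GPU avoids paying it between pipeline stages.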