RealmeIP / openjpeg

Automatically exported from code.google.com/p/openjpeg
Other
0 stars 0 forks source link

Feature: Add OpenMP support #372

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
OpenMP will allow the library to use all available cores.
On my quad core system, OpenMP decreases encode and decode times
by about 40%.

Patch to follow shortly.

Original issue reported on code.google.com by boxe...@gmail.com on 5 Jul 2014 at 6:27

GoogleCodeExporter commented 8 years ago

Original comment by boxe...@gmail.com on 5 Jul 2014 at 6:30

GoogleCodeExporter commented 8 years ago
The cmake file has a variable called NUM_CORES, which should be set to the 
number
of cores on the system. Todo: pass this variable into the code via a config 
file,
and use it in the OpenMP call omp_set_num_threads.  I dont' know how to do 
this, so I left it.

Original comment by boxe...@gmail.com on 5 Jul 2014 at 6:31

GoogleCodeExporter commented 8 years ago
Now using standard _OPENMP ifdef.
Also, number of processes is now set in the cmake file.

Original comment by boxe...@gmail.com on 6 Jul 2014 at 1:11

Attachments:

GoogleCodeExporter commented 8 years ago

Original comment by antonin on 17 Jul 2014 at 7:29

GoogleCodeExporter commented 8 years ago
I think that support for OpenMP shall be OFF by default (it seems to be the 
case in the patched CMakelist). There's poor support for running OpenMP code 
from manually created pthreads (you can check this in libgomp's bugzilla). This 
might already be fixed in most recent linux distribs by now (I hope so anyway). 
This is definitively buggy on Android. Might be buggy on MacOS & mingw also.
Buggy means crash here.

Original comment by m.darb...@gmail.com on 17 Sep 2014 at 11:51

GoogleCodeExporter commented 8 years ago
Agreed about turning it off by default.

By the way, I was corresponding last month with a user who was trying the 
OpenMP branch out. His observations were:
///////////////////////////////////////////////////////////////////////////////
1) On Windows, after optimizing (Debug to Release, -O2 -O3 flags), speedup from 
openmp is hardly noticeable (~10% or less if at all, can't remember exact)
2) Although code runs on Linux, code is actually slower.
///////////////////////////////////////////////////////////////////////////////

He used mingw with optimizations turned on, while I used visual studio in 
release mode. So, someone needs to do some investigation as to whether this is 
really worthwhile or not.  Unfortunately, I don't have time to do this at the 
moment.

Original comment by boxe...@gmail.com on 17 Sep 2014 at 2:24

GoogleCodeExporter commented 8 years ago
Aaron,

Do you know if he was using multi-component images or not ?
From a quick-review of your patch, I'd say that DWT shall be 3 times faster on 
RGB images.
For T1 part, t1 is allocated/freed for each block. Allocation functions are 
blocking and might be a bottleneck here. You shall allocate NUM_CORES t1 before 
the loop & release them after the loop. You can access the proper t1 object in 
the loop using omp_get_thread_num()

Just a feeling reading your patch.

Original comment by m.darb...@gmail.com on 18 Sep 2014 at 10:23

GoogleCodeExporter commented 8 years ago
Hi Matthieu,

Thanks for taking a look. You're right about the DWT, but this is just a small 
fraction of total encoding time. Interesting hypothesis about allocation 
functions blocking. I tried moving the allocation out of the omp loop, and it 
was actually slower. However, I think we will need to do this anyways in order 
to return OPJ_FALSE if allocation fails. Unless there is a way of handling a 
callback in an OMP thread and aborting the loop. 

Anyways, on my core i7 3770 with windows 7 and visual studio 2012 in release 
mode, I verified that OMP encode is 40% faster than current trunk. So, I think 
this is worth doing.  The largest benefit of OMP is for decoding single images 
with many tiles. For encoding a stream of images, it is faster and easier to 
just launch multiple threads, and let the OS schedule work on multiple cores.

Original comment by boxe...@gmail.com on 18 Sep 2014 at 3:00

GoogleCodeExporter commented 8 years ago
Hi Aaron,

I did some quick tests with Y8 image encoding on a core i7 4770, win 7, vc10 
x86 (So I only modified t1_encode_cblks).
Using your patch (which I think leaks 1 t1 object per block if I'm not 
mistaken, but this must corrected on github I suppose), encoding is 35% faster.
Using allocation outside of the loop is 30% faster (between the 2, inside 
allocation is roughly 8% faster which I found weird).

I made a quick tweak to both solutions adding scheduling (dynamic) chunk size 
of 4. This lead to 40% and 38% faster.
I'll have to check this on linux/macos some day but other things to do right 
now. This test was quick enough to be done right away.

Regarding breaking from OpenMP loop, it's not possible but can be easily (but 
maybe at a performance cost TBD) worked around using a volatile variable or omp 
flush to check in the loop if something shall be done. This does not break the 
loop but make it "empty".

Original comment by m.darb...@gmail.com on 19 Sep 2014 at 11:29

GoogleCodeExporter commented 8 years ago
Hi Matthieu, Aaron,

Is there an updated patch to test (compared to comment #c3 above) ?

I test the provided patch (but had to add a line in CMakeLists.txt to link 
against libgomp) and obtained encoding 40% faster and decoding 16% faster on a 
win7 Core i7, MinGW (but NUM_CORES set to 4)

By the way, is there any way to add openmp support in visual studio express ? I 
did not find one and can therefore only test openmp with MinGW.

Cheers

Original comment by antonin on 19 Sep 2014 at 2:20

GoogleCodeExporter commented 8 years ago

Original comment by antonin on 19 Sep 2014 at 2:20

GoogleCodeExporter commented 8 years ago
Matthieu - thanks for looking at this.  I tried scheduling chunks of four, but 
it got slower - perhaps I was doing it wrong. Would you be able to send me a 
diff of the changes you made?

Antonin - I just pushed a change that fixes the memory leak, and also handles 
out of memory situation (by simply cleaning up and continuing in the loop.  Can 
you check out my github branch?  
https://github.com/OpenJPEG/openjpeg/tree/openmp

Otherwise, I can submit a patch....

Original comment by boxe...@gmail.com on 19 Sep 2014 at 4:07

GoogleCodeExporter commented 8 years ago
Aaron,

I did not keep my modifications. It was only a simple try out.
I'll try to get another look before the end of the week.

Original comment by m.darb...@gmail.com on 30 Sep 2014 at 9:22

GoogleCodeExporter commented 8 years ago
Thanks, Matthieu. I wouldn't spend too much time worrying
about these optimizations at this point.  Just need to get
existing code merged in. 

Original comment by boxe...@gmail.com on 1 Oct 2014 at 1:06

GoogleCodeExporter commented 8 years ago
[deleted comment]
GoogleCodeExporter commented 8 years ago
Per Aaron's request, here are my results testing OpenMP on a quad-core CPU 
using Ubuntu Linux and gcc with -O3 compilation: 
~.24 and ~.18 seconds for compress/decompress for the trunk and ~.33/~.28 for 
OpenMP branch (using a 769kB RGB image attached)

Changes to compile: 
CMakeLists.txt file, I had to change line 24 to: project(${OPENJPEG_NAMESPACE} 
C CXX)
CMake flags: sudo cmake -DBUILD_WITH_OPENMP="ON" -DCMAKE_C_FLAGS="-O3 -fopenmp" 
-DCMAKE_CXX_FLAGS="-O3 -fopenmp" .

I attached the full testing procedure I used in the attached text file. At the 
end of the file is my system properties. Either the problem has to do with 
Linux/compiler differences, OpenMP isn't properly working, or both. However, 
I've run a test OpenMP program with all four cores used properly.

Original comment by electric...@gmail.com on 21 Oct 2014 at 5:11

Attachments:

GoogleCodeExporter commented 8 years ago
[deleted comment]
GoogleCodeExporter commented 8 years ago
Unfortunately, I am only seeing around a 20% speedup using OpenMP. This is 
strange; I guess most of the decode time is not spent in t1. Anyways, I still 
think it is worthwhile to add this in as an option, and hopefully we can find 
other ways of improving the performance.

Original comment by boxe...@gmail.com on 9 Jun 2015 at 6:51

GoogleCodeExporter commented 8 years ago
I think what we are seeing is that T1 decoding is memory bound: adding more 
cores does not significantly increase performance, because the cores are 
memory-starved. Solution is to reduce memory usage in T1. Like carl's patch.

Original comment by boxe...@gmail.com on 10 Jun 2015 at 10:47