JonBCode / openjpeg

Automatically exported from code.google.com/p/openjpeg

opj_compress: 40% of encode time is spent freeing data #445

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
I am investigating compression performance on a 4096x2048 RGB image.

On my Core i7 under Windows, lossy compression with the default compression settings takes 4.389 seconds.

As a little experiment, I allowed the encoder to run with null data, i.e. no MCT, DWT, T1, or rate-distortion coding was performed.

Even then, the timing was a whopping 1.667 seconds.

Next, I ran the Very Sleepy profiler (http://www.codersnotes.com/sleepy) (sorry, Windows only). The profiler showed that 89% of that time was being spent in the "opj_tcd_code_block_enc_deallocate" method, freeing memory.

We need to investigate ways of reducing this time. One seemingly simple idea is to recycle the code blocks for use on the next frame when encoding a group of frames (a DCP workflow, for example).

I am going to look into this.   
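The recycling idea could be sketched roughly as follows. This is only an illustration of the concept: the pool type and function names below are hypothetical, not part of the OpenJPEG API.

```c
#include <stdlib.h>

/* Hypothetical recyclable code-block buffer: instead of freeing after
 * each frame, return the buffer to a free list and reuse it on the
 * next frame (assumes frames share the same image characteristics). */
typedef struct cblk_buf {
    struct cblk_buf *next;  /* free-list link */
    size_t capacity;        /* allocated payload size in bytes */
    unsigned char data[];   /* payload (flexible array member, C99) */
} cblk_buf;

typedef struct { cblk_buf *free_list; } cblk_pool;

/* Get a buffer of at least `size` bytes, reusing a pooled one if it fits. */
static cblk_buf *pool_acquire(cblk_pool *pool, size_t size)
{
    cblk_buf *buf = pool->free_list;
    if (buf != NULL && buf->capacity >= size) {
        pool->free_list = buf->next;
        return buf;
    }
    buf = malloc(sizeof *buf + size);
    if (buf != NULL)
        buf->capacity = size;
    return buf;
}

/* Return a buffer to the pool instead of calling free(). */
static void pool_release(cblk_pool *pool, cblk_buf *buf)
{
    buf->next = pool->free_list;
    pool->free_list = buf;
}

/* Free everything for real once all frames have been encoded. */
static void pool_destroy(cblk_pool *pool)
{
    while (pool->free_list != NULL) {
        cblk_buf *next = pool->free_list->next;
        free(pool->free_list);
        pool->free_list = next;
    }
}
```

With same-sized frames, a pooled buffer always fits, so after the first frame this scheme would no longer hit malloc/free at all in the steady state.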

Original issue reported on code.google.com by boxe...@gmail.com on 28 Nov 2014 at 8:14

GoogleCodeExporter commented 9 years ago
I have solved this issue: please see my exp branch on github

https://github.com/OpenJPEG/openjpeg/tree/exp

Because I have several commits in the branch, it is easier to see what I did 
on github than by looking at a single patch, or multiple patches.

With these fixes, for the use case where a user is compressing multiple files with the same image characteristics (same height, width, bpp, number of components), encoding is almost twice as fast. For the DCP use case, for example, this will help a lot.

Note: there may be a few memory leaks, because the trunk code assumes the tcd object is not being reused, so it may not free memory correctly. But if there is a leak, it must be minor: I watched memory usage over a long period of time and it did not grow.

Original comment by boxe...@gmail.com on 29 Nov 2014 at 4:13

GoogleCodeExporter commented 9 years ago
I have combined all of my changes into a single commit: attached is the diff.

Note: I have not run any regression tests, and my own testing involved default 
settings for irreversible compression.

Original comment by boxe...@gmail.com on 30 Nov 2014 at 1:39

Attachments:

GoogleCodeExporter commented 9 years ago
@Aaron,

Thanks for the patch. However, many lines are removed/added only because of space/tab alignment changes, which makes the patch hard to read. Do you think you could provide a "cleaner" version?

Original comment by m.darb...@gmail.com on 15 Dec 2014 at 7:21

GoogleCodeExporter commented 9 years ago
Hi Matthieu,
I've cleaned it up a bit. Let me know if this works for you.
Thanks,
Aaron

Original comment by boxe...@gmail.com on 16 Dec 2014 at 12:00

Attachments:

GoogleCodeExporter commented 9 years ago
By the way, you can find this patch on a branch of my github repo:

https://github.com/OpenJPEG/openjpeg/tree/double_encode_speed

Original comment by boxe...@gmail.com on 16 Dec 2014 at 12:02

GoogleCodeExporter commented 9 years ago
Aaron,

This diff is easier to read.
A few comments though.

It seems that NULL data is still allowed in this patch for all functions (mct, dwt, t1, ...). I guess it shouldn't be (and this is what causes most of the remaining "hard to read" parts of the patch).
The ABI and API are broken by this patch.
I do not feel comfortable applying it as is; maybe others can share their opinions?

Nevertheless, I think something should be done.
I guess the "easiest" thing to do without breaking the API/ABI is to allow recycling/reuse of some objects as you propose (like the tcd/code-blocks).

I did some profiling on my end and focused on the problems you raised. For this I used:
opj_compress -i OpenJpeg/data/input/nonregression/ElephantDream_4K.tif -o ElephantDream_4K.j2k -cinema4K
  Running Time  Symbol Name
8644.0ms 99.9%  main
8174.0ms 94.5%    opj_j2k_encode
7915.0ms 91.5%      opj_j2k_post_write_tile
 259.0ms  2.9%      opj_tcd_init_tile
 227.0ms  2.6%        calloc
   6.0ms  0.0%        malloc
   4.0ms  0.0%        _platform_bzero$VARIANT$Merom
   1.0ms  0.0%        opj_tgt_create
   1.0ms  0.0%          calloc
 359.0ms  4.1%    tiftoimage
  92.0ms  1.0%    opj_j2k_end_compress
  91.0ms  1.0%      opj_j2k_end_encoding
  90.0ms  1.0%        opj_tcd_destroy
  75.0ms  0.8%          opj_tcd_code_block_enc_deallocate
  56.0ms  0.6%            szone_free_definite_size
  13.0ms  0.1%            deallocate_pages
   5.0ms  0.0%            free
   1.0ms  0.0%            _os_lock_spin_unlock
  13.0ms  0.1%          deallocate_pages
   1.0ms  0.0%          szone_free_definite_size
   1.0ms  0.0%          opj_tgt_destroy
   1.0ms  0.0%        free_large
   1.0ms  0.0%      opj_j2k_write_eoc
  18.0ms  0.2%    opj_destroy_codec
  18.0ms  0.2%      opj_j2k_destroy
  18.0ms  0.2%        opj_image_destroy
  18.0ms  0.2%          free_large
   1.0ms  0.0%    opj_stream_destroy

The calloc calls in opj_tcd_init_tile (in fact in opj_tcd_code_block_enc_allocate, but inlined or "lost" by the profiler) are taking way too much time. malloc seems the way to go here if we don't actually need the buffer zeroed (I tried it, and malloc is much faster on my platform, but I did not check for non-regression).
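For illustration, the calloc-to-malloc swap boils down to only paying for the zero fill when something later actually depends on it. A minimal sketch, with a hypothetical helper name:

```c
#include <stdlib.h>

/* Sketch of the calloc -> malloc trade-off (helper name is hypothetical).
 * Skipping the zero fill is only safe when the caller fully writes the
 * buffer before reading it; otherwise calloc must be kept. */
static unsigned char *alloc_cblk_data(size_t len, int need_zero)
{
    if (need_zero)
        return calloc(len, 1);  /* zero-filled; pays for a memset */
    return malloc(len);         /* uninitialized; caller must fill it */
}
```

Whether the code-block buffers genuinely need the zero fill is exactly what the non-regression tests would have to confirm before such a change is merged.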

IMHO, a first patch should switch from calloc to malloc where possible, plus add reuse of the tcd/code-blocks. That would give a first level of optimization while keeping API/ABI compatibility.

I guess the time spent in those allocation functions is related to the number of times they are called (in the inner loop of opj_tcd_init_tile) rather than to the overall allocated size, so another option might be to allocate one bigger block that OpenJPEG then splits up itself (that might not be trivial, though).
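The "one bigger block" idea is essentially a bump (arena) allocator: one large malloc up front, carved into many code-block buffers, released with a single free. A minimal sketch, with hypothetical names and a fixed capacity:

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical arena: one large allocation split into many small ones.
 * Many allocations become pointer bumps; teardown is a single free(). */
typedef struct {
    unsigned char *base;  /* start of the big block */
    size_t size;          /* total capacity in bytes */
    size_t used;          /* bytes handed out so far */
} arena;

static int arena_init(arena *a, size_t size)
{
    a->base = malloc(size);
    a->size = size;
    a->used = 0;
    return a->base != NULL;
}

/* Hand out the next `n` bytes, kept pointer-aligned; NULL when full. */
static void *arena_alloc(arena *a, size_t n)
{
    void *p;
    n = (n + sizeof(void *) - 1) & ~(sizeof(void *) - 1);
    if (a->used + n > a->size)
        return NULL;
    p = a->base + a->used;
    a->used += n;
    return p;
}

static void arena_reset(arena *a)   { a->used = 0; }   /* reuse for next tile */
static void arena_destroy(arena *a) { free(a->base); } /* one free, total */
```

The non-trivial part, as noted above, would be sizing the arena correctly for a tile and making every code-block allocation in the inner loop go through it.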

Original comment by m.darb...@gmail.com on 16 Dec 2014 at 8:23

GoogleCodeExporter commented 9 years ago
Matthieu,

Thanks for reviewing. Did you see a significant speed-up on your system with this patch?

I think I could simplify this further: the code I added to support encoding without any data could be removed; I was only using it for debugging, to isolate the encoding slow-down.

What is left would allow the user to recycle the compress objects rather than destroy and re-create them for each frame. Do you see an issue with allowing recycling of compress objects?

Regarding the ABI/API changes, I am sure DCP encoder users would be more than happy to change a little code to get a performance improvement of this magnitude.

Original comment by boxe...@gmail.com on 16 Dec 2014 at 9:46

GoogleCodeExporter commented 9 years ago
Here is an even simpler patch.

Original comment by boxe...@gmail.com on 16 Dec 2014 at 10:04

Attachments:

GoogleCodeExporter commented 9 years ago
Aaron,

I previously did not test your patch, but instead simply profiled what you were pointing out. That is now done.
I'm in no way against recycling objects, and I think it should work.

Regarding the ABI/API changes, I think OpenJPEG needs one clean, radical change rather than many small ones (2.0 => 2.1 is a small change that requires maintaining two branches for a minimal feature gain; what you propose could be seen the same way, requiring three branches for few added features). See issue 439 for my opinion, which is, in a few words: a 3.0 is needed, designed in a way that allows easier enhancement integration without breaking the API/ABI.

That being said, any enhancement that can be done in 2.1 without breaking the API/ABI should be merged in quickly (still in my opinion). BTW, as you mentioned, more than 80% of the time this patch saves comes from tcd reuse, which can be done without API/ABI changes.
There are parts of your patch (those that do not rely on "owns_data" in opj_image_comp_t) that could probably be merged in.
The tricky part is that it has to work when passing different parameters, different image sizes, and so on. The test framework currently expects one image in, one compressed image out. It would probably be easy to test with different image sizes, given how opj_compress works. Different parameters are another matter, so more work is probably required on opj_compress to allow this (it could take a list of commands from an input file, for example).

I won't merge a patch I can't test. I know there are parts of OpenJPEG that aren't tested, but I won't take the responsibility of adding another one. Maybe you can ask Matthieu or Antonin whether limitations could apply to such an enhancement (e.g. recycling is OK as long as all parameters are unchanged, otherwise a crash could happen). The limitation could be enforced by checks (in opj_setup_encoder, for example).
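Such a guard could be as simple as comparing the cached image characteristics against the new ones before reusing anything, and falling back to a full reallocation on any mismatch. A sketch, with hypothetical struct and function names:

```c
#include <stdbool.h>

/* Hypothetical subset of the image characteristics that would all have
 * to match for recycling the encoder's tcd/code-blocks to be safe. */
typedef struct {
    unsigned width;
    unsigned height;
    unsigned numcomps;  /* number of components */
    unsigned prec;      /* bit depth per component */
} enc_shape;

/* Recycling guard: reuse cached objects only when every relevant
 * parameter is unchanged; otherwise the caller must reallocate. */
static bool can_recycle(const enc_shape *prev, const enc_shape *next)
{
    return prev->width    == next->width  &&
           prev->height   == next->height &&
           prev->numcomps == next->numcomps &&
           prev->prec     == next->prec;
}
```

This matches the limitation discussed above: recycling is permitted only when all parameters are unchanged, and the check makes it impossible to hit the crash case by accident.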

Original comment by m.darb...@gmail.com on 17 Dec 2014 at 7:23

GoogleCodeExporter commented 9 years ago
Hello Again,

First of all, let me say I do appreciate your testing these patches, and your 
feedback on this issue.

We can leave it up to the project chefs to decide how to proceed on this one.

Best Regards,
Aaron

Original comment by boxe...@gmail.com on 17 Dec 2014 at 7:41

GoogleCodeExporter commented 9 years ago
Hello Aaron, Matthieu,

Thanks to both of you for having investigated this. 
I agree with Matthieu concerning the way we should implement this speed 
enhancement.

A first patch should be done that does not break the API/ABI, with appropriate checks in order to know whether recycling is possible or not. This should be part of version 2.1.1, which I would like to release during February.

A second patch would be allowed to break the API/ABI, but would be part of OpenJPEG 3.0 (which will follow the suggestions in issue 439).

A question I have about this enhancement: is recycling done at the image level or at the tile level? In the former case, is there a way to recycle at the tile level, which would allow a speed optimization for single-image encoding with multiple tiles?

Thanks again for your work.

Original comment by antonin on 20 Jan 2015 at 11:18

GoogleCodeExporter commented 9 years ago
Hi Antonin,

The current patch handles multi-tile encoding. To get the speed boost, though, you need a workflow that encodes multiple images with the same image characteristics: same width, height, bit depth, etc. That is why I mentioned the DCP workflow. Single tile or multi-tile doesn't affect things.

Cheers,

Aaron 

Original comment by boxe...@gmail.com on 20 Jan 2015 at 2:08

GoogleCodeExporter commented 9 years ago
@Aaron: thanks for the clarification

@Aaron and Matthieu: do you think one of you could prepare a patch that does 
not break API/ABI but recycle objects in case of multiple image encoding ?

Original comment by antonin on 20 Jan 2015 at 5:11

GoogleCodeExporter commented 9 years ago
I'm afraid I don't have time to work on this at the moment.

Original comment by boxe...@gmail.com on 20 Jan 2015 at 7:33

GoogleCodeExporter commented 9 years ago
Well, I re-tested this patch on top of the latest trunk commit: unfortunately (or fortunately), I am now seeing only an 11% speed-up. Still nice to have, but not as dramatic as before.

Original comment by boxe...@gmail.com on 29 Jan 2015 at 7:33