cgohlke / imagecodecs

Image transformation, compression, and decompression codecs
https://pypi.org/project/imagecodecs
BSD 3-Clause "New" or "Revised" License

Use libjpeg-turbo for all Lossless JPEG bit depths #105

Closed SimonSegerblomRex closed 2 months ago

SimonSegerblomRex commented 5 months ago

Enabled by the solution to https://github.com/libjpeg-turbo/libjpeg-turbo/issues/768 (Planned to be included in libjpeg-turbo release 3.1.0.)

There are still some Lossless JPEG encoded images that libjpeg-turbo refuses to decode, see the discussions in:

Note to self while testing with local copy of libjpeg-turbo: Put this in the customize_build function used by setup.py:

libjpeg_turbo_path = "<path to libjpeg-turbo>"  # placeholder: local libjpeg-turbo checkout/build
EXTENSIONS['jpeg8']['sources'] = []
EXTENSIONS['jpeg8']['include_dirs'] = [libjpeg_turbo_path + "/src"]  # moved to src in the dev branch
EXTENSIONS['jpeg8']['library_dirs'] = [libjpeg_turbo_path]

and make sure to set

export LD_LIBRARY_PATH=<libjpeg_turbo_path>

before running any python script importing imagecodecs.
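One way to double-check that the locally built library is the one that actually gets loaded is to look at the process's memory map once the JPEG codec has been used. A minimal sketch, assuming Linux (`/proc/self/maps`); the helper name `loaded_libraries` is made up for illustration and is not part of imagecodecs:

```python
from pathlib import Path


def loaded_libraries(maps_text, needle="libjpeg"):
    """Return the sorted, unique paths in a /proc/<pid>/maps dump that contain needle."""
    paths = set()
    for line in maps_text.splitlines():
        # Line format: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6 and needle in fields[5]:
            paths.add(fields[5].strip())
    return sorted(paths)


if __name__ == "__main__":
    maps = Path("/proc/self/maps")
    if maps.exists():  # Linux only
        # Assumption: call a JPEG function from imagecodecs before this point,
        # so that the extension module and its shared library are loaded.
        print(loaded_libraries(maps.read_text()))
```

If the printed path is not inside the directory given in LD_LIBRARY_PATH, a different copy of the library was picked up.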

SimonSegerblomRex commented 5 months ago

This is WIP and will stay as a draft pull request until there's an official libjpeg-turbo release that includes the changes necessary.

cgohlke commented 5 months ago

Thanks. I am aware of the ongoing work in libjpeg-turbo. Note that the JPEG codec in imagecodecs switches to the LJPEG codec for bit-depths not supported by libjpeg-turbo.

SimonSegerblomRex commented 5 months ago

> Note that the JPEG codec in imagecodecs switches to the LJPEG codec for bit-depths not supported by libjpeg-turbo.

Yes, ljpeg_decode seems to work fine and will still be needed as backup in jpeg_decode for images that libjpeg-turbo refuses to decode due to the issues discussed in https://github.com/libjpeg-turbo/libjpeg-turbo/issues/586 and https://github.com/libjpeg-turbo/libjpeg-turbo/issues/765. ljpeg_encode shouldn't be needed any longer though.
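The fallback idea described here can be sketched roughly as follows. This only illustrates the pattern and is not the actual code in imagecodecs; `decode_with_fallback` is a hypothetical helper:

```python
def decode_with_fallback(data, decoders):
    """Try each (name, decoder) pair in order and return the first success.

    Returns a (name, result) tuple; re-raises the last error if every decoder fails.
    """
    last_error = None
    for name, decoder in decoders:
        try:
            return name, decoder(data)
        except Exception as exc:  # real code would catch the codec's own error type
            last_error = exc
    if last_error is None:
        raise ValueError("no decoders given")
    raise last_error


# Possible usage with imagecodecs installed (not executed here):
#   from imagecodecs import jpeg8_decode, ljpeg_decode
#   name, image = decode_with_fallback(
#       data, [("jpeg8", jpeg8_decode), ("ljpeg", ljpeg_decode)]
#   )
```

Returning the decoder name along with the result makes it easy to see which files still need the backup path.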

SimonSegerblomRex commented 5 months ago

I tested this with a 16-bit Lossless JPEG file as input:

import sys

from imagecodecs import imread, jpeg8_decode, jpeg8_encode
from numpy.testing import assert_array_equal

filename = sys.argv[1]

image = imread(filename)
if image.ndim > 2:
    image = image[..., 0].copy()  # copy to fix strides

for bit_depth in range(16, 1, -1):
    print(bit_depth)
    if bit_depth <= 8 and image.itemsize > 1:
        # FIXME: Should this really be necessary?
        image = image.astype("u1")
    enc = jpeg8_encode(
        image,
        lossless=True,
        predictor=1,
        bitspersample=bit_depth,
    )
    dec = jpeg8_decode(enc)
    assert_array_equal(image, dec)
    image <<= 1

It works, but the case with bit-depth <= 8 in a uint16 array should be handled in a better way.

EDIT: Fixed this with the check here.
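For context, `predictor=1` in the call above selects the simplest Lossless JPEG predictor: each sample is predicted from its left neighbour and only the difference is entropy-coded. A toy sketch of that idea on a single row follows. It leaves out Huffman coding and the handling of later rows, the function names are made up, and it is not how libjpeg-turbo implements it:

```python
def predictor1_diffs(row, precision):
    """Differences for one row, using the left neighbour as the prediction."""
    prev = 1 << (precision - 1)  # initial prediction at the start of the first row
    diffs = []
    for sample in row:
        diffs.append((sample - prev) & 0xFFFF)  # differences are taken modulo 2**16
        prev = sample
    return diffs


def predictor1_restore(diffs, precision):
    """Inverse of predictor1_diffs: rebuild the row from its differences."""
    prev = 1 << (precision - 1)
    row = []
    for diff in diffs:
        prev = (prev + diff) & 0xFFFF
        row.append(prev)
    return row
```

Because the step is exactly invertible, a round trip through encode and decode should reproduce the input, which is what the `assert_array_equal` check in the test script verifies.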

SimonSegerblomRex commented 5 months ago

(I replaced the broken dng*.ljp files that were created using my broken Lossless JPEG encoder.)

I did a quick benchmark comparing jpeg8_decode and ljpeg_decode. jpeg8_decode is about 40% faster using this input: Pentax-K-1-DNG-extracted.jpg (3696x4950, 2 components). (Note: Pentax DNG files are the only images I've found in the wild hit by this problem, so you need that patch to get past the "Bogus Huffman table definition" error.)
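A comparison like this can be reproduced with a small timing helper. This is a sketch under assumptions: the helper names are made up, and the original benchmark may have been measured differently:

```python
import timeit


def best_time(func, data, repeat=5, number=3):
    """Best-of-N wall time in seconds for a single call of func(data)."""
    timings = timeit.repeat(lambda: func(data), repeat=repeat, number=number)
    return min(timings) / number


def relative_speed(fast, slow, data):
    """Ratio of the best time of `slow` to the best time of `fast` on the same input."""
    return best_time(slow, data) / best_time(fast, data)


# Possible usage with imagecodecs installed (not executed here):
#   from imagecodecs import jpeg8_decode, ljpeg_decode
#   data = open("Pentax-K-1-DNG-extracted.jpg", "rb").read()
#   print(relative_speed(jpeg8_decode, ljpeg_decode, data))
```

Taking the minimum over several repeats reduces the influence of unrelated system load on the result.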

Everything seems to work as expected now, but I guess we should wait for an official libjpeg-turbo tag.

SimonSegerblomRex commented 5 months ago

I found this source containing a lot of Lossless JPEG files (embedded in DICOM files). A quick test shows that libjpeg-turbo and lj92 produce slightly different results for some of them, e.g., gdcm-JPEG-LossLessThoravision.dcm. BitsPerSample is 15 and in the decoded arrays there are values as high as 65520 for lj92 and 65535 for libjpeg-turbo... something weird is going on here (even considering that the decoded values are probably supposed to be reinterpreted as signed values or something). Do you have any input regarding this file @malaterre? EDIT: Solved by using gdcmraw to extract the JPEG file. Now this file behaves as expected both with lj92 and libjpeg-turbo.
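For reference, the range check behind this observation: with BitsPerSample of 15, unsigned samples should not exceed 2**15 - 1 = 32767, so both 65520 and 65535 fall outside the nominal range. A small sketch, with made-up helper names and plain Python ints standing in for the decoded array:

```python
def out_of_range(samples, bits_per_sample):
    """Return the samples that do not fit in an unsigned integer of the given width."""
    limit = (1 << bits_per_sample) - 1
    return [s for s in samples if not 0 <= s <= limit]


def as_signed16(value):
    """Reinterpret an unsigned 16-bit value as a two's-complement signed value."""
    return value - 0x10000 if value & 0x8000 else value
```

Reinterpreted as signed 16-bit values, 65535 becomes -1 and 65520 becomes -16, so the two decoders would still disagree on those samples.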

malaterre commented 5 months ago

> I found this source containing a lot of Lossless JPEG files (embedded in DICOM files). A quick test shows that libjpeg-turbo and lj92 produce slightly different results for some of them, e.g., gdcm-JPEG-LossLessThoravision.dcm. BitsPerSample is 15 and in the decoded arrays there are values as high as 65520 for lj92 and 65535 for libjpeg-turbo... something weird is going on here (even considering that the decoded values are probably supposed to be reinterpreted as signed values or something). Do you have any input regarding this file @malaterre?

@SimonSegerblomRex What do you get if you use thorfdbg/libjpeg ?

SimonSegerblomRex commented 5 months ago

> @SimonSegerblomRex What do you get if you use thorfdbg/libjpeg ?

With thorfdbg/libjpeg I get:

reading a JPEG file failed - error -1038 - invalid stream, found invalid huffman code in entropy coded segment

and that's probably the right thing. The images decoded by lj92 and libjpeg-turbo are completely broken, so it would have been better for them to fail as well rather than try to decode garbage.

SimonSegerblomRex commented 5 months ago

I found that lj92 fails to decode MARCONI_MxTWin-12-MONO2-JpegLossless-ZeroLengthSQ.dcm (just 0s out) while libjpeg-turbo decodes it without issues :+1: EDIT: After extracting and repairing the JPEG file using gdcmraw, it also decodes as expected with lj92.

malaterre commented 5 months ago

> > @SimonSegerblomRex What do you get if you use thorfdbg/libjpeg ?
>
> With thorfdbg/libjpeg I get:
>
> reading a JPEG file failed - error -1038 - invalid stream, found invalid huffman code in entropy coded segment
>
> and that's probably the right thing. The images decoded by lj92 and libjpeg-turbo are completely broken, so it would have been better for them to fail as well rather than try to decode garbage.

What kind of command did you use ?

% gdcmraw gdcm-JPEG-LossLessThoravision.dcm  /tmp/bla.jpg
% jpeg /tmp/bla.jpg /tmp/bla.pgm
jpeg Copyright (C) 2012-2018 Thomas Richter, University of Stuttgart
and Accusoft

For license conditions, see README.license for details.

0 bytes memory not yet released.

15905134 bytes maximal required.

4197 allocations performed.

SimonSegerblomRex commented 5 months ago

EDIT: Using the output from gdcmraw (that's actually not part of the DICOM file) I get the same output using all three decoders :+1:

First I just used this script to extract the JPEG file:

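import re
import struct
import sys

# JPEG markers: start of image, lossless (Huffman) frame header, end of image
SOI = struct.pack(">H", 0xFFD8)
SOF3 = struct.pack(">H", 0xFFC3)
EOI = struct.pack(">H", 0xFFD9)

with open(sys.argv[1], "rb") as f:
    data = f.read()

# The lookahead lets matches overlap; each match is the shortest
# SOI ... SOF3 ... EOI byte run starting at a given SOI.
pattern = b"(?=(" + re.escape(SOI) + b".*?" + re.escape(SOF3) + b".+?" + re.escape(EOI) + b"))"
for i, match in enumerate(re.finditer(pattern, data, re.S)):
    print(i)
    with open(f"{i}.jpg", "wb") as f:
        f.write(match.group(1))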

It seems like gdcmraw does some magic to repair the broken file.
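When comparing an extracted stream with the DICOM metadata, it can help to read the frame header directly. A minimal sketch that walks the marker segments up to the first SOFn segment and returns the sample precision, dimensions, and component count. The function name is made up, and the walker assumes every segment before the scan carries a length field:

```python
import struct


def read_sof(data):
    """Return (marker, precision, height, width, components) of the first SOFn segment.

    Returns None if no frame header is found before the scan data starts.
    """
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI marker")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker
            pos += 1
            continue
        (length,) = struct.unpack_from(">H", data, pos + 2)
        # SOF0..SOF15, excluding DHT (0xC4), JPG (0xC8), and DAC (0xCC)
        if 0xC0 <= marker <= 0xCF and marker not in (0xC4, 0xC8, 0xCC):
            precision, height, width, components = struct.unpack_from(
                ">BHHB", data, pos + 4
            )
            return marker, precision, height, width, components
        if marker == 0xDA:  # SOS: entropy-coded data follows
            break
        pos += 2 + length  # skip the marker and its segment
    return None
```

For a Lossless JPEG stream the returned marker should be 0xC3 (SOF3), and the precision can be compared with the BitsPerSample value discussed above.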

SimonSegerblomRex commented 5 months ago

This is ready for code review (but there's still no new libjpeg-turbo release or tag).

cgohlke commented 5 months ago

Thank you. I will review this when libjpeg-turbo 3.1 is released.

cgohlke commented 2 months ago

I have tested this with libjpeg-turbo 3.1 beta and it works as expected. The changes will be in the next release of imagecodecs along with some tweaks to make tests pass with libjpeg-turbo 3.0. Thank you.

SimonSegerblomRex commented 2 months ago

Thank you @cgohlke! Do you think you'll have time to publish the new release soon? If not, it would be great if you could push your changes to a dev branch. I would like to verify that my main use-case (with two components) still works.

cgohlke commented 2 months ago

The plan is to do a release before Python 3.13, this or next weekend. Hopefully there are no major issues on macOS.

I am attaching the current _jpeg8.pyx, which should be enough for you to test, no?

SimonSegerblomRex commented 2 months ago

> The plan is to do a release before Python 3.13, this or next weekend. Hopefully there are no major issues on macOS.

Thank you, sounds good!

> I am attaching the current _jpeg8.pyx, which should be enough for you to test, no?

Yes, I confirmed that it works 👍

cgohlke commented 2 months ago

Fixed in imagecodecs 2024.9.22.

SimonSegerblomRex commented 2 months ago

> Fixed in imagecodecs 2024.9.22.

Thank you for the new tag @cgohlke! It works well when I build imagecodecs from source against libjpeg-turbo 3.0.90beta. The wheel release was built against libjpeg-turbo 3.0.4, right? Things like encoding images with 14 bits bitspersample as 2 components still don't work when installing the wheel. Will you consider making a release built against libjpeg-turbo 3.0.90beta, or will you wait for the 3.1 release? EDIT: ...or is the system library dynamically linked? Then I need to check my environment.

cgohlke commented 2 months ago

The released wheels are built against libjpeg-turbo 3.0.4. I'll wait for version 3.1. No, the system JPEG library is not used. The dynamic JPEG library used is in the imagecodecs/libs directory, with a name like libjpeg-3a8ca8f3.so.8.
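To see which JPEG library a given installation bundles, one can list the matching files in that directory. A small sketch with a made-up helper name. The directory has to be passed in, since its exact location may differ between platforms and wheel builds:

```python
from pathlib import Path


def bundled_libraries(libs_dir, pattern="libjpeg*"):
    """Names of the files in libs_dir that match pattern, sorted."""
    libs_dir = Path(libs_dir)
    if not libs_dir.is_dir():
        return []
    return sorted(p.name for p in libs_dir.glob(pattern))


# Possible usage (not executed here). The path below is an assumption based on
# the comment above and may not match every installation:
#   import imagecodecs
#   print(bundled_libraries(Path(imagecodecs.__file__).parent / "libs"))
```

The hash in a name like libjpeg-3a8ca8f3.so.8 does not encode the libjpeg-turbo version, so the file name alone does not tell which release was bundled.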