DanBloomberg / leptonica

Leptonica is an open source library containing software that is broadly useful for image processing and image analysis applications. The official github repository for Leptonica is: danbloomberg/leptonica. See leptonica.org for more documentation.

Leptonica 1.83.0 breaks tesseract, which in turn breaks pdfsandwich #659

Closed swsch closed 1 year ago

swsch commented 1 year ago

Greetings.

After updating a Gentoo box to leptonica 1.83.0, pdfsandwich stopped working. Some experimenting let me pin the problem down to leptonica, as you can see in bug report #891833, which I filed in Gentoo's Bugzilla.

In short: the same install of pdfsandwich and tesseract fails with leptonica 1.83.0 while it works with 1.82.0.

The relevant parts of pdfsandwich's verbose output:

# pdfsandwich -lang deu -gray -verbose -o 'test.pdf' 20230116_095121_3.pdf
pdfsandwich version 0.1.7
Version: ImageMagick 7.1.0-48 Q16 x86_64 20449 https://imagemagick.org/
Compiler: gcc (12.2)
unpaper 7.0.0
tesseract 5.3.0
 leptonica-1.83.0
  libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.1.3) : libpng 1.6.39 : libtiff 4.5.0 : zlib 1.2.13 : libopenjp2 2.5.0
 Found OpenMP 201511
 Found libarchive 3.6.1 zlib/1.2.13 liblzma/5.2.9 bz2lib/1.0.8
 Found libcurl/7.87.0 OpenSSL/1.1.1s zlib/1.2.13 libidn2/2.3.4 nghttp2/1.51.0
GPL Ghostscript 10.00.0 (2022-09-21)
pdfinfo version 23.01.0
pdfunite version 23.01.0
...
Input file: "20230116_095121_3.pdf"
Output file: "test.pdf"
Number of pages in inputfile: 1
More threads than pages. Using 1 threads instead.
Processing page 1.
identify -format "%w\n%h\n"  "/tmp/pdfsandwich_tmp7852e4/pdfsandwich_inputfileb08248.pdf[0]"
convert -units PixelsPerInch  -colorspace gray -depth 8 -background white -flatten -alpha Off -density 300x300  "/tmp/pdfsandwich_tmp7852e4/pdfsandwich_inputfileb08248.pdf[0]" /tmp/pdfsandwich_tmp7852e4/pdfsandwich214f21.pgm
unpaper --overwrite  --no-grayfilter --layout none /tmp/pdfsandwich_tmp7852e4/pdfsandwich214f21.pgm /tmp/pdfsandwich_tmp7852e4/pdfsandwich20e86d_unpaper.pgm
Processing sheet #1: /tmp/pdfsandwich_tmp7852e4/pdfsandwich214f21.pgm -> /tmp/pdfsandwich_tmp7852e4/pdfsandwich20e86d_unpaper.pgm
[pgm_pipe @ 0x55b31216f9c0] Stream #0: not enough frames to estimate rate; consider increasing probesize
[image2 @ 0x55b31216f9c0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x55b31216f9c0] Encoder did not produce proper pts, making some up.
convert -units PixelsPerInch -density 300x300 /tmp/pdfsandwich_tmp7852e4/pdfsandwich20e86d_unpaper.pgm /tmp/pdfsandwich_tmp7852e4/pdfsandwich2d5f64.tif
OMP_THREAD_LIMIT=1 tesseract /tmp/pdfsandwich_tmp7852e4/pdfsandwich2d5f64.tif /tmp/pdfsandwich_tmp7852e4/pdfsandwich60ad08  -l deu pdf

Error in l_generateCIDataForPdf: cid not made from file
Error during processing.
ERROR: Command "OMP_THREAD_LIMIT=1 tesseract /tmp/pdfsandwich_tmp7852e4/pdfsandwich2d5f64.tif /tmp/pdfsandwich_tmp7852e4/pdfsandwich60ad08  -l deu pdf " failed.
Terminating pdfsandwich. All temporary files are kept.

After replacing 1.83.0 with 1.82.0, the same file is handled as expected:

# pdfsandwich -lang deu -gray -verbose -o 'test.pdf' 20230116_095121_3.pdf
pdfsandwich version 0.1.7
...
tesseract 5.3.0
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.1.3) : libpng 1.6.39 : libtiff 4.5.0 : zlib 1.2.13 : libopenjp2 2.5.0
...
Processing page 1.
identify -format "%w\n%h\n"  "/tmp/pdfsandwich_tmp93c02c/pdfsandwich_inputfile09c03e.pdf[0]"
convert -units PixelsPerInch  -colorspace gray -depth 8 -background white -flatten -alpha Off -density 300x300  "/tmp/pdfsandwich_tmp93c02c/pdfsandwich_inputfile09c03e.pdf[0]" /tmp/pdfsandwich_tmp93c02c/pdfsandwich09012e.pgm
unpaper --overwrite  --no-grayfilter --layout none /tmp/pdfsandwich_tmp93c02c/pdfsandwich09012e.pgm /tmp/pdfsandwich_tmp93c02c/pdfsandwich29d08d_unpaper.pgm
Processing sheet #1: /tmp/pdfsandwich_tmp93c02c/pdfsandwich09012e.pgm -> /tmp/pdfsandwich_tmp93c02c/pdfsandwich29d08d_unpaper.pgm
[pgm_pipe @ 0x562b4dcf59c0] Stream #0: not enough frames to estimate rate; consider increasing probesize
[image2 @ 0x562b4dcf59c0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x562b4dcf59c0] Encoder did not produce proper pts, making some up.
convert -units PixelsPerInch -density 300x300 /tmp/pdfsandwich_tmp93c02c/pdfsandwich29d08d_unpaper.pgm /tmp/pdfsandwich_tmp93c02c/pdfsandwich038143.tif
OMP_THREAD_LIMIT=1 tesseract /tmp/pdfsandwich_tmp93c02c/pdfsandwich038143.tif /tmp/pdfsandwich_tmp93c02c/pdfsandwich91968e  -l deu pdf
OCR pdf generated. Renaming output file to /tmp/pdfsandwich_tmp93c02c/pdfsandwich42b301.pdf

OCR done. Writing "test.pdf"
mv "/tmp/pdfsandwich_tmp93c02c/pdfsandwich42b301.pdf" "test.pdf"

test.pdf generated.

Done.

DanBloomberg commented 1 year ago

I believe the problem is in pdfio2.c, lines 569-570.

        if (!cid)
            return ERROR_INT("cid not made from file", __func__, 1);

Please remove those two lines and see if the test succeeds.
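
One plausible reading of why those two lines break the TIFF case, offered as a sketch rather than as leptonica's actual code: if the file-based path only handles some input formats, a null check placed right after it returns before a later image-based fallback can run. All names below (CompData, data_from_file, data_from_image, generate) are hypothetical stand-ins, and the fallback structure is an assumption that is not confirmed in this thread.

```c
#include <stdio.h>

/* Hypothetical stand-ins; these are NOT leptonica's real types or functions. */
typedef struct { const char *encoding; } CompData;

enum Format { FMT_JPEG, FMT_TIFF };

static CompData jpeg_data  = { "dct" };
static CompData flate_data = { "flate" };

/* Fast path (assumed): reuse the file's compressed bytes.
 * Only some formats qualify; TIFF does not in this sketch. */
static CompData *data_from_file(enum Format fmt) {
    return (fmt == FMT_JPEG) ? &jpeg_data : NULL;
}

/* Fallback (assumed): re-encode from the decoded image, for any format. */
static CompData *data_from_image(void) {
    return &flate_data;
}

/* Returns 0 on success, 1 on error.
 * early_check = 1 mimics keeping the two lines quoted above;
 * early_check = 0 mimics removing them. */
static int generate(enum Format fmt, int early_check, CompData **out) {
    CompData *cid = data_from_file(fmt);

    if (early_check && !cid) {
        /* Returns before the fallback below is ever tried. */
        fprintf(stderr, "Error: cid not made from file\n");
        return 1;
    }

    if (!cid)
        cid = data_from_image();
    if (!cid)
        return 1;

    *out = cid;
    return 0;
}
```

In this sketch, generate(FMT_TIFF, 1, &cid) fails with the same message as in the report, while generate(FMT_TIFF, 0, &cid) succeeds through the fallback. A JPEG input succeeds either way, which would be consistent with only some inputs being affected.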

swsch commented 1 year ago

Removing these lines allows processing of similar files without error, so the patch should be good.

Many thanks for the quick response.

DanBloomberg commented 1 year ago

Excellent. The fix is now in.

swsch commented 1 year ago

Will there be a point release including the patch? If not, I'll suggest adding the patch to the Gentoo package, so that 1.83.0 works there, too.

DanBloomberg commented 1 year ago

@stweil

It's a bit of work to make a patch release. I'll follow the advice of the tesseract maintainers, which is why I left this issue open for now.

stweil commented 1 year ago

Are you referring to a patch release 1.83.1? As the latest code is already prepared for 1.84.0, a patch release would need a 1.83 branch (I can add that if you want) and a list of the patches that should be added.

Which commits after 1.83.0 should be included in the patch release, too? I'd suggest these commits:

Are there others?

DanBloomberg commented 1 year ago

That's a nice offer, Stefan.

I can also do it without a branch, modifying 1.84.0 --> 1.83.1 and including all existing commits. Then wait a few days before changing 1.83.1 --> 1.84.0.

DanBloomberg commented 1 year ago

But on second thought, it might be easier for you. Those two commits are the only important ones.

stweil commented 1 year ago

See pull request #660 which adds the required changes for 1.83.1 to the new branch 1.83.

DanBloomberg commented 1 year ago

Much thanks, Stefan. Except for a patch on 1.81, this is the only patch that has been required for 5 years, since 1.75.

Closing this issue.