iskunk commented 5 years ago

I am building Leptonica for use in a software system where image I/O and compression are handled by other components. Therefore, I configure with

--without-zlib
--without-libpng
--without-jpeg
--without-giflib
--without-libtiff
--without-libwebp
--without-libopenjpeg

When I build in this manner, however, make check gives rather discouraging results:

FAIL: adaptmap_reg
FAIL: adaptnorm_reg
FAIL: affine_reg
FAIL: alphaops_reg
FAIL: alphaxform_reg
SKIP: baseline_reg
FAIL: bilateral2_reg
FAIL: bilinear_reg
FAIL: binarize_reg
[...]
============================================================================
Testsuite summary for leptonica 1.78.0
============================================================================
# TOTAL: 125
# PASS:  3
# SKIP:  18
# XFAIL: 0
# FAIL:  104
# XPASS: 0
# ERROR: 0

The adaptmap_reg.log file begins with


////////////////////////////////////////////////
////////////////   adaptmap_reg   ///////////////
////////////////////////////////////////////////
leptonica-1.78.0 : (null)
Error in pixReadStreamJpeg: function not present
Error in pixReadStream: jpeg: no pix returned
Error in pixRead: pix not read
Error in pixConvertRGBToGray: pixs not defined
Error in pixaAddPix: pix not defined
[...]

I think it would make more sense for the image files that are used in the library's test suite to be distributed in a format that the library can read without external dependencies, such as bmp or pnm. (Of course, this would not apply in instances where I/O involving formats that require external dependencies is the functionality being tested.)

DanBloomberg commented 5 years ago

If I were to have all the image files that are used for testing in uncompressed format, they would take up about 15 times the current disk space, which I've been careful to keep down to about 10 MB. The approx. 200 jpg, png and tiff compressed image files would balloon to an estimated 150 MB.

iskunk commented 5 years ago

As I understand, bmp format still does RLE, and the image files would be compressed within the distribution tarball (or .zip file) in any event. Would that improve the situation much?

Another possibility: add a set of Make rules that use some external tool (e.g. ImageMagick's convert utility) to convert the image files into bmp/pnm, combined with a fallback in the code to load in such a format if the normal jpg/png/tiff support is unavailable.

DanBloomberg commented 5 years ago

It doesn't make much sense to me.

It's hard to imagine a user who wants to run the test suite, but doesn't have the standard image i/o libraries. I have deliberately limited as far as possible the dependencies of leptonica on external libraries, but virtually all users will have png, gz, jpeg and tiff. I consider webp, jp2k and gif optional -- webp and jp2k because they are not yet widely used, and gif because in every way it is inferior to png.

You can make leptonica without libraries, and the I/O gets appropriately stubbed out.

I implemented the original (most basic) bmp utility, that has no compression. It's there because people sometimes use it, like pnm, but has nothing else to justify putting effort in.

One could rely on gzip or bzip2 to compress the tarball, but when it is uncompressed, you have a big expansion of files in the directory. Not a nice thing to do the the users.

And then the entire test suite would need to be converted to use these uncompressed files.

-- Dan

iskunk commented 5 years ago

The problem is not that those libraries are unavailable. (Although they could be, if you are building in a more unusual environment, like an embedded system.) The problem is that if you are preparing a build that deliberately does not use those libraries, then you don't have a way of validating that build. The image I/O is incidental to most of the tests, so in theory, having the images in a different format does not defeat their purpose.

(Of course, I could perform a separate build of the same source, with the external libraries enabled, and run the test suite using that. However, this doesn't actually validate the build that would be used, and could potentially let slip some unexpected error condition that results in part due to missing external libraries.)

How about the alternative option, then? I can provide a set of makefile rules that a user can invoke to automatically convert all the jpeg/png/tiff files into bmp/pnm. All that would be needed in the code, then, is that whenever an attempt to load e.g. foo.png fails, it then falls back to attempting foo.bmp (or pnm). This would leave the majority of users unaffected, while still allowing special-purpose builds to be run through the test suite.

DanBloomberg commented 5 years ago

It's too complicated. We support building using my hand-made makefiles, autoconf and cmake. How many libraries do you know that offer multiple ways to build?

You can build and run the tests using either the static makefiles or autoconf. cmake does not yet have the ability to run the tests after building (it's an old issue waiting implementation).

I can pretty well assure you that we're not going to complicate the github distro.

iskunk commented 4 years ago

I'm not sure I understand where you see the complexity coming from. The use of multiple build systems would only suggest a different approach; it's not a showstopper. I will sketch out an idea of this below to illustrate.

(Note that I can currently build and run the tests, as shown above, but nearly all the tests fail as they cannot load the test images. As I mentioned previously, I could produce a separate build of the tests, with the external libraries, and run those---but then that build is not usable to me, due to various legal and release-engineering issues. At present, the situation is that a Leptonica build that does not have external libraries enabled is un-testable; there is no way to validate it.)

Here is one way how I see this working:

A convert.sh shell script is added to the prog subdirectory. This script, when run inside its containing directory, calls some external tool (like ImageMagick's convert or some of the Netpbm utilities) to convert the various jpg, png and tiff files present into e.g. bmp format, under a different filename. (For example, foo.jpg results in a foo.jpg.bmp file.)
A new function is defined, something like pixReadWithFallback(), with the same signature as pixRead(). If it cannot load foo.jpg for whatever reason, it then attempts foo.jpg.bmp, and then as a final measure foo.jpg.pnm.
- This functionality could also be rolled into the existing pixRead(), with some mechanism to enable it at run-time, but I think you would likely prefer a separate/new function.
- An additional convention may be used to make multi-page TIFF files available in fallback form; e.g. .0.bmp, .1.bmp, and so on.
Tests are edited to use this new function instead of pixRead().
A user in my situation would need only know to run the convert.sh script before invoking make check, which should then complete without issue.

The majority of users who enable the external libraries as a matter of course are unlikely to even notice the added support for this scenario.

DanBloomberg commented 4 years ago

Pretty clever!

I don't want to put a fallback function into all the tests. However, as a compromise, suppose we make a script that converts those images files to bmp, but with the same name. Leptonica functions that read image data use the information in the header, not the filename extension, so a bmp file with a .png extension will be read properly as a bmp file.

Then we don't need to change the tests. You just run the script before the test.

iskunk commented 4 years ago

I noticed that quality of pixRead(). I don't like the idea of a file with extension X actually containing content Y, but if the tests are run in an out-of-source-tree build, then the files with the "wrong" extension can be created as temporary artifacts of the build. All that would be needed is the conversion script (which I can provide), and ensuring that the tests load files from the build tree in preference to the source tree.

I'll do some experimentation on this, and get back to you in a few days.

iskunk commented 4 years ago

Okay, I've put the converted files into place, but have encountered an issue:

$ ./adaptmap_reg 

////////////////////////////////////////////////
////////////////   adaptmap_reg   ///////////////
////////////////////////////////////////////////
leptonica-1.79.0 : (null)
Time for gray adaptmap gen:   0.008
Error in pixWriteStreamPng: function not present
Error in pixWrite: pix not written to stream
Error in findFileFormatStream: truncated file
Error in fopenReadStream: file not found
[...]

Do the tests need to write out intermediate files using an external format?

DanBloomberg commented 4 years ago

Ah yes, the write side of the issue.

Do the tests need to write out intermediate files using an external format?

Yes, they are designed that way. The full set of tests generates about 340 MB of image files in generate mode, half in /tmp/lept/regout/ and half in /tmp/lept/golden/. Then in compare mode, 170 MB of image files are written to /tmp/lept/regout/.

For example, the function regTestWritePixAndCheck() is useful because of its simplicity (you don't need to specify the temp file names). It compares files byte-for-byte. The decision about format for the files is made for maximum compression. If uncompressed files were used, we'd generate 5 or 6 GB of data! There are 1000 invocations of this function alone in the regression tests.

If instead of writing to file, we wrote to memory, we would have no files available to look at when the test fails.

iskunk commented 4 years ago

It's reasonable that more temporary space would be needed when uncompressed formats are the only ones available. Could we conditionalize the tests to not request png format when it is not compiled in?

DanBloomberg commented 4 years ago

You are persistent :-)

OK, I've done that. Need to handle png, jpeg and tiff_g4. Will push it out shortly.

iskunk commented 4 years ago

Just trying to find a way to make this work :-)

Okay, I've pulled in your changes, and am seeing an increase in the number of tests passing. There are still a number of tests failing with "function not present":

$ grep -l 'function not present' *.log | fgrep -v test-suite.log | while read x; do echo $x:; grep 'function not present' $x | sort -u; echo; done
adaptmap_reg.log:
Error in pixWriteStreamJpeg: function not present

alphaops_reg.log:
Error in pixWriteStreamJpeg: function not present
Error in pixWriteStreamPng: function not present

baseline_reg.log:
Error in pixReadStreamPng: function not present
Error in pixWriteStreamPng: function not present

blend3_reg.log:
Error in pixWriteStreamJpeg: function not present
[...]

I'm getting better success with conversion to pnm, though in general there seem to be some issues with the way the image is encoded. One example is feyn.tif not working correctly when converted to what appears to be a semantically equivalent bmp file. I have to investigate this further to better describe the issue, but it seems like there may be some interesting corner cases here.

DanBloomberg commented 4 years ago

These aren't corner cases -- there are 211 invocations of pixWrite() in the *_reg.c regression tests.

I can solve most of them by putting those same pre-processor statements in pixWrite().

DanBloomberg commented 4 years ago

done. Also included all variants of tiff compression supported by leptonica.

iskunk commented 4 years ago

Thanks, I'm seeing fewer failures (94 pass, 36 fail). I looked through a few of the latter:

baseline_reg fails due to passing GPLOT_PNG to gplotSimple1()
blend2_reg fails because it cannot find /tmp/lept/regout/blend2.14.png
boxa2_reg fails due to boxaPlotSizes() passing GPLOT_PNG to gplotCreate()
bytea_reg fails due to convertPdf() -> pixGenerateCIData() -> pixGenerateJpegData(). Is it feasible to have an uncompressed PDF encoding, for completeness? Else these PDF-related tests could be skipped when jpeg/zlib/etc. is unavailable.
ccthin2_reg fails due to l_bootnum_gen3() calling zlibUncompress(). Probably another one to skip when zlib is unavailable.
coloring_reg fails due to "pixa" fonts not working. Another probable skip (unless prog/fonts/chars-*.tif could be substituted).
enhance_reg fails due to src/bmfdata.h containing TIFF data. Another skip.

When you print e.g. "Warning in pixWrite: png library missing; output bmp format", you may want to add a newline, because in the test output logs the following line of output starts immediately after "format".

Lastly, after seeing the new TIFF logic, may I suggest something like

#define L_FORMAT_IS_TIFF(f) ((f) >= IFF_TIFF && (f) <= IFF_TIFF_ZIP)

? Seems like that same OR-statement crops up a few times...

DanBloomberg commented 4 years ago

Yes, I forgot the newline, which should always end the message string in the L_* macros.

And thanks for the L_FORMAT_IS_TIFF macro.

Can one make a macro of macros; e.g.,

define NO_EXTERNAL_IO_LIBS ???

iskunk commented 4 years ago

A macro can certainly call other macros; what did you have in mind?

Looking over src/imageio.h, specifically the IFF_* enums, I was thinking... you already have a couple of symbolic entries in there (IFF_UNKNOWN, IFF_DEFAULT). What if there were one like IFF_BEST, that would normally be equivalent to IFF_PNG, but can become IFF_BMP when necessary? Something like this could help abstract away a lot of the png references in the code.

This could potentially be taken further by combining it with a "symbolic" file extension like .best, which when seen on a filename, is automatically changed to .png or .bmp as appropriate.

DanBloomberg commented 4 years ago

I don't think it's necessary. At this point, the two high level image writing functions, pixWrite() and pixWriteMem(), are instrumented to fall back to bmp.

DanBloomberg commented 4 years ago

Just noticed that I should put the format changer in pixWriteStream(), not in pixWrite(), because pixWrite() calls pixWriteStream(). Will do this.

iskunk commented 4 years ago

The balance is now 101 pass, 29 fail.

Here are some suggested changes to the gplot code so that it can write pnm, and fall back to same when png is unavailable. (Additional tweaks are needed to get the tests passing; this is just a start.)

Some of the tests are going to need a fragment like

#if !defined(HAVE_LIBPNG)
  L_ERROR("This test requires libpng to run.\n");
  exit(77);
#endif

That will cause the test to return a result of SKIP in the "make check" run.

DanBloomberg commented 4 years ago

101 out of 130 seems GEFGW.

Thank you for the gplot patch. It doesn't save the tests, because we're writing to files named *.png, and gplot adds an appropriate extension, .pnm when outputting in pnm. But your fragment at least allows the failures to happen silently. Send me a list of tests that fail, and I'll add the fragment.

I was amused to see that the gplot pnm output has a lousy font for text, and is 200x bigger than the png output.

DanBloomberg commented 4 years ago

Skipped the regression tests that use gplot (they all emit png). You should now be at 101 pass, 17 fail.

DanBloomberg commented 4 years ago

Just realized that I can modify the gplot routines to return an image (png or pnm), simplifiying the procedure of generating it, and doing so without changing the gplotSimple1 (etc) interfaces.

Perhaps a weekend project.

iskunk commented 4 years ago

It's certainly getting better! Here is an initial set of tests that can be skipped:

Test name       Skip due to call to...
================    ==========================
boxa3_reg       pixaGenerateFontFromString
boxa4_reg       pixaGenerateFontFromString
ccthin2_reg     l_bootnum_gen3
colorcontent_reg    pixaGenerateFontFromString
coloring_reg        pixaRead
colorize_reg        pixaGenerateFontFromString
enhance_reg     pixaGenerateFontFromString
genfonts_reg        pixaGenerateFontFromString
insert_reg      pixaWrite
mtiff_reg       (multi-page TIFF functionality)
numa2_reg       pixaGenerateFontFromString
ptra2_reg       pixaWrite
quadtree_reg        pixaGenerateFontFromString
rankhisto_reg       pixaGenerateFontFromString
shear2_reg      pixaRead
subpixel_reg        pixaRead
writetext_reg       pixaRead

Function            Dependency
==========================  ==========
l_bootnum_gen3          zlib
pixaGenerateFontFromString  libtiff
pixaRead            libpng
pixaWrite           libpng

Separately: pngio_reg just needs the exit(77) bit.

This patch has a few further suggestions for the code:

gplot.c: Slight logic tweak so that output is still written to a .png file when a GPLOT_PNG -> GPLOT_PNM downgrade takes place. (This aside, having the gplot routines return an image object directly seems like it will be a better way to go. But for now, this should at least allow the gplot-using tests to be enabled.)
pdfio2.c: Implemented downgrade to L_FLATE_DECODE when zlib is available. This isn't a complete solution (there is no further fallback without zlib, not unless you implement an uncompressed PDF encoding), but zlib is usually available even when the other libs are not.

By the way, I ran into some trouble with the HAVE_LIB* conditionals in pdfio2.c when I noticed that the file was not #including config_auto.h. I would suggest adding that #include to every .c file that doesn't already have it. Even if a source file doesn't use HAVE_* et al., the config header could well define something like _XOPEN_SOURCE or function replacements that affect normal (non-preprocessor-oriented) code.

With the above change to pixGenerateCIData(), the pdfio1_reg test can be skipped when zlib is unavailable.

There are a few remaining tests needing attention, but let's get this low-hanging fruit out of the way first.

Re gplot font: I'm not conversant in Gnuplot, but you might find that tweaking the terminal type gives you better results. I assume getting the same pixels as png output should be possible.

DanBloomberg commented 4 years ago

Coincindentally, another collaborator is independently fixing some build warnings, and he has been replacing

if defined(xxx) ==> #if XXX

I believe you said that we should use #if defined(xxx) in general. Is that correct?

If we put

ifdef HAVE_CONFIG_H

include "config_auto.h"

endif / HAVE_CONFIG_H /

in every source file, will that solve the problem and allow us to use

if defined(XXX)

throughout?

DanBloomberg commented 4 years ago

I'd rather not do your gplot change with output_actual.

Instead my plan is to do this: In gplot, we can generate the pix directly with a single call, without having to specify a rootname and then generating the file name to read it back. And this can be done changing only gplotSimpleXY1(), ...XY2() and ... XYN() and adding new functions that call them: gplotSimplePix1(), gplotSimplePix2(), gplotSimplePixN()

Once we have these functions, the use (e.g., in the regression tests) will simply be through regTestWritePixAndCheck() We will not need to know the names of the temporary files that gplot creates.

iskunk commented 4 years ago

Working late today...

For HAVE_* symbols defined by Autoconf, yes: #if defined(HAVE_FOO) (or #ifdef HAVE_FOO) should be used. Because HAVE_FOO will take one of the following forms in the config header:

#define HAVE_FOO 1

...or...

/* #undef HAVE_FOO */

Using #if HAVE_FOO in the latter case only works due to compiler leniency (it's a common flub), and with -Wundef will print a warning.

(I suspect your collaborator is favoring #if due to what they saw in environ.h, interpreting that as the convention used by the project. The problem, of course, is that environ.h itself needs to change to follow the Autoconf convention.)

Adding config_auto.h to every source file isn't about resolving the #if/#ifdef issue, but about making all the HAVE_* symbols available in the first place. The problem I was having with pdfio2.c was that all my added code, which is surrounded by #if defined(HAVE_LIBZ), was not getting compiled in even though HAVE_LIBZ was correctly #defined to 1 in config_auto.h. (If I had incorrectly done #if HAVE_LIBZ, the result would not have been any different.)

Re gplot changes: Hiding the temporary file(s) sounds great. It's a bit more work, but ultimately a better approach.

DanBloomberg commented 4 years ago

Doesn't it make more sense to add those 3 lines to environ.h, which is included with allheaders.h in every source file?

stweil commented 4 years ago

I am the collaborater. @iskunk is right that #if defined(HAVE_FOO) or #ifdef HAVE_FOO should be used with the Autoconf macros, but that requires modifications in src/environ.h.

It is important that the definitions from config_auto.h are included in each source file which uses HAVE_FOO. Adding the three lines to src/environ.h and removing them everywhere else looks like a reasonable solution for me.

iskunk commented 4 years ago

Ah, @stweil, didn't know it was you :-) With this issue, I'm hoping to lay the groundwork to be able to run Tesseract's test suite successfully with a Leptonica build that has image libraries disabled.

I would still recommend #including the config header individually in every .c source file.

Yes, you can save yourself some typing by putting it in a common project header, and many projects do this. However, one important consideration for the config header is that it must be #included before any system header, because in theory it could #define feature test macros (e.g. _XOPEN_SOURCE) that are only effective if defined prior to any system #includes.

It is possible to do this in a common project header, but then it is not as obvious that the config header is coming first. And over time, this makes it more likely that someone accidentally places a system #include somewhere that breaks that requirement. It is easier to ensure that the #include is being done correctly if it is the first non-comment bit of code in the source file.

One other detail I neglected to mention previously: GNU recommends that the config header be pulled in with #include<> rather than #include"". Here is what they have to say:

With the appropriate -I option, you can use ‘#include ’. Actually, it's a good habit to use it, because in the rare case when the source directory contains another config.h, the build directory should be searched first.

DanBloomberg commented 4 years ago

Thank you for the explanation. I'll insert it as you recommend, in all source and program files.

On Sat, Nov 9, 2019 at 11:16 AM Daniel Richard G. notifications@github.com wrote:

Ah, @stweil https://github.com/stweil, didn't know it was you :-) With this issue, I'm hoping to lay the groundwork to be able to run Tesseract's test suite successfully with a Leptonica build that has image libraries disabled.

I would still recommend #including the config header individually in every .c source file.

Yes, you can save yourself some typing by putting it in a common project header, and many projects do this. However, one important consideration for the config header is that it must be #included before any system header, because in theory it could #define feature test macros (e.g. _XOPEN_SOURCE) that are only effective if defined prior to any system #includes.

It is possible to do this in a common project header, but then it is not as obvious that the config header is coming first. And over time, this makes it more likely that someone accidentally places a system #include somewhere that breaks that requirement. It is easier to ensure that the

include is being done correctly if it is the first non-comment bit of code

in the source file.

One other detail I neglected to mention previously: GNU recommends that the config header be pulled in with #include<> rather than #include"". Here https://www.gnu.org/software/autoconf/manual/autoconf-2.67/html_node/Configuration-Headers.html is what they have to say:

With the appropriate -I option, you can use ‘#include ’. Actually, it's a good habit to use it, because in the rare case when the source directory contains another config.h, the build directory should be searched first.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/DanBloomberg/leptonica/issues/454?email_source=notifications&email_token=AD7KMLFLX5RR5IPTAN25X7TQS4EB7A5CNFSM4JGAFAU2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDUNCQY#issuecomment-552128835, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD7KMLAF7ZWP5OOUMB2CMC3QS4EB7ANCNFSM4JGAFAUQ .

stweil commented 4 years ago

I wonder whether the conditional around #include <config_auto.h> is really needed. I don't think so, because both Automake and cmake builds provide that file, and builds without it would not work correctly.

DanBloomberg commented 4 years ago

My handbuilt makefile doesn't generate the file. environ.h handles the critical defines (for external libraries and special functions).

stweil commented 4 years ago

I see. Thank you.

iskunk commented 4 years ago

Tested with the latest changes. Will it be possible to re-enable the tests that use gplot? E.g. baseline_reg should now be able to pass (with minor changes).

blend2_reg currently fails because regTestWritePixAndCheck() writes blend2.14.bmp but the test program looks for blend2.14.png. Could that be addressed with something like format_actual?

DanBloomberg commented 4 years ago

I fixed blend2_reg to work with fallback to bmp. It's ugly but c'est la vie. To fix baseline2_reg, I need a new function gplotMakeOutputPix() that returns the pix of the plot. I will try to do this in the next few days, and push these changes.

Beyond that I'm unlikely to get around to doing much more for a few weeks, at least, although I can add some skips for you if any tests are still failing.

On Tue, Nov 12, 2019 at 9:09 AM Daniel Richard G. notifications@github.com wrote:

Tested with the latest changes. Will it be possible to re-enable the tests that use gplot? E.g. baseline_reg should now be able to pass (with minor changes).

blend2_reg currently fails because regTestWritePixAndCheck() writes blend2.14.bmp but the test program looks for blend2.14.png. Could that be addressed with something like format_actual?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/DanBloomberg/leptonica/issues/454?email_source=notifications&email_token=AD7KMLCICXX372RDHLWX2FLQTLPLPA5CNFSM4JGAFAU2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOED27OYQ#issuecomment-552990562, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD7KMLH65AVLRQIFDZ4EWADQTLPLPANCNFSM4JGAFAUQ .

iskunk commented 4 years ago

Did you push the changes to blend2_reg? (I'm not seeing any.) You didn't opt to have regTestWritePixAndCheck() just write bmp to a .png file? This isn't the only test that has this problem, after all...

I'll gladly try out the gplot-related updates once you have them.

There's still some work left. For now, we've been addressing the relatively trivial matter of tests expecting certain filenames. But once those are straightened out, there are a few tests that fail due to differences brought about by the test images being in bmp or pnm format---differences that should probably not be affecting the results, given that the format conversion presumably yields a semantically equivalent output.

I recall some tests complained about colormaps, and I think another one errored out when a 1-bpp image got converted to a 0-or-255 8-bpp file (due to format limitations). These failures should raise more interesting questions that what we've looked at so far.

DanBloomberg commented 4 years ago

No, not pushed yet. Likely today sometime. Thank you for offering to test, and I'll let you know when I do.

On Wed, Nov 13, 2019 at 12:12 PM Daniel Richard G. notifications@github.com wrote:

Did you push the changes to blend2_reg? (I'm not seeing any.) You didn't opt to have regTestWritePixAndCheck() just write bmp to a .png file? This isn't the only test that has this problem, after all...

I'll gladly try out the gplot-related updates once you have them.

There's still some work left. For now, we've been addressing the relatively trivial matter of tests expecting certain filenames. But once those are straightened out, there are a few tests that fail due to differences brought about by the test images being in bmp or pnm format---differences that should probably not be affecting the results, given that the format conversion presumably yields a semantically equivalent output.

I recall some tests complained about colormaps, and I think another one errored out when a 1-bpp image got converted to a 0-or-255 8-bpp file (due to format limitations). These failures should raise more interesting questions that what we've looked at so far.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/DanBloomberg/leptonica/issues/454?email_source=notifications&email_token=AD7KMLCJ4QFIUMFUBFGUEMTQTRNTRA5CNFSM4JGAFAU2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOED7PXDY#issuecomment-553581455, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD7KMLH3OGQVRRDWLHBUKTDQTRNTRANCNFSM4JGAFAUQ .

iskunk commented 4 years ago

Still seeing a failure in blend2_reg, but it is a different, more interesting one now.

One thing I've found: pnm format can't do RGBA (unless you resort to a non-official variant), so I've switched over my converted files to bmp. And a lot more tests are breaking now, unfortunately.

Here's one interesting example: adaptnorm_reg. The lighttext.jpg file is a greyscale jpeg. If I convert it to a "truecolor" 24-bpp bmp file, then pixContrastNorm() errors out because the image is not 8 bpp. If I convert it to a straight grayscale file, however, then the bmp file has a colormap, and pixContrastNorm() errors out because of that.

So what should happen here? It doesn't seem like there can be an 8-bpp greyscale bmp file without a colormap. Perhaps there is a way to recognize when the colormap is actually just an artifact of a greyscale encoding?

DanBloomberg commented 4 years ago

Leptonica supports pam, which is an official extension of pnm. You can verify that reading and writing pnm works on an RGBA image:

pix1 = pixRead("books_logo.png");
pixDisplay(pix1, 100, 100);
pixWrite("/tmp/pix1.pnm", pix1, IFF_PNM);
pix2 = pixRead("/tmp/pix1.pnm");
pixDisplay(pix2, 450, 100);

Yes, bmp is a crummy format; it doesn't have 8 bpp gray without a cmap. I really dislike supporting these lousy legacy encodings. Another example is gif, which doesn't even support RGB.

iskunk commented 4 years ago

Ah, I wasn't aware of pam. I've switched most of the files over to that now. (Should the library use IFF_PAM rather than IFF_PNM, then? That does appear to be a more versatile format.)

Some files (the four-color ones), I'm converting to bmp, as the tests seem to require a colormap.

However, I'm seeing this error crop up across many tests:

Error in pixcmapIsValid: invalid cmap n: 33 (nalloc = 16)

Does that look familiar? This instance is from loading weasel2.4c.png, converted to an equivalent bmp file.

DanBloomberg commented 4 years ago

I've never seen that error.

When I do this:

pix1 = pixRead("weasel2.4c.png");
pixWrite("/tmp/junk.bmp", pix1, IFF_BMP);

I get this output: Warning in pixWriteMemBmp: 2 bpp files can't be read; converting to 8 bpp

and 'fileinfo /tmp/junk.bmp' gives, as expected,


===============================================
Reading the header:
  input image format type: bmp
  w = 82, h = 73, bps = 8, spp = 1, iscmap = 1
===============================================
Reading the full image:
  input image format type: bmp
  w = 82, h = 73, d = 8, spp = 1, wpl = 21
  xres = 0, yres = 0
  colormap exists and has color values:
Pixcmap: depth = 8 bpp; 4 colors
Color    R-val    G-val    B-val   Alpha
----------------------------------------
  0        48       44       40      255
  1       232      240      232      255
  2       168      172      168      255
  3       184      144       96      255

===============================================

iskunk commented 4 years ago

Could you give this a try, and see if you get the same result? ("convert" is the ImageMagick utility.)

$ convert weasel2.4c.png -depth 2 -format bmp3 -compress None weasel2.4c.bmp

$ ./fileinfo weasel2.4c.bmp
Error in pixcmapIsValid: invalid cmap n: 33 (nalloc = 16)
Error in pixReadStream: invalid colormap
Error in pixRead: pix not read
Error in pixReadHeader: bmp: pix not read
Error in writeImageFileInfo: failure to read header of weasel2.4c.bmp

DanBloomberg commented 4 years ago

I ran the converter, and I believe it makes an invalid file. As stated in bmpio.c, 2 bpp is not a valid bmp file. Just checked with wikipedia:

In the original OS/2 DIB, the only four legal values of color depth were 1, 4, 8, and 24 bits per pixel (bpp).[4] Contemporary DIB Headers allow pixel formats with 1, 2, 4, 8, 16, 24 and 32 bits per pixel (bpp).[17] GDI+ also permits 64 bits per pixel.[18]

So it's a moving target. But there is no reason for leptonica to support anything beyond the original spec.

Running display on the image, it is clearly different in color from the original png. Same with xzgv. And xli doesn't support the 2 bit bmp.

Running ' identify -verbose' gives a 4 color, 16 entry colormap with 2 bit samples, which is not possible -- 2 bit samples can only support a 4 color colormap. So, again, it appears that even imageMagick hasn't bothered to produce a valid implementation of 2 bpp bmp.

Leptonica does the right thing -- it converts to 8 bpp on writing a 2 bpp pix in bmp format. However, on reading I should abort 2 bpp bmp. Will fix this.

Yesterday I added include of auto_config.h to all 288 prog files (!!). I now have all tests passing again (when libraries are available). Before adding, about 25 were skipped, and one failed with autotools and 'make check'.

iskunk commented 4 years ago

Hmm. I get the same error even if I drop the -depth 2:

$ convert weasel2.4c.png -format bmp3 -compress None weasel2.4c.bmp

$ ./fileinfo weasel2.4c.bmp 
Error in pixcmapIsValid: invalid cmap n: 33 (nalloc = 16)
Error in pixReadStream: invalid colormap
Error in pixRead: pix not read
Error in pixReadHeader: bmp: pix not read
Error in writeImageFileInfo: failure to read header of weasel2.4c.bmp

identify -verbose shows it to have 8-bit channels, with Depth: 4/8-bit. Shouldn't that be a valid bmp file?

DanBloomberg commented 4 years ago

Yes, it should read that bmp file. The read was failing because I'd hardcoded the size of the infoheader, so it got the wrong number for the number of colors in the colormap. Made a fix that uses the actual size of the infoheader.

iskunk commented 4 years ago

Thanks Dan. Current test summary on my end:

=======================
17 of 106 tests failed
(27 tests were not run)
=======================

I reviewed the remaining failures. Here are the ones that either need some filename-extension tweaking, or could (potentially) be turned into SKIPs due to missing libraries:

bytea_reg, pdfseg_reg appear to hard-depend on jpg support (for PDF encoding)
compare_reg, italic_reg, pdfio1_reg appear to hard-depend on zlib (for "flate" PDF encoding)
fpix1_reg, kernel_reg are looking for a png file in /tmp/lept/ when only a pnm file is present
pageseg_reg appears to hard-depend on tiff (for PDF encoding)
pdfio2_reg appears to hard-depend on jpg, tiff, and zlib (all for PDF encoding)
psio_reg appears to hard-depend on jpg (for Postscript encoding)
psioseg_reg appears to hard-depend on tiff (for Postscript encoding)

DanBloomberg / leptonica

Test suite fails without external image libraries #454

define NO_EXTERNAL_IO_LIBS ???

if defined(xxx) ==> #if XXX

ifdef HAVE_CONFIG_H

include "config_auto.h"

endif / HAVE_CONFIG_H /

if defined(XXX)

include is being done correctly if it is the first non-comment bit of code