bash0 / cewe2pdf

A python script to turn cewe photobooks into pdfs
GNU General Public License v3.0

Checking test results is a bit weak? #111

Open AnEnglishmanInNorway opened 3 years ago

AnEnglishmanInNorway commented 3 years ago

Basically we test that the expected number of pages is present, using pdfrw, which appears to have been unmaintained since 2018. Would it be worth analysing the result pdf with a package such as pdfminer.six, pikepdf or pypdf2/pypdf4? Or indeed other packages, see here.

I'm not entirely convinced about this idea, since any really detailed analysis would fail on the git checkin build, where cewe is not installed and some things used in test mcf files might be missing (in particular fonts, but also cliparts and other stuff which cewe and their franchises supply to help create the albums). But it would be useful on our local machines, I think, to see a more precise confirmation that our changes had not affected existing usage. Perhaps we can programmatically distinguish between the checkin build tests and our local tests.

Update: I had a quick look at a more detailed analysis of the single non-empty output page from testClipart, in which each of three different cliparts is displayed twice in a pdf viewer, so I expected six images. But just extracting the images is clearly not as simple as it appears - the number of XObjects with images differs depending on which tool is used to find them. pdfminer.six produced 18 image objects from the testclipart pdf, of which 8 were apparently invalid bmp files, 8 were four copies each of two of the images we actually see, and 2 were copies of the third. Other tools gave different numbers. And when I browsed the file structure with a graphic tool (see below) I could only find 5 XObjects with images! This series of articles (and others like it) indicates why - images are stored in a pdf-specific way, quite different from the originals used when creating the pdf, and may even be split up into tiles.

So ... perhaps a tool to compare complete pdf files would be a better bet. I have a personal license for BeyondCompare which I use to verify that a new version is identical, but that (like others) only compares the text. Something like that might work, but would not be much help with the graphics, which are after all rather important in a photo album.

A couple of useful tools I found along the road: PDFXplorer and the similar, but I think slightly better, PDFAnalyzer. And the free version of the Kiwi PDF Comparer is really worth looking at - it doesn't help with test automation, but it really helps to show visual changes between two pdfs. I'm certainly going to be using it to check the effect on test_simpleBook when I make changes.
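
On the idea of telling the checkin build apart from local runs, one possibility is to look at an environment variable. This is only a sketch under assumptions: it relies on the `CI` variable, which GitHub Actions and several other CI services set to `true`, and the names `running_on_ci` and `TestClipartDetails` are hypothetical, not part of cewe2pdf.

```python
import os
import unittest


def running_on_ci(environ=os.environ) -> bool:
    """Return True when the CI variable holds a truthy value.

    Assumption: the checkin build runs on a service that sets CI=true.
    """
    return environ.get("CI", "").lower() in ("1", "true", "yes")


class TestClipartDetails(unittest.TestCase):  # hypothetical test class
    @unittest.skipIf(running_on_ci(), "needs a local cewe installation (fonts, cliparts)")
    def test_detailed_content(self):
        # Detailed checks that only make sense where cewe is installed
        # would go here; on the checkin build this test is skipped.
        pass
```

The coarse page-count checks would stay unconditional, so the checkin build still verifies something.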

cweiske commented 4 months ago

I just tried to install all dependencies on a new Debian 12 Linux machine and found that I can no longer install pdfrw with apt. The package is unmaintained and has been removed from Debian: https://github.com/pmaupin/pdfrw/issues/191 and https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=958362

Switching from pdfrw to something else would therefore be a good idea. img2pdf switched to pikepdf.

There is a pdfrw fork, but it has had no commits for 2 years, so I wouldn't use it either: https://github.com/sarnold/pdfrw

AnEnglishmanInNorway commented 4 months ago

I'll move it to pikepdf; it's only a few statements. But the issue is still valid: our automated testing really is rather weak!
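
For reference, a page-count check with pikepdf could look roughly like this. It is a minimal sketch, not the code in the repository: `count_pages` is a hypothetical helper, and it uses only `pikepdf.open` and the `pages` list of the opened document.

```python
def count_pages(pdf_path):
    """Return the number of pages in the pdf at pdf_path."""
    # Third-party dependency (pip install pikepdf), imported here so that
    # a test module still loads on machines where pikepdf is missing.
    import pikepdf

    with pikepdf.open(pdf_path) as pdf:
        return len(pdf.pages)
```

A test would then compare `count_pages(...)` for the generated file with the expected number of pages.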

cweiske commented 4 months ago

Regarding testing: Maybe convert the pdf generated by the test into an image, then convert the expected pdf into an image and use an image diff tool to see if there are differences.

AnEnglishmanInNorway commented 4 months ago

> Regarding testing: Maybe convert the pdf generated by the test into an image, then convert the expected pdf into an image and use an image diff tool to see if there are differences.

Yes, maybe. It could be that 100% PDF equality would be nice but too strict - "visually equal" PDF documents can surely be represented in several equivalent ways within the range of capability of the PDF format. I use the Kiwi PDF comparer, which is fine for interactive use, but it occasionally shows differences which I can't actually "see". I wonder if that is a case in point. An image comparison might get round that problem, but ... I once worked on a big project where we were caught out using X11, which from one release to the next shifted everything one pixel sideways. No change to the naked eye, but every test we had broke!

There's a good conversation on the topic of pdf comparison here

cweiske commented 4 months ago

We could do a low-resolution comparison to get around pixel problems. When changes are made that affect the output, the expected pdfs could be regenerated.
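
A rough sketch of what such a low-resolution comparison might look like, in plain Python. It assumes both pdf pages have already been rendered to grayscale pixel grids (lists of rows, values 0 to 255) by some other tool; the rendering step is not shown. The function names, the block size and the tolerance are all illustrative assumptions, not tested against real albums.

```python
from typing import List

Grid = List[List[float]]  # rows of grayscale values, 0 (black) to 255 (white)


def downsample(pixels: Grid, factor: int) -> Grid:
    """Average each factor x factor block into a single value.

    Blocks at the right and bottom edges may be smaller.
    """
    height, width = len(pixels), len(pixels[0])
    result = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            block = [pixels[y][x]
                     for y in range(top, min(top + factor, height))
                     for x in range(left, min(left + factor, width))]
            row.append(sum(block) / len(block))
        result.append(row)
    return result


def max_difference(a: Grid, b: Grid) -> float:
    """Largest absolute difference between two grids of equal size."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images differ in size")
    return max(abs(va - vb) for ra, rb in zip(a, b) for va, vb in zip(ra, rb))


def visually_equal(expected: Grid, actual: Grid,
                   factor: int = 8, tolerance: float = 40.0) -> bool:
    """Compare two images at low resolution, allowing small differences.

    A one-pixel horizontal or vertical shift changes the average of a
    full-size block by at most 255 / factor (about 32 for factor 8),
    so the default tolerance sits just above that.
    """
    return max_difference(downsample(expected, factor),
                          downsample(actual, factor)) <= tolerance
```

With these defaults a one-pixel shift of a black bar on a white page still passes, while a missing bar fails. Subtle changes (a slightly different font, say) could slip through, so the tolerance would need tuning.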

AnEnglishmanInNorway commented 1 month ago

Wondering whether I should pick this one up again. I found a couple more useful tools:

I think pdfquery might be the one to choose if we wanted to go ahead, but I still have a feeling that using a visual comparison tool with a standard (and continually improving) test file such as unittest_fotobook is going to be the easiest choice at our level of usage. So I decided once again to just leave this issue open.