PhilterPaper / Perl-PDF-Builder

Extended version of the popular PDF::API2 Perl-based PDF library for creating, reading, and modifying PDF documents
https://www.catskilltech.com/FreeSW/product/PDF%2DBuilder/title/PDF%3A%3ABuilder/freeSW_full

Cannot load PDFs produced by TeX #197

Closed sasozivanovic closed 8 months ago

sasozivanovic commented 10 months ago

PDF::Builder refuses to open files generated by TeX, yielding the following error:

PDF Integrity Check: Root object 11.0 not found!
Can't call method "verCheckOutput" on an undefined value at /usr/share/perl5/site_perl/PDF/Builder/Basic/PDF/File.pm line 1615.

I have tested several PDFs, with the same result (but a different root object number, obviously). The problem furthermore seems to affect all engines (tested: pdfTeX, LuaTeX, XeTeX) and formats (tested: LaTeX, plain TeX, ConTeXt). For LaTeX, I used the following code:

\documentclass{article}
\begin{document}
Hello, world!
\end{document}

Here is the resulting file, compiled with the default pdfTeX engine:

PhilterPaper commented 10 months ago

There is no object 11 in uncompressed.pdf. Well, there's something there, but the first line is commented out with %:

% 11 0 obj
<<
/Type /Catalog
/Pages 6 0 R
>>

There are several other objects similarly commented out. Is that how the PDF was produced, or did someone edit it?

There's no object 11 in doc.pdf, even though the /Root specifies it.

Neither PDF meets specifications, but apparently most Readers can get around the problems and patch them up enough to run. Can you (or anyone) show that these PDFs are fully legal?

PhilterPaper commented 10 months ago

Upon looking further, I see that object 11 (along with objects 2, 1, 7, 9, 4, and 6) is part of an Object Stream in object 5. This is apparently a legal structure in PDF 1.5, but unfortunately neither PDF::API2 nor PDF::Builder supports it (they support up to 1.4). I'm guessing that API2 simply didn't notice this odd construct, while Builder did (because it's trying to validate the file and report missing items).

I'm not sure what pdfTeX was trying to do here, but it's definitely not legal PDF 1.4. If that was a fatal error, perhaps I can put in a flag to make it merely a warning?

PDF 1.5
obj 3:  text object that outputs "Hello, world", parent is 2
obj 8:   embedded PS font definition, parent is 9
obj 10:  some sort of embedded PS resource definition? parent is 4
obj 12:  Info object (creator, date, etc.), parent is 13
obj 5:  object stream   (NOT supported by PDF::API2 or PDF::Builder). no explicit parent?
    obj 2: Page, parent is 6, content 3, resources 1
    obj 1: Font resource, child is 4, parent is 2
    obj 7: font width table, parent is 4
    obj 9: font descriptor, child is 8, parent is 4
    obj 4: T1 font info, child is 10, fontdescriptor is 9, widths is 7, parent is 1
    obj 6: Pages, parent is 11, child is 2
    obj 11: Catalog, parent is 13, child is 6
obj 13: cross reference stream, points to 11 as root and 12 as info. top level.
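
To make the layout above concrete, here is a minimal sketch of how an object can be "missing" at the top level and still exist inside an object stream. It is not code from PDF::Builder or PDF::API2: `objects_in_streams` is a hypothetical helper, the sample bytes are built by hand, and only unfiltered (uncompressed) object streams with a flat dictionary are handled. Real pdfTeX output normally compresses the stream, so it would need inflating first.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of PDF::Builder or PDF::API2.
# Returns a hash mapping each object number found inside an *unfiltered*
# object stream to the number of the stream object that contains it.
sub objects_in_streams {
    my ($bytes) = @_;
    my %container;
    # Visit every top-level "N G obj ... endobj" in the data.
    while ($bytes =~ /(\d+)\s+\d+\s+obj\b(.*?)\bendobj/sg) {
        my ($num, $body) = ($1, $2);
        # Keep only stream objects whose dictionary says /Type /ObjStm.
        my ($dict, $data) =
            $body =~ /\A\s*<<(.*?)>>\s*stream\r?\n(.*)\r?\nendstream/s
            or next;
        next unless $dict =~ m{/Type\s*/ObjStm};
        next if $dict =~ m{/Filter};    # compressed data would need inflating first
        my ($n)     = $dict =~ m{/N\s+(\d+)};
        my ($first) = $dict =~ m{/First\s+(\d+)};
        next unless defined $n && defined $first;
        # The first /First bytes hold N pairs: "object-number byte-offset".
        my @pairs = split ' ', substr($data, 0, $first);
        $container{ $pairs[2 * $_] } = $num for 0 .. $n - 1;
    }
    return %container;
}

# Hand-built sample: object 12 is an ordinary top-level object, while
# objects 11 and 6 exist only inside the object stream held by object 5.
my @inner = (
    [11, '<< /Type /Catalog /Pages 6 0 R >>'],
    [6,  '<< /Type /Pages /Kids [2 0 R] /Count 1 >>'],
);
my ($index, $objects) = ('', '');
for my $o (@inner) {
    $index   .= "$o->[0] " . length($objects) . ' ';
    $objects .= "$o->[1]\n";
}
my $stream = $index . $objects;
my $sample = "%PDF-1.5\n"
    . "12 0 obj\n<< /Producer (pdfTeX) >>\nendobj\n"
    . "5 0 obj\n<< /Type /ObjStm /N 2 /First " . length($index)
    . " /Length " . length($stream) . " >>\nstream\n"
    . "$stream\nendstream\nendobj\n";

my %where = objects_in_streams($sample);
print "object $_ is inside object stream $where{$_}\n"
    for sort { $a <=> $b } keys %where;
```

A scanner that only looks for "11 0 obj" at the top level never sees the Catalog in this sample, which matches the "Root object 11.0 not found" symptom reported here.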

sasozivanovic commented 10 months ago

By default, TeX(Live) produces 1.5 output. I have tested Builder on 1.4 output, and it works fine there.

> If that was a fatal error, perhaps I can put in a flag to make it merely a warning?

It was fatal. But the real question is whether things will work if you make it a warning.

PhilterPaper commented 10 months ago

Well, Builder (and API2) can handle a limited number of PDF 1.5 constructs, such as cross reference streams. However, not all have been implemented, such as the object stream that your 1.5 sample included. They certainly won't work if the PDF claims to be 1.4-compliant, but as your sample said, it's 1.5 and Readers are happy with it. If I can't figure out how to fully handle object streams, I can at least make the integrity check dependent on the PDF version number (if 1.4 or lower, fatal error; if 1.5 or higher, just a note that there may possibly be a problem). At least, you would know if/when Builder failed to properly deal with the PDF file that the likely reason is that something 1.5 (or higher) level was found in there.

In the meantime, if you can specify PDF 1.4 level output from pdfTeX without losing any important features, that could be a good workaround. Since 1.5 is the default output for your tools, I should at least plan to support more 1.5 features if I can.

> But the real question is whether things will work if you make it a warning.

Depending on exactly what you are doing with the PDF (how it's being modified, etc.), the use of an Object Stream may or may not be a big problem. Apparently API2 was able to work with it, so Builder ought to be able to, too. It's just that Builder adds an integrity check that certain things are where they claim to be, and the 1.5 level Object Stream fooled it. Unless and until I'm sure that Object Streams are properly handled (no promises that I can do that), it should at least generate a warning (if 1.5 or higher) that Builder may not be able to handle it, and at 1.4 or lower, should be a fatal error (no Reader is going to accept it).
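
As a rough illustration of the version-dependent check proposed above (not the code that ended up in Builder.pm), the sketch below reads the `%PDF-M.m` header and picks a severity. Both function names are hypothetical, and the header is the only version source consulted here.

```perl
use strict;
use warnings;

# Hypothetical helper: return the "M.m" version from a "%PDF-M.m" header,
# or undef if the data does not start with one.
sub pdf_header_version {
    my ($bytes) = @_;
    return $bytes =~ /\A%PDF-(\d+\.\d+)/ ? $1 : undef;
}

# Hypothetical policy, following the proposal above: a missing object is
# fatal for PDF 1.4 or lower, and only a warning for 1.5 or higher.
sub missing_object_severity {
    my ($version) = @_;
    return 'fatal' unless defined $version;      # unknown version: be strict
    my ($major, $minor) = split /\./, $version;
    # Compare the parts as integers rather than treating "1.5" as a float.
    return ($major > 1 || ($major == 1 && $minor >= 5)) ? 'warning' : 'fatal';
}

for my $header ("%PDF-1.4\n", "%PDF-1.5\n", "%PDF-2.0\n") {
    my $version = pdf_header_version($header);
    printf "PDF %s: missing root object is a %s\n",
        $version, missing_object_severity($version);
}
```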

vadim-160102 commented 8 months ago

The 2 lines of "error" in OP are 2 different messages, originating from completely unrelated places. Both messages falsely claim there are problems when in fact there are none.

The 1st message is from Builder.pm line #4467. The subroutine there scans the PDF file by splitting and regexing (which is not parsing by the spec) and fills the %objList hash with some values. Of course there is no entry in this hash for the key that is supposed to represent the Root/Catalog (though the $Root variable is initialized OK), hence the STDERR output at #4467, which should be ignored.

The 2nd message is a fatal error from PDF-Builder-3.025/lib/PDF/Builder/Basic/PDF/File.pm line #1615: by the time the method is called on a global variable, that variable has not yet been defined. In fact, using global package vars (there seem to be 2 of them)

our $myself;       # holds self->pdf
our $global_pdf;   # holds self ($pdf)

in Builder.pm to keep instance data doesn't look OK to me (regardless of this error).
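
A small sketch of the general concern, using made-up classes rather than anything from Builder.pm: when per-document state lives in a package variable, creating a second object silently changes what the first one sees, whereas state stored in the object itself stays separate.

```perl
use strict;
use warnings;

# Anti-pattern sketch: per-document state kept in a package variable.
package Sketch::GlobalState;
our $current_file;                    # one slot shared by ALL objects

sub new {
    my ($class, $file) = @_;
    $current_file = $file;            # overwrites any earlier object's value
    return bless {}, $class;
}
sub file { return $current_file }

# Alternative sketch: the same state kept inside each object.
package Sketch::InstanceState;

sub new {
    my ($class, $file) = @_;
    return bless { file => $file }, $class;
}
sub file { return $_[0]{file} }

package main;

my $g1 = Sketch::GlobalState->new('first.pdf');
my $g2 = Sketch::GlobalState->new('second.pdf');
print 'global state:   ', $g1->file, "\n";   # $g1 now reports second.pdf

my $i1 = Sketch::InstanceState->new('first.pdf');
my $i2 = Sketch::InstanceState->new('second.pdf');
print 'instance state: ', $i1->file, "\n";   # $i1 still reports first.pdf
```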

Moreover, opening sample-xrefstm.pdf, from nowhere else but Builder's own test suite, causes Builder to die for the same reason. And, it's funny, the test passes OK because, compared to the same test file from PDF::API2, there's an extra line which, in effect, initializes that very package variable. Ouch! Excuse me, it's just that I was here browsing this queue, and this thread was frustrating.

PhilterPaper commented 8 months ago

I'm not sure what your point is. As I discussed in previous posts here, the "missing" object turned out to be within an object stream, which is a PDF 1.5 feature. I think what I'll have to do is turn down the error level to a warning if the PDF version is 1.5 or higher. In the meantime, if the OP can use PDF 1.4 output, this problem won't occur.

PhilterPaper commented 8 months ago

OK, I think I've got it. I have improved the Integrity Check so that it no longer reports as an error a missing object that may be hidden in an object stream. Actually, even an 'error' found by Integrity Check is just informational (not actually fatal). The actual fatal error you got concerning verCheckOutput was because you were doing an "open()" before you had the top level "pdf" object created! Your test code should look more like this:

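use strict;
use warnings;

my $choice = 'B'; # A=PDF::API2, B=PDF::Builder
my $input = 'uncompressed.pdf';
my $output = 'out_unc.pdf';

my ($in, $pdf);
if ($choice eq 'A') {
    # 'require' loads only the chosen library, at run time
    # ('use' inside a branch would load both at compile time)
    require PDF::API2;
    $pdf = PDF::API2->new();
} else {
    require PDF::Builder;
    $pdf = PDF::Builder->new();
}
$in = $pdf->open($input);     # load the existing PDF
$pdf->import_page($in, 1);    # import its page 1 into $pdf
$pdf->saveas($output);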

Builder.pm and Basic/PDF/File.pm have been updated. It should run with a PDF-1.5 input now. You can close this issue if you are satisfied with the fix, or I can do it.

sasozivanovic commented 8 months ago

Hmm, for me, the issue persists, with the same document as in the first post, i.e. I still get the PDF Integrity Check error for a PDF-1.5 document; the other (verCheckOutput) error is gone. If I add \pdfminorversion=4 to the top of the file, to request PDF-1.4, all is fine.

PhilterPaper commented 8 months ago

Yes, you will still be told that Integrity Check can't find a requested object, but that it may be a false alarm due to the use of an Object Stream. There is also a message warning that the output version had to be bumped up to 1.5 due to a 1.5 feature (cross-reference stream) being found. I may consider silencing that second warning and just automatically and silently upping the version to 1.5. I don't want to silence the first warning, as (if your output is intended to be version 1.4) it may indicate a real failure.

The output I get follows. Do you get the same thing?

PDF Integrity Check: Root object 11.0 not found, but this may be the result of putting it in an Object Stream.
PDF version of requested feature 'importing cross-reference stream' is higher than current output version 1.4 (version reset to 1.5)

Anyway, you can regard these warnings as simply informational, and not preventing a valid output.

sasozivanovic commented 8 months ago

Gotcha! All good now. Many thanks for all!

I had another little scare in between, when it seemed that the verCheckOutput error was back. Then I realized that I get that error if I don't execute PDF::Builder->new(); before executing PDF::Builder->open('foo.pdf');. Is this intentional?

P.S. Option --library (with PDF::Builder as a choice, of course) is now implemented, and will be out in the upcoming release of my package Memoize.

PhilterPaper commented 8 months ago

Boo! (only 2 days after Halloween, so there!) Yes, "new()" initializes a lot of stuff and always must be called first. That's common practice in many packages (or it's implicitly called behind the scenes). Even PDF::API2 is that way (you've probably had some subtle errors without realizing it, if you called "open()" first). "open()" is just a method to load in an existing PDF file, and depends on certain things being initialized. I wonder if I should put in checks in various methods to see if a necessary predecessor (ultimately "new()") has in fact been called... this is the first time I've heard of someone trying to do this out of order.
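
For what it's worth, the kind of check being considered could look roughly like this. It is only a sketch on a made-up class (`Sketch::Guarded`), not PDF::Builder's actual methods: open() refuses to run unless it is invoked on an object that new() has set up, and says so clearly instead of failing later on an undefined value.

```perl
use strict;
use warnings;

package Sketch::Guarded;   # hypothetical class, not PDF::Builder itself
use Carp qw(croak);

sub new {
    my ($class) = @_;
    return bless { initialized => 1 }, $class;
}

# Refuse to run unless invoked on an object that new() has set up.
sub open {
    my ($self, $file) = @_;
    croak 'open() needs an object created by new()'
        unless ref($self) && $self->{initialized};
    $self->{file} = $file;
    return $self;
}

package main;

# Calling open() on the class name is caught with a clear message...
my $ok = eval { Sketch::Guarded->open('foo.pdf'); 1 };
print "class call rejected: $@" unless $ok;

# ...while the intended order, new() and then open(), goes through.
my $doc = Sketch::Guarded->new->open('foo.pdf');
print "opened $doc->{file}\n";
```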

Good to hear that Memoize will now support PDF::Builder. Thank you!

sasozivanovic commented 8 months ago

Yeah well, I guess it's just me, a stranger to the perly habits, but if I can call both methods from the PDF::Builder->, I assume that I can do that ;-) Seriously, though, one line in the documentation of open would have saved me quite a few hours.

PhilterPaper commented 8 months ago

As long as you're not at the Pearly Gates! :-) I'll look at the documentation to see if it can use some improvements/clarifications. Please be sure to look in examples/ for suggested usage.

PhilterPaper commented 8 months ago

I think I have improved things so that you only need to invoke the open() method, without also calling new() explicitly (open() creates a new PDF object, but was missing some code found in the new() method). I see that your example code used with PDF::API2 produces a PDF with two copies of page 1 -- assuming that's what you intended, with PDF::Builder it now does it without an explicit call to new(). The revised example:

use strict;
use warnings;

my $choice = 'B'; # A=PDF::API2, B=PDF::Builder
#my $input = 'uncompressed.pdf';
#my $output = 'out_unc.pdf';
my $input = 'doc.pdf';
my $output = 'out_doc.pdf';

my $pdf;
print STDERR "about to open input PDF\n";
if ($choice eq 'A') {
    # 'require' loads only the chosen library, at run time
    require PDF::API2;
    $pdf = PDF::API2->open($input);
} else {
    require PDF::Builder;
    $pdf = PDF::Builder->open($input);   # no explicit new() needed
}
print STDERR "about to import_page\n";
$pdf->import_page($pdf, 1);   # imports page 1 of the same document (second copy)
$pdf->saveas($output);

This produces output in PDF::Builder:

about to open input PDF
PDF Integrity Check: Root object 11.0 not found, but this may be
  the result of putting it in an Object Stream.
PDF version of requested feature 'importing cross-reference stream' is higher
  than current output version 1.4 (version reset to 1.5)
about to import_page

with only the first and last "about to..." being seen in PDF::API2. The other lines are informational.

I await your confirmation that this sounds like how you expected the code to work, before putting it in GitHub.

sasozivanovic commented 8 months ago

Indeed, this looks as I had understood the API at first sight. Many thanks!

Given that I only stumbled upon the "open" problem with Builder, can I assume that "open"ing straight away works in API2 as well?

PhilterPaper commented 8 months ago

Changing $choice to 'A' appears to work properly, so I would say "yes".

Don't forget that there's now no need to explicitly call new(), as open() creates the $pdf object.
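
In outline, the idea is that open() acts as a second constructor that runs the same setup as new(). The sketch below shows that pattern on a made-up class; the names (`Sketch::Doc`, `_init`, the hash keys) are assumptions for illustration and are not taken from Builder.pm, and nothing is actually read from disk.

```perl
use strict;
use warnings;

package Sketch::Doc;   # hypothetical class, not PDF::Builder itself

# Setup that every object needs, whichever constructor created it.
sub _init {
    my ($self) = @_;
    $self->{version} = '1.4';
    $self->{pages}   = [];
    return $self;
}

# Constructor for an empty document.
sub new {
    my ($class) = @_;
    return _init(bless {}, $class);
}

# Constructor for an existing file: runs the same setup as new(), so the
# caller does not have to call new() first. Nothing is read from disk here.
sub open {
    my ($class, $file) = @_;
    my $self = _init(bless {}, $class);
    $self->{source} = $file;
    return $self;
}

package main;

my $doc = Sketch::Doc->open('doc.pdf');
print "source=$doc->{source} version=$doc->{version}\n";
```

Keeping the shared setup in one routine means the two constructors cannot drift apart, which is the kind of gap described earlier (open() missing some code found in new()).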

sasozivanovic commented 8 months ago

> Changing $choice to 'A' appears to work properly, so I would say "yes".

I tested and agree.

So I await the change to be pushed to GitHub! Thanks for all!

PhilterPaper commented 8 months ago

> So I await the change to be pushed to GitHub! Thanks for all!

Ask, and thee shalt receive...

sasozivanovic commented 8 months ago

All good, thanks!