PhilterPaper / Perl-PDF-Builder

Extended version of the popular PDF::API2 Perl-based PDF library for creating, reading, and modifying PDF documents
https://www.catskilltech.com/FreeSW/product/PDF%2DBuilder/title/PDF%3A%3ABuilder/freeSW_full
Other
6 stars 7 forks source link

[RT 131147] Bug with missing properties including patch #110

Closed PhilterPaper closed 4 years ago

PhilterPaper commented 4 years ago

Fri Dec 06 13:02:11 2019 Klaus [...] Ethgen.ch - Ticket created From: Klaus Ethgen <Klaus [...] Ethgen.ch>

Hello,

the attached pdf will fail in importPageIntoForm as it does not have a "Rotate" property.

The patch also attached will fix that.

Regards Klaus -- Klaus Ethgen http://www.ethgen.ch/ pub 4096R/4E20AF1C 2011-05-16 Klaus Ethgen Klaus@Ethgen.ch Fingerprint: 85D4 CA42 952C 949B 1753 62B3 79D0 B06F 4E20 AF1C

gaforkarte_schweiz.pdf 0001-Work-around-an-missing-property-search.patch.txt

Fri Dec 06 19:02:11 2019 PMPERRY [...] cpan.org - Correspondence added

  1. Can you give some PDF::API2 code that shows the problem?

  2. How does this compare to RT 130722? Might it be the same problem?

PhilterPaper commented 4 years ago

Mon Dec 09 09:41:13 2019 futuramedium [...] yandex.ru - Correspondence added

There's "Parent" key in pages root with value of "null". Failing code:

perl -MPDF::API2 -e "PDF::API2->new->importPageIntoForm(PDF::API2->open('gaforkarte_schweiz.pdf'),1)"

With patch applied, e.g. adding a new page still fails:

perl -MPDF::API2 -e "PDF::API2->open('gaforkarte_schweiz.pdf')->page"

Failing line number is in familiar from couple years ago method "add_page", which I don't want to investigate again :)

Further, there are nulls as values of properties of outline item in the file. This PDF producer seems to add nulls in unexpected places, hard to tell if any other PDF::API2 methods will fail. PDF is still valid, because the Reference says:

"Specifying the null object as the value of a dictionary entry ... is equivalent to omitting the entry entirely."

I'd suggest patching with this in mind, i.e. ignore such entries on read and therefore don't preserve them on update. Replace line 480 (version 2.036) PDF/API2/Basic/PDF/File.pm:

#                $result->{$key} = $value;
                $result->{$key} = $value 
                    unless ref($value) eq 'PDF::API2::Basic::PDF::Null';

Won't help in (hypothetical) case of "null" being value of indirect object, but such files are yet to be seen.

PhilterPaper commented 4 years ago

Mon Dec 16 12:11:02 2019 PMPERRY@cpan.org - Correspondence added On Mon Dec 09 09:41:13 2019, vadimr wrote:

Failing line number is in familiar from couple years ago method "add_page", which I don't want to investigate again :)

Was that RT 121911 by any chance? That bug has been fixed in both API2 and Builder, so I'm wondering (if and) how this one is related.

I'd suggest patching with this in mind, i.e. ignore such entries on read and therefore don't preserve them on update.

Vadim, I applied your patch (File.pm) and it appears to fix the problem (both Perl one-liners you gave above). It also doesn't break any t-tests or examples in PDF::Builder. Note that I did NOT add Klaus's patch to Pages.pm -- is his patch made unnecessary by yours? As he has not provided any test code, I can't be absolutely sure that your patch alone fixes the problem.

Per my earlier question about RT 130722, this patch does not seem to affect it.

If Klaus gives his blessing to your patch (alone), I will release it in PDF::Builder. It's up to Steve as to what he wants to do with PDF::API2.

Mon Dec 16 12:18:18 2019 Klaus@Ethgen.ch - Correspondence added Am Mo den 16. Dez 2019 um 18:11 schrieb Phil M. Perry via RT:

I'd suggest patching with this in mind, i.e. ignore such entries on read and therefore don't preserve them on update.

Vadim, I applied your patch (File.pm) and it appears to fix the problem (both Perl one-liners you gave above). It also doesn't break any t-tests or examples in PDF::Builder. Note that I did NOT add Klaus's patch to Pages.pm -- is his patch made unnecessary by yours? As he has not provided any test code, I can't be absolutely sure that your patch alone fixes the problem.

Hmm, didn't I provide a (broken) PDF? I thought I had.

I had the trouble in the following part:

  sub import_pdf
  {
     my $dest     = shift;
     my $source   = shift;
     my $urlprefs = shift;

     my $pages = $source->pages;
     for (my $i = 1; $i <= $pages; $i++)
     {
        my $pref = {};
        $pref = $urlprefs->[0]  if exists($urlprefs->[0]);
        $pref = $urlprefs->[$i] if exists($urlprefs->[$i]);
        my $rot = $pref->{rotate} || 0;
        my $displayrot = $pref->{displayrotate} || $rot;
        my $trans = $transform{$rot} || [0, 0];

        my $page = $dest->page;
        $page->rotate($displayrot);
        my $gfx = $page->gfx;
        my $xo = $dest->importPageIntoForm($source, $i);
        $gfx->rotate($rot);
        $gfx->formimage($xo, $trans->[0] + ($pref->{x} || 0), $trans->[1] + ($pref->{y} || 0), $pref->{scale} || 1);
     } ## end for (my $i = 1; $i <= $pages...)

     return $pages;
  } ## end sub import_pdf

It is part of a dirty hack to collect pdf as well as plain text or images and put them together into a single pdf.

I will have a look at Vadims patch.

Regards Klaus -- Klaus Ethgen http://www.ethgen.ch/ pub 4096R/4E20AF1C 2011-05-16 Klaus Ethgen Klaus@Ethgen.ch Fingerprint: 85D4 CA42 952C 949B 1753 62B3 79D0 B06F 4E20 AF1C

Mon Dec 16 14:54:03 2019 PMPERRY@cpan.org - Correspondence added On Mon Dec 16 12:18:18 2019, Klaus@Ethgen.ch wrote:

Hmm, didn't I provide a (broken) PDF? I thought I had.

You provided a PDF (map of Switzerland) that loads correctly. That's the one that Vadim used in his two one-line examples. You provided a patch, but I didn't see any example code showing the problem you're reporting. That's what I was asking to see.

Let me play with the code you just listed, with and without Vadim's patch, and perhaps with and without your patch, and see what happens.

Mon Dec 16 15:54:06 2019 PMPERRY@cpan.org - Correspondence added I added the following code to drive your import_pdf() routine

use strict;
use warnings;
use PDF::Builder;

my @prefs; # array of hashes, one per page
    # rotate (default 0)
    # displayrotate  (default 'rotate' value)
    # x (default 0)
    # y (default 0)
    # scale (default 1)
my $source = PDF::Builder->open('gaforkarte_schweiz.pdf');
my $dest = PDF::Builder->new();
my %transform;
    # => rotate  (default [0,0])

$dest->mediabox('A4');

#$prefs[0] = {'scale' => 0.35};
$prefs[0] = {'rotate'=>-90, 'x'=>-600,'y'=>-100, 'scale' => 0.35};

import_pdf($dest, $source, \@prefs);
$dest->saveas("output.pdf");

with Vadim's patch, and it seems to produce a good PDF. Without his patch I got the error about the missing 'find_prop'.

rotate() is CLOCKWISE and must be a multiple of +/-90 degrees. The x and y offsets need to be adjusted to keep the image on the page.

PhilterPaper commented 4 years ago

Tue Dec 17 06:33:59 2019 futuramedium [...] yandex.ru - Correspondence added

How does this compare to RT 130722? Might it be the same problem? Was that RT 121911 by any chance? ...how this one is related. ...Klaus's patch to Pages.pm -- is his patch made unnecessary by yours?

Phil, RT 130722 was about malformed PDF with broken pages tree, and I remain of the same opinion: no correction was required on PDF::API2 side; that RT can not (and should not) be compared to others concerning valid PDF files; error message and/or source line number being the same (or close) is co-incidental and of no importance.

Klaus's patch would fix the issue with "importPageIntoForm" with attached PDF, but my 2nd one-liner was to show that, following the route of applying a fix close to exposed line number, the 2nd patch would be required to fix failing "page" call, with line number somewhere close to RT 121911. That's the only reason I mentioned it. No other connection.

Instead, my idea was to find more universal solution. The suggested patch should practically work OK, but I think some considerations are to be documented at least. The "null" object itself is not a scarecrow, it's very common in, e.g., explicit destination arrays. But, as value of dictionary entry, the only place in the Reference I found is quoted above. Does it mean such entries can be safely discarded? Can any entry, which is strongly "required", somehow finish having value "null"? I don't think so, but, what if, per quote above, we discard this entry, and "required" entry disappears from updated file? PDF ideology, which I agree with and which is NOT strictly adhered to by, e.g. PDF::API2, is don't touch entries you don't know what to do about. They can be from some future PDF specification :). I don't like the idea of throwing things away. But that's pure speculation.

Ideally, PDF::API2::Basic::PDF::Dict should access dict entries with e.g. "key_exists", which checks if value of this key resolves to "null" and would return "false", and a getter "get_value", which similarly returns same value (e.g. undef) for non-existing keys or if value is "null". In practice, it will never happen. Given the way PDF::API2 was designed to traverse pages tree, I think the cheapest (and rather safe, regardless of what said above) way to fix the present RT would be with my patch.

Tue Dec 17 09:08:23 2019 PMPERRY [...] cpan.org - Correspondence added

Vadim, I asked about 130722, as both bugs were referencing 'find_prop' in their error messages. It sounds like your opinion is that they are completely unrelated, and that 130722 is strictly due to corruption of the PDF produced by someone's editing of it. My invitation still stands for the OP to show that some commercial product produced the bad PDF, in which case it might be appropriate to put something in PDF::API2/Builder to deal with it. And of course, if PDF::API2/Builder created the corruption in the first place, it would have to be patched!

121911 was just a red herring ("scarecrow"?) in having a similar line number? I was thrown off the trail by your mentioning it.

If I read you correctly, you feel that your patch for this (131147) bug is more general than Klaus's patch, and there is no need to also apply his patch (harmless but now superfluous). Is that correct, or should his patch also be applied for some reason? I should probably copy-paste your explanation into the documentation somewhere, while I'm at it.

Finally, thank you to both of you for making the effort to track this down and create patches. Good job!

PhilterPaper commented 4 years ago

Wed Dec 18 18:02:33 2019 futuramedium [...] yandex.ru - Correspondence added

If I read you correctly, you feel that your patch for this (131147) bug is more general than Klaus's patch, and there is no need to also apply his patch

Correct, mine can possibly address other issues (unknown yet), and it's either of the two, but if you decide to choose more cautious way (Klaus's patch) rather than more radical, then also search where to (cautiously) patch so that "page" call (2nd one-liner) won't fail. As to documentation, maybe a link to this RT in a comment in source is more than enough.

...about 130722, as both bugs were referencing 'find_prop' in their error messages. It sounds like your opinion is that they are completely unrelated

They are related in a sense that "PDF::API2::Basic::PDF::Pages-> find_prop" method is called on object of unsuitable class -- PDF::API2::Basic::PDF::Objind (130722) or PDF::API2::Basic::PDF::Null (131147). But objects became blessed into these classes for unrelated reasons. When PDF is broken (130722) we can't exclude possibility of similar mismatch anywhere. In fact, it's a good thing that PDF::API2 fails early and noisily. Acrobat doesn't report any issue, but when I try to import a new page in UI (it was some time ago, IIRC insert one before the last... doesn't matter) then there were bizarre results in both visual feedback ("Pages" docker) and cryptic error message. Though at first file was opened as valid! Returning to 131147, I think with my patch, PDF::API2::Basic::PDF::Null is prevented from appearing in pages tree when bubbling up from a leaf to root. As to my previous concerns (is suggested patch too radical, can it break anything)... After some thought, I can imagine only some weird steganography would be disrupted, if any is ever used in PDF :) Otherwise, should be safe.

Thu Dec 19 02:47:46 2019 Klaus [...] Ethgen.ch - Correspondence added

Hi,

Am Do den 19. Dez 2019 um 0:02 schrieb Vadim Repin via RT:

If I read you correctly, you feel that your patch for this (131147) bug is more general than Klaus's patch, and there is no need to also apply his patch

Correct, mine can possibly address other issues (unknown yet), and it's either of the two, but if you decide to choose more cautious way (Klaus's patch) rather than more radical, then also search where to (cautiously) patch so that "page" call (2nd one-liner) won't fail. As to documentation, maybe a link to this RT in a comment in source is more than enough.

I have to say, preventing the Null value in the first place sounds reasonable for me. But as Vadim told, there might still be situations where the Null happens.

So another, maybe supplement patch would be to patch PDF::API2::Basic::PDF::Null class to have NULL getters for several methods (maybe even done with autoload). That would mean to return always undef (or the class itself, depending on the logic). You can even implement a warning here and a configurable switch to croak instead of carp when that happens.

Regards Klaus -- Klaus Ethgen http://www.ethgen.ch/ pub 4096R/4E20AF1C 2011-05-16 Klaus Ethgen Klaus@Ethgen.ch Fingerprint: 85D4 CA42 952C 949B 1753 62B3 79D0 B06F 4E20 AF1C

PhilterPaper commented 4 years ago

As Vadim's patch makes sense, and seems to fix the problem, I'm going to go ahead and close this ticket. It will be in PDF::Builder 3.017, to be released very soon. If something new shows up later on RT 131147, I will consider patching PDF::Builder.

PhilterPaper commented 4 years ago

Wed Feb 05 16:48:39 2020 steve [...] deefs.net - Correspondence added

Agreed with Vadim that dictionary entries with null values should be ignored, per the spec. I've updated the code accordingly.

As Vadim noted, neither his patch nor my version will handle a dictionary with an indirect object pointing to a null value, but I'm okay with waiting for that particular case to come up, hopefully with an associated patch.

Thanks for the bug report and the patch.

Wed Feb 05 16:48:40 2020 steve [...] deefs.net - Status changed from 'open' to 'patched'