PhilterPaper / Perl-PDF-Builder

Extended version of the popular PDF::API2 Perl-based PDF library for creating, reading, and modifying PDF documents
https://www.catskilltech.com/FreeSW/product/PDF%2DBuilder/title/PDF%3A%3ABuilder/freeSW_full
Other
6 stars 7 forks source link

[RT 113293] Corrupted PDF produced by importPageIntoForm #32

Closed PhilterPaper closed 7 years ago

PhilterPaper commented 7 years ago

Wed Mar 23 11:09:38 2016 melmothx [...] gmail.com - Ticket created Subject: Corrupted PDF produced by importPageIntoForm Date: Wed, 23 Mar 2016 16:09:15 +0100 To: bug-PDF-API2 [...] rt.cpan.org From: Marco Pessotto <melmothx [...] gmail.com>

Hi there!

It looks like I've hit a bug, but honestly I don't know where to start looking.

Attached you can find two (apparently) identical PDFs, one produced by xetex and one by luatex. When I import the one produced by luatex, the second page get lost.

Evince just displays an empty page, while mupdf says:

error: stack overflow warning: Ignoring errors during rendering mupdf: warning: Errors found on page

Code to reproduce:

#!perl
use strict;
use warnings;
use PDF::Builder;

for my $file (qw/xetex luatex/) {
    my $in = PDF::Builder->open("t/resources/$file.pdf");
    my $out = PDF::Builder->new;
    my $page = $out->page;
    my $gfx = $page->gfx;
    for my $p (1, 2) {
        if (my $included = $out->importPageIntoForm($in, $p)) {
            print "Including $p\n";
            $gfx->formimage($included, 100 * $p, 0)
        }
    }
    $out->saveas("t/resources/$file.out.pdf");
    $out->end;
    $in->end;
}

Please let know if I can be of any help here.

Best wishes

-- Marco # Subject: [rt.cpan.org RT 113293] Date: Thu, 24 Mar 2016 10:48:00 -0400 To: bug-PDF-API2 [...] rt.cpan.org From: Phil M Perry

Do you happen to know what PDF version is being output by XeTeX and LuaTeX? If they're greater than 1.3 (or maybe 1.4), they may have content that breaks PDF::Builder. If you can throw a switch somewhere to output PDF 1.3 on these programs, could you see if the problem persists? One thing PDF::Builder needs is work to bring it up to PDF 1.7 compatibility, so it can safely import/load PDFs produced by other programs. # Thu Mar 24 10:47:50 2016 The RT System itself - Status changed from 'new' to 'open' # Thu Mar 24 11:12:31 2016 melmothx [...] gmail.com - Correspondence added Subject: Re: [rt.cpan.org RT 113293] Date: Thu, 24 Mar 2016 16:12:17 +0100 To: bug-PDF-API2 [...] rt.cpan.org From: Marco Pessotto <melmothx [...] gmail.com>

XeTeX and LuaTeX usually give 1.5 output. With 1.4 output, things are fine. The files I sent are 1.5. The puzzling thing is that 1.5 produced by XeTeX is fine as well (as you can see from the tests), so yes, apparently luatex is including some content (which one is a good question) the module doesn't like (but doesn't throw an exception either).

-- Marco # Sat Mar 26 10:45:57 2016 steve [...] deefs.net - Correspondence added

Here's what I've found so far:

Using Perl 5.8.5 (the minimum supported version), your test script seg faults while trying to create luatex.out.pdf. It works fine with Perl 5.14.2 (my system Perl).

Looking through both PDFs using contrib/pdf-debug.pl, they're both basically identical until we get to the XObject (form image) on page 2. For luatex.out.pdf, instead of including object 13 (/Contents of page 2), it's including object 2 (an /ObjStm). That explains why the resulting PDF is broken.

Looking at luatex.pdf, object 13 is the second object in the file, so I'm wondering if the object stream contents are getting saved in the wrong slot and overwriting the page contents. It could just be coincidence, but it's where I'll be looking next. # Thu Jun 02 09:55:08 2016 steve [...] deefs.net - Correspondence added

Here's a brief update on this ticket: I spent a large chunk of yesterday trying to isolate the problem, but haven't gotten there yet.

Observations:

That's as far as I got before running out of time yesterday. It's possible I'll be able to work on it some more next week, but after that I'll be away until July. There will be another PDF::Builder release as soon as this issue gets fixed, so if anyone reading this wants to take the above and run with it, I'll happily give you full credit for a patch fixing the bug. :-) # Thu Jun 02 11:33:06 2016 melmothx [...] gmail.com - Correspondence added Subject: Re: [rt.cpan.org RT 113293] Corrupted PDF produced by importPageIntoForm Date: Thu, 02 Jun 2016 17:32:54 +0200 To: bug-PDF-API2 [...] rt.cpan.org From: Marco Pessotto <melmothx [...] gmail.com>

Hello Steve,

I'll try to give a shoot at it as well when I'll grab some brain-fresh time (probably over the weekend).

Best wishes

-- Marco # Fri Jun 03 15:08:04 2016 melmothx [...] gmail.com - Correspondence added Subject: Re: [rt.cpan.org RT 113293] Corrupted PDF produced by importPageIntoForm Date: Fri, 03 Jun 2016 21:07:53 +0200 To: bug-PDF-API2 [...] rt.cpan.org From: Marco Pessotto <melmothx [...] gmail.com>

Some notes, hoping it will ring any bell:

In the $ssk loop, it looks like the contents of the page changes, even if it's not directly passed to the function:

print "Including resource <$s_pdf $k $sk $ssk> $xo->resource($sk,$ssk,walk_obj($self->{apiimportcache}->{$s_pdf},$s_pdf->{pdf},$self->{pdf},$s_page->{$k}->{$sk}->{$ssk})\n";
{
   my  ($content) = $s_page->{Contents}->elementsof;
   print "Page content (before embedding $ssk $s_page->{$k}->{$sk}->{$ssk}) is " . ref($content) . "\n";
}

my $walked = walk_obj($self->{apiimportcache}->{$s_pdf},
                                     $s_pdf->{pdf},
                                     $self->{pdf},
                                     $s_page->{$k}->{$sk}->{$ssk});
{
   my  ($content) = $s_page->{Contents}->elementsof;
   print "Page content (embedding $ssk) is " . ref($content) . "\n";
}
$xo->resource($sk,$ssk,$walked);

And yields:

Page content (before embedding Im1 PDF::Builder::Basic::PDF::Objind=HASH(0x192e8b0)) is PDF::Builder::Content Realising objind 10 Realising objind 16 Realising objind 15 Page content (embedding Im1) is PDF::Builder::Content

Page content (before embedding F25 PDF::Builder::Basic::PDF::Objind=HASH(0x192df50)) is PDF::Builder::Content Realising objind 8 Realising objind 17 Realising objind 20 Realising objind 18 Realising objind 19 Realising objind 21 Page content (embedding F25) is PDF::Builder::Basic::PDF::Dict

Swapping the pages:

Page content (before embedding F25 PDF::Builder::Basic::PDF::Objind=HASH(0x36407e8)) is PDF::Builder::Content Realising objind 8 Realising objind 21 Realising objind 17 Realising objind 20 Realising objind 19 Realising objind 18 Page content (embedding F25) is PDF::Builder::Content

Page content (before embedding Im1 PDF::Builder::Basic::PDF::Objind=HASH(0x3601a80)) is PDF::Builder::Content Realising objind 10 Realising objind 15 Realising objind 16 Page content (embedding Im1) is PDF::Builder::Basic::PDF::Dict

With the xetex file, the page contents is always a Content object. That Dict is the object stream which causes the corruption.

So much for today

-- Marco # Tue Jun 07 13:31:25 2016 melmothx [...] gmail.com - Correspondence added Subject: Re: [rt.cpan.org RT 113293] Corrupted PDF produced by importPageIntoForm Date: Tue, 07 Jun 2016 19:31:08 +0200 To: bug-PDF-API2 [...] rt.cpan.org From: Marco Pessotto <melmothx [...] gmail.com>

Ok, I think at least I found where the problem is:

diff --git a/lib/PDF/API2/Basic/PDF/File.pm b/lib/PDF/API2/Basic/PDF/File.pm
index 0544aca..bd9fca7 100644
--- a/lib/PDF/API2/Basic/PDF/File.pm
+++ b/lib/PDF/API2/Basic/PDF/File.pm
@@ -545,14 +545,9 @@ sub readval {
         $value = $2;
         $str =~ s/^([0-9]+)$ws_char+([0-9]+)$ws_char+obj//s;
         ($obj, $str) = $self->readval($str, %opts, 'objnum' => $num, 'objgen' => $value);
-        if ($result = $self->test_obj($num, $value)) {
-            $result->merge($obj);
-        }
-        else {
             $result = $obj;
             $self->add_obj($result, $num, $value);
             $result->{' realised'} = 1;
-        }
         $str = update($fh, $str) if $update;       # thanks to kundrat@kundrat.sk
         $str =~ s/^endobj//;
     }

This patch appear to fix the issue in the testfile, but it's totally unclear if it has side effects or if it's doing the right thing.

Let's consider this a starting point.

-- Marco # Tue Jun 07 13:41:59 2016 melmothx [...] gmail.com - Correspondence added Subject: Re: [rt.cpan.org RT 113293] Corrupted PDF produced by importPageIntoForm Date: Tue, 07 Jun 2016 19:41:44 +0200 To: bug-PDF-API2 [...] rt.cpan.org From: Marco Pessotto <melmothx [...] gmail.com>

And some notes about the readval method, which is called recursively.

  1. From read_objnum we pass these options:

    ($object) = $self->readval($stream, %opts, objnum => $num, objgen => $gen, update => 0); However, in the readval code, I can't find an usage for objnum and objgen, so it seems to me they are ignored, but could be mistaken.

I suspect those values not obeyed is what lead to the caching problem.

  1. From read_objnum
  my $stream = "$num 0 obj" . substr($objects, $start, $length);
  ($object) = $self->readval($stream, %opts, objnum => $num, objgen => $gen, update => 0);

Unclear if that "$num 0 obj" should be "$num $gen obj" instead. I don't have a PDF with some object revisions, so dunno, just submitted to the attentions.

  1. In readval
  my $update = defined($opts{update}) ? $opts{update} : 1;
  $str = update($fh, $str);

While elsewhere we have $str = update($fh, $str) if $update; So at the begin of the method, update is called even if update => 0 is passed.

Point 2. and 3. are unrelated to the issue in the ticket, and are probably worth a comment in the code or in the doc to clarify it for the next souls wondering in this area of the code.

I'll continue digging on the issue over the next days, assuming I'll have some spare time to throw at this.

Cheers

-- Marco # Wed Jun 08 11:16:06 2016 melmothx [...] gmail.com - Correspondence added Subject: Re: [rt.cpan.org RT 113293] Corrupted PDF produced by importPageIntoForm Date: Wed, 08 Jun 2016 17:15:53 +0200 To: bug-PDF-API2 [...] rt.cpan.org From: Marco Pessotto <melmothx [...] gmail.com>

It looks like the patch is making the pdf.t going into very-very deep recursion, so no, definitively it's not doing the right thing. Kind of stuck here.

-- Marco # Thu Oct 06 23:05:54 2016 steve [...] deefs.net - Correspondence added

Found it. The problem actually got introduced when the cross-reference stream was originally read, due to this gotcha in readxrtr:

for $xmin ($start...$last) {
    # ...
}
# ...
$self->{' maxobj'} = $xmin;

The problem is that the "$xmin" in the for loop is an implicit local variable, so the $xmin outside the loop was still the default (1) instead of $last (27 in your file). When new_obj gets called later on, it increments maxobj to 2, then clobbers what was in that spot, which happened to be the object stream in your file. Everything we saw after that followed from that mistake.

The repo has the fix, and it'll be in the next release. And I'll be very happy to stop banging my head on this problem. :-) # Thu Oct 06 23:06:05 2016 steve [...] deefs.net - Status changed from 'open' to 'resolved' # Thu Oct 06 23:19:59 2016 steve [...] deefs.net - Correspondence added

  1. From read_objnum we pass these options:

    ($object) = $self->readval($stream, %opts, objnum => $num, objgen => $gen, update => 0);

    However, in the readval code, I can't find an usage for objnum and > objgen, so it seems to me they are ignored, but could be mistaken.

    I suspect those values not obeyed is what lead to the caching problem.

They do indeed seem to be ignored. This ended up being unrelated, though.

There doesn't seem to be anywhere else in the codebase that uses them either, so I'm removing them to avoid confusion later.

2. From read_objnum
 my $stream = "$num 0 obj" . substr($objects, $start, $length);
     ($object) = $self->readval($stream, %opts, objnum => $num, objgen => $gen, update => 0);
Unclear if that "$num 0 obj" should be "$num $gen obj" instead.

The generation number in object streams is always going to be zero, so even if objgen weren't being ignored, $gen would always be zero.

Quote

3. In readval
   my $update = defined($opts{update}) ? $opts{update} : 1;
    $str = update($fh, $str);
While elsewere we have $str = update($fh, $str) if $update; So at the begin of the method, update is called even if update => 0 is passed.

Hmm, good catch. It didn't end up mattering for this bug because we're passing the whole stream, so it was just adding garbage to the end of it, which is probably going to be the case any time we set $update to 0 anyway, but I've added the "if $update" to that line as well. # Thu Oct 06 23:23:58 2016 steve [...] deefs.net - Status changed from 'resolved' to 'patched' # Mon Oct 10 09:23:41 2016 steve [...] deefs.net - Status changed from 'patched' to 'resolved' # Mon Oct 10 09:23:42 2016 steve [...] deefs.net - Broken in 2.026 added # Mon Oct 10 09:23:42 2016 steve [...] deefs.net - Fixed in 2.029 added