PhilterPaper / Perl-PDF-Builder

Extended version of the popular PDF::API2 Perl-based PDF library for creating, reading, and modifying PDF documents
https://www.catskilltech.com/FreeSW/product/PDF%2DBuilder/title/PDF%3A%3ABuilder/freeSW_full
Other
6 stars 7 forks source link

[RT 131657] pdfapi2 regression 0ab55cd53 [Testcase] #112

Closed PhilterPaper closed 3 years ago

PhilterPaper commented 4 years ago

Mon Feb 03 06:05:43 2020 winter [...] bfw-online.de - Ticket created Subject: pdfapi2 regression 0ab55cd53 [Testcase] Date: Mon, 3 Feb 2020 11:49:28 +0100 To: bug-PDF-API2 [...] rt.cpan.org From: Leon Winter <winter [...] bfw-online.de>

Hi,

we are using your PDF::API2 module (via Debian) and just upgraded to a more recent version where we noticed a bug:

Deep recursion on subroutine "PDF::API2::Basic::PDF::Objind::release" at /usr/share/perl5/PDF/API2/Basic/PDF/Objind.pm line 123. at /usr/share/perl5/PDF/API2/Basic/PDF/Objind.pm line 122. PDF::API2::Basic::PDF::Objind::release(PDF::API2::Outline=HASH(0x583c1788)) called at /usr/share/perl5/PDF/API2/Basic/PDF/Objind.pm line 122

I could identify the commit 0ab55cd535b3b3ac7f616589d00ff00864626030 as culprit. Using the current version (at master) with this commit reverted we do not run into this bug. We use the module to modify a 200+-page PDF to add outlines and titles.

See my attached test case (Perl script) which I run on a empty 200+ page PDF file which I also attached.

I created the PDF file this way:

convert xc:none -page A4 a.pdf
 pdftk A=blank.pdf cat A A A A A A A A A A A A A A A A A A A A A A A A A A A A A
 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A
 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A
 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A
 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A
 A A A A A A A A A A A A A A A A A A A A A A A A A A output blank217.pdf

Then run the script with a current pdfapi2 in path, I did: PERL5LIB=pdfapi2/lib perl -w testcase.pl

Output after regression: Deep recursion on subroutine "PDF::API2::Basic::PDF::Objind::release" at pdfapi2/lib/PDF/API2/Basic/PDF/Objind.pm line 123. [...]

Output before regression (0ab55cd535b3b3ac7f616589d00ff00864626030): (nothing)

Thanks, Leon Modify message

PhilterPaper commented 4 years ago

Tue Feb 11 12:35:18 2020 futuramedium [...] yandex.ru - Correspondence added

Well, I had some time today, what follows is maybe not a patch "as is", rather invitation for a review. Plus, I didn't investigate why there were no warnings before mentioned commit -- just looking at current code (though I suspect it is related to efforts to prevent memory leaks, circular refs, etc., therefore there's no way back to "before that commit").

Reduced to SSCCE, the test would be:

use strict;
use warnings;
use PDF::API2;

my $pdf = PDF::API2-> open( 'blank217.pdf' );
my $outlines = $pdf-> outlines;
$outlines-> outline-> dest( $pdf-> openpage( $_ ))
    for 0 .. 100;
$pdf-> { pdf }-> release;

There are from 0 to quite a few warnings then, different from run to run, and only reason for that is "values" returns random order in line 115 https://metacpan.org/release/PDF-API2/source/lib/PDF/API2/Basic/PDF/Objind.pm#L115

Hm-m, OK. The Outlines tree is quite inter-connected, with Prev/Next links for intermediate/leaf, First/Last for root/intermediate, and Parent for all but root. I suspect, because of that, with just 101 bookmarks we easily (but not always) hit the deep recursion limit, then.

Actually, many PDF structures are similarly inter-connected -- e.g. consider a pages tree. I think, though didn't try to confirm, that a relatively (maybe a bit more) complex pages tree can trigger the same warnings when "released", because of Parent member.

I think, with Outlines tree, we don't have to follow any of Parent/First/Last/Prev/Next links to manually "release" them. For simple reason, that leafs/intermediate nodes are members of ' children' arrays. In fact, matter-of-fact autovivification of these arrays (https://metacpan.org/release/PDF-API2/source/lib/PDF/API2/Outline.pm#L131) is somewhat ...disturbing. OK, what I'm suggesting, is that said "Parent/First/Last/Prev/Next" values are weakened, and "weak" refs ARE NOT added to queue to be manually released.

For Outlines, they are released anyway because there are stored in ' children'. For Pages, any "Parent" is already weakened if I'm reading the code right. With patches applied, there are no warnings however many times I run the test. I hope there are no memory leaks then. For structures other than Outlines or Pages, if PDF::API2 ever deals with them, the solution would be then to weaken refs which are released through other links -- then we don't hit "deep recursion" for trees of any complexity. What do you think? :)

--- PDF/API2/Outline_old.pm     Thu Feb  6 01:08:40 2020
+++ PDF/API2/Outline.pm Tue Feb 11 20:16:25 2020
@@ -32,36 +32,43 @@
     $self->{'Prev'}    = $prev   if defined $prev;
     $self->{' api'}    = $api;
     weaken $self->{' api'};
+    weaken $self-> {'Parent'} if defined $parent;
+    weaken $self-> {'Prev'} if defined $prev;
     return $self;
 }

 sub parent {
     my $self = shift();
     $self->{'Parent'} = shift() if defined $_[0];
+    weaken $self-> {'Parent'};
     return $self->{'Parent'};
 }

 sub prev {
     my $self = shift();
     $self->{'Prev'} = shift() if defined $_[0];
+    weaken $self-> {'Prev'};
     return $self->{'Prev'};
 }

 sub next {
     my $self = shift();
     $self->{'Next'} = shift() if defined $_[0];
+    weaken $self-> {'Next'};
     return $self->{'Next'};
 }

 sub first {
     my $self = shift();
     $self->{'First'} = $self->{' children'}->

    if defined $self->{' children'} and defined $self->{' children'}->
    ;

+    weaken $self-> {'First'};
     return $self->{'First'};
 }

 sub last {
     my $self = shift();
     $self->{'Last'} = $self->{' children'}->[-1] if defined $self->{' children'} and defined $self->{' children'}->[-1];
+    weaken $self-> {'Last'};
     return $self->{'Last'};
 }
---------------------------------
--- PDF/API2/Basic/PDF/Objind_old.pm    Thu Feb  6 01:08:40 2020
+++ PDF/API2/Basic/PDF/Objind.pm        Tue Feb 11 20:17:34 2020
@@ -14,6 +14,7 @@

 use strict;
 use warnings;
+use Scalar::Util 'isweak';

 our $VERSION = '2.037'; # VERSION

@@ -112,7 +113,7 @@
 sub release {
     my ($self) = @_;

-    my @tofree = values %$self;
+    my @tofree = grep { !isweak $_ } values %$self;
     %$self = ();

     while (my $item = shift @tofree) {

Tue Feb 11 12:41:58 2020 futuramedium [...] yandex.ru - Correspondence added

Sorry, it looks I added "fixed in 2.037"; I tried to add "my line numbers are for 2.037".

PhilterPaper commented 4 years ago

Thu Feb 13 09:30:19 2020 winter [...] bfw-online.de - Correspondence added

I can confirm that your patch will silence the warnings (also in our production code).

Thanks

PhilterPaper commented 3 years ago

Tue Nov 10 18:59:23 2020 PMPERRY [...] cpan.org - Correspondence added

I see that the recommended changes were NOT put in PDF::API2 2.038. I tried running the latest PDF::Builder (3.020 prerelease) and got quite a few of the "Deep recursion" warnings. OK, to be expected. Then I made the recommended changes to the two files and ran again. No "Deep recursion" messages (or any other messages), but the PDF file produced will not load (damaged and not repairable).

I took a look through the PDF (out.pdf) and it looks quite strange. Tons and tons of 3 byte Filter objects, and the cross reference table looks very odd: lots of repeated groups of '0000000000 65535 f' followed by 7 copies of '0000079535 00000 n'. The starting addresses vary, of course. I'll try to attach a copy of out.pdf. Any ideas? The OP said it stopped the error messages, but didn't say if it produced a good PDF.

I did the same thing with PDF::API2 2.038, and got the same results.

out.pdf

PhilterPaper commented 3 years ago

Sat Nov 14 17:22:46 2020 futuramedium [...] yandex.ru - Correspondence added

the PDF file produced will not load

Looks like you forgot to binmode output handle under Windows.

I took a look through the PDF (out.pdf) and it looks quite strange.

Is "blank217.pdf" different? I think the command it was generated with and periodic patterns you observe are linked somehow. Things strange, odd, or (perceived as) ugly or inefficient disturb humans. Machines don't care as long as formal syntax is OK.

PhilterPaper commented 3 years ago

Sun Nov 15 18:03:57 2020 PMPERRY [...] cpan.org - Correspondence added

On Sat Nov 14 17:22:46 2020, vadimr wrote:

the PDF file produced will not load

Looks like you forgot to binmode output handle under Windows.

Interesting. The original testcase.pl at the very end is (minus the binmode statement):

    $out = $pdf->stringify ();
    $pdf->end;

    open my $pdf_out, '>', 'out.pdf';
    binmode $pdf_out;          # added per Vadim Repin
    print $pdf_out $out;
    close $pdf_out;

The resulting PDF loads cleanly now, and appears (unlike blank217) 2-up with an outline (bookmarks). Is that the intended result? If so, I think this can be marked "solved"!