Inefficient adding of new pages

In PDF::API2 77 (ssimms/pdfapi2/issues/77), @vadim-160102 reports in "Appending many pages goes quadratic" that the process of adding multiple pages is O(n²):

For e.g. "data merge", VDP printing and similar cases, PDF files with a few thousand pages aren't extraordinary. Content is generated and appended "on the fly" page by page; PDF::API2 can't add pages other than "one by one" anyway.

[slightly modified to run on both API2 and Builder]

use strict;
use warnings;
use Time::HiRes 'time';
use File::Temp 'tempfile';

my $which = 'B'; # A or B for API2 or Builder
if ($which eq 'A') {
    use PDF::API2 2.045;
} else {
    use PDF::Builder;
}

my ( $fh, $tmp ) = tempfile();  # $fh unused filehandle

for (my $n=1000; $n<=5000; $n+=1000) {
    my $t = time();
    my $pdf;
    if ($which eq 'A') {
        $pdf = PDF::API2->new( 'file' => $tmp );
    } else {
        $pdf = PDF::Builder->new( 'file' => $tmp );
    }
    for ( 1 .. $n ) {
        $pdf->page();  # just create a new, empty page after the last
        # this is where the time used goes crazy
    }
    $pdf->save();
    printf("%d\t%.2f\n", $n, time() - $t);
}

[using PDF::API2] It prints:

1000 0.74 2000 3.23 3000 7.47 4000 14.79 5000 24.19

[using PDF::Builder] 1000 0.81 2000 3.25 3000 7.66 4000 14.65 5000 24.02

[Which plots a clearly quadratic curve in both libraries]

Compare:

use PDF::Reuse;
for my $n ( map $_ * 1000, 1 .. 5 ) {
    my $t = time;
    prFile( $tmp );
    for ( 1 .. $n ) {
        prText( 0, 0, 'text' );
        prPage;
    }
    prEnd;
    printf "%d\t%.2f\n", $n, time - $t
}

and:

1000 0.03 2000 0.06 3000 0.08 4000 0.11 5000 0.14

(Had to add at least something to every page with that module)

As Devel::NYTProf reveals, time is wasted in PDF::API2::Basic::PDF::Pages::find_page_recurse. Perhaps appending pages is common enough task, no need to constantly "find" which page to append after i.e. the last.

A patch:

--- PDF/API2/Basic/PDF/Pages orig.pm    Mon Sep 25 20:00:09 2023
+++ PDF/API2/Basic/PDF/Pages.pm Fri Feb 23 11:28:43 2024
@@ -132,15 +132,23 @@
     my ($self, $page, $page_number) = @_;
     my $top = $self->get_top();

-    $page_number = -1 unless defined $page_number and $page_number <= $top->{'Count'}->val();
+    $page_number = -1 unless defined $page_number and $page_number < $top->{'Count'}->val();

     my $previous_page;
     if ($page_number == -1) {
-        $previous_page = $top->find_page($top->{'Count'}->val() - 1);
+        $previous_page = $top->{' last_page'}
+            ? $top->{' last_page'}
+            : $top->find_page($top->{'Count'}->val() - 1);
+        $top->{' last_page'} = $page;
     }
     else {
         $page_number = $top->{'Count'}->val() + $page_number + 1 if $page_number < 0;
-        $previous_page = $top->find_page($page_number);
+        if ($top->{'Count'}->val() == scalar $top->{'Kids'}->realise->elements()) {
+            $previous_page = ($top->{'Kids'}->elements())[$page_number];
+        }
+        else {
+            $previous_page = $top->find_page($page_number);
+        }
     }

     my $parent;
@@ -180,7 +188,7 @@

     my $parent = $self;
     my $max_kids_per_parent = 8; # Why?
-    if (scalar $parent->{'Kids'}->elements() >= $max_kids_per_parent and $parent->{'Parent'} and $page_index < 1) {
+    if (scalar $parent->{'Kids'}->elements() >= $max_kids_per_parent and $parent->{'Parent'} and $page_index < 0) {
         my $grandparent = $parent->{'Parent'}->realise();
         $parent = $parent->new($parent->_pdf(), $grandparent);

Test suite runs OK; benchmark is now tolerable:

1000 0.18 2000 0.32 3000 0.49 4000 0.67 5000 0.84

In the patch, the last line changed is for https://github.com/ssimms/pdfapi2/issues/70 [in Builder already]. 2 shortcuts above it:

The 'last_page' property of pages tree root is created/maintained/used. If PDF::API2 is ever going to shuffle/remove pages, this property should be invalidated by these (non-existent now) calls.
Not for appending, but inserting anywhere in case of "tree" being a simple flat list (***). This shortcut could be used alone instead of (1), which would be slightly slower.

The very 1st line changed by the patch: the <= must be <. I noticed it accidentally while looking at debugging output. Seems harmless, like as if this bug is compensated/cancelled-out somewhere nearby. [OK, the thought of one error "cancelling out" another being "harmless" makes me VERY nervous!]

(+) Please note, large positive argument to PDF::API2::page is clamped to valid range, but large negative one causes a crash with ugly error message. [possibly already fixed in Builder, need to check]

(***) Further note, or rather RFC. ("Patch" is for whomever does VDP, etc. and bumps into performance issues. Slim chance it's accepted into distribution. Then "RFC" is even funnier.) I've never seen issues/slow-downs/RAM-hogs with Reader or other PDF consumer apps if pages "tree" is a flat array/list, even for tens of thousands of pages, compared to a "proper tree". The latest PDF 2.0 spec repeats after its 1990-s predecessor: "balanced tree for performance and/or limited memory". Do they just copy-paste it mindlessly, is/was this consideration relevant? Adding pages into a tree is triggered, in PDF::API2, only very rarely. E.g., benchmarking code above creates a flat list 5000 pages long. OTOH, the PDF::Reuse, also above, does create a tree with no more than 10 branches/leaves per node. Is it worth pursuing this goal e.g. to rebuild_tree?

Regarding the last paragraph, while today's machines are sufficiently virtual memory-rich and fast enough to allow you to get away with many sins, that should not be an excuse for sloppy coding. On the other hand, if "good" coding (better, even "more elegant" algorithms) means that your code becomes convoluted and difficult to understand and maintain, it might be an acceptable tradeoff to go with a simpler algorithm. For example, a Bubble Sort is perfectly acceptable for very small data sets, as its setup time and memory use is low. It's when you start getting into seriously-sized data sets that its O(n²) run time starts to hurt, and you should think about something like a Quicksort.

In some cases, data organized as a long, linear list may be perfectly acceptable, unless you need to "randomly" access some particular member (e.g., a PDF page) within it. In that case, which can happen with PDFs, a well-balanced tree structure would permit finding the desired member the most quickly. If you know for sure that you're going to create a bunch of pages, go through them in order each time, and never try to rearrange them or find a particular page for some purpose, a simple linear list is fine. Otherwise, you want to be able to jump to some page quickly, which recommends a well-balanced tree structure.

I wonder if implementing rebuild_tree() (see #51) would be of help here? It would create a well-balanced page tree, useful for finding a specific page quickly. I'm not sure that the current code does anything beyond breaking up the list of pages into chunks of 8 or so.

The above patch was just released as part of PDF::API2 2.046. I would like to study it a bit and get a better of idea of just what it's doing, and where there might be exposures, before putting it in to PDF::Builder. It probably won't make it to 3.027, but perhaps 3.028.

The latest release of API2 has a somewhat revised add_page():

sub add_page {
    my ($self, $page, $page_number) = @_;
    my $top = $self->get_top();

    $page_number = -1 unless defined $page_number and $page_number < $top->{'Count'}->val();

    my $previous_page;
    if ($page_number == -1) {
        $previous_page = $top->{' last_page'};
        unless (defined $previous_page) {
            $previous_page = $top->find_page($top->{'Count'}->val() - 1);
        }
        $top->{' last_page'} = $page;
    }
    else {
        $page_number = $top->{'Count'}->val() + $page_number + 1 if $page_number < 0;
        $page_number = 0 if $page_number < 0;
        if ($top->{'Count'}->val() == scalar($top->{'Kids'}->realise->elements())) {
            $previous_page = ($top->{'Kids'}->elements())[$page_number];
        }
        else {
            $previous_page = $top->find_page($page_number);
        }
    }

    my $parent;
    if (defined $previous_page->{'Parent'}) {
        $parent = $previous_page->{'Parent'}->realise();
    }
    else {
        $parent = $self;
    }

    my $parent_kid_count = scalar $parent->{'Kids'}->realise->elements();

    my $page_index;
    if ($page_number == -1) {
        $page_index = -1;
    }
    else {
        for ($page_index = 0; $page_index < $parent_kid_count; $page_index++) {
            last if $parent->{'Kids'}{' val'}[$page_index] eq $previous_page;
        }
        $page_index = -1 if $page_index == $parent_kid_count;
    }

    $parent->add_page_recurse($page->realise(), $page_index);
    for ($parent = $page->{'Parent'}; defined $parent->{'Parent'}; $parent = $parent->{'Parent'}->realise()) {
        $parent->set_modified();
        $parent->{'Count'}->realise->{'val'}++;
    }
    $parent->set_modified();
    $parent->{'Count'}->realise->{'val'}++;

    return $page;
}

I wonder if implementing rebuild_tree() (see #51) would be of help here? It would create a well-balanced page tree, useful for finding a specific page quickly.

How's that? Of course it wouldn't help here. Array element access by its index is the fastest possible. And would this "well-balanced page tree" be created from array, or somehow directly from garbage in input file? Therebuild_tree() is only theoretically useful "on file save" as a courtesy for future (and thus, quite hypothetical) PDF consumers of our (created or modified) file, which have to operate under very limited resources (PDF reader for DOS and 640 KB of RAM? Yeah, sure); and which therefore would only have to load/read small portions of Pages tree structure.

The PDF::API2 (and, I think, everybody else these days? Except DOS-like PDF reader, of course) loads the whole tree (be it balanced or not -- can't predict or expect anything, may be just a flat list) and caches pages into array, anyway.

See PDF::API2-> {'pagestack'}, and how PDF::API2::open_page() uses only this array to "find a specific page quickly".

On the other hand, thePDF::API2::Basic::PDF::Pages::find_page(), despite doing seemingly the same thing as method above, does not use this or any other array/cache, but traverses the tree. Why? You probably know history better than me: because it's one (newer) API put on top of another (older) API, with some duplication and/or lack of connection in-between. I wouldn't call the result "sloppy", it's simply good enough for almost everyone using it.

My patch was easy fix for what's required quite often i.e. appending pages. Looking at it again, perhaps it would be less sloppy to implement some kind of similar cache/array at "older API level"; but then it should lead to streamlining everything and quite major re-write.

Also, briefly (sorry for too many words already), a couple things to consider in case of rebuild_tree() idea gets traction. For a new file, it can be done almost transparently. For a file which was opened and its tree modified (a few pages appended, at present -- until insertion/deletion/shuffle is possible), I think implementation should be postponed until something similar to CAM::PDF::cleanoutput exists (i.e. not just incremental update). Plus, inheritance, if it was used in (some parts of)? a tree, would have to be dealt with very carefully on re-build.

(And, sorry but use is executed at compile time, w.r.t. conditional in your test script.)

If the data is stored in a linear array, that should be fairly fast access to any desired element (page), assuming Perl's internal data uses some sort of fixed-stride index. If it's stored in variable-sized blocks (such as a linked list of pages), it may require a read of each block (one at a time) to find the desired one (or in this case, the end of the list). If it's stored in a tree structure (I think a PDF is), you would want it reasonably balanced to minimize the number of node checks needed to reach your desired node. Unfortunately, rebuild_tree() was only implemented as a dummy stub (and not even properly, at that!), so I never pursued it further.

I honestly have not looked very deeply into this stuff, either the internals of a PDF or your proposed fix. I agree that the current API2/Builder code is quadratic in run time (presumably repeatedly scanning the entire length of the array), and could use some sort of improvement. However, I don't want to optimize it for just one specific case, if that could greatly slow down other cases, including display in a PDF Reader! Simply keeping track of the ends of the current list of pages (as opposed to changing the layout of data storage) doesn't sound like it should adversely impact other operations, but I'd want to be comfortable that it doesn't hurt other operations.

I plan to put out 3.027 within a few weeks, but promise to revisit this issue for the next release. If you have any further information on how your fix affects operations other than adding a new page at the end (either in PDF::Builder or in a PDF Reader), I'd be happy to hear about it.

PhilterPaper / Perl-PDF-Builder

Inefficient adding of new pages #211