PhilterPaper / Perl-PDF-Builder

Extended version of the popular PDF::API2 Perl-based PDF library for creating, reading, and modifying PDF documents
https://www.catskilltech.com/FreeSW/product/PDF%2DBuilder/title/PDF%3A%3ABuilder/freeSW_full

[RT 122962] Reusing PDF::API2 objects for different PDFs #78

Open PhilterPaper opened 7 years ago

PhilterPaper commented 7 years ago

Subject: Reusing PDF::API2 objects for different PDFs

Date: Tue, 5 Sep 2017 12:55:47 +0100

From: Andrew Beverley <andy [...] andybev.com>

Firstly, thanks for a great module. I am using it to generate a PDF with many pages. Producing the whole PDF as one object in one go uses huge amounts of memory, so I now produce each page as its own PDF, one by one, and then concatenate them afterwards using CAM::PDF.

This works well, in that significantly less memory is used, but it is slow, as I am creating a new PDF::API2 object each time.

From the small amount of profiling I have done, a lot of time seems to be spent adding the TTF fonts. I wondered, is there some way to reuse the PDF::API2 object (or just the fonts) and create a fresh page each time?

I have tried various hacks (I won't detail them all here), such as reusing the ttfont object in multiple PDFs, deleting the pages from the object, and so on, but I couldn't get any to work.

Do you have any suggestions please? If you do, and it involves some coding, I would be happy to investigate providing a patch.

Thanks, Andy

PhilterPaper commented 7 years ago

on Tue Sep 05 20:01:42 2017 steve [...] deefs.net - Correspondence added

There are probably some ways to speed up that operation, but depending on what kind of coding you're up for trying, it might be possible to solve your original problem instead.

Take a look at my comments on ticket 113516. Currently, when PDF::API2 opens a file, it reads the whole thing into memory, but that wasn't always the case, and the code that PDF::API2 is built on top of doesn't require that everything be loaded in memory either.

It's theoretically possible for you to create a number of pages, write those out to disk, free up the memory, and repeat, without closing and reopening the file. If you want to start down that trail, look at PDF::API2->finishobjects() and follow the path for details about writing out a file in chunks.

Freeing the memory without closing the file may be trickier (I haven't looked into that yet). I'm guessing it'll involve the release_obj() call in PDF::API2::Basic::PDF::File -- if I'm reading the code correctly, that will remove it from the various caches, but without actually removing it from the PDF. The release() call will almost definitely free the memory, but I think that's only supposed to be called when you're done with the file.

As an aside, several comments in the code mention circular references. As of a release or two ago, those should no longer exist (if you find any, please give me a test case), so that should simplify things.

If you get to a point where you can call finishobjects() more than once and get a working file, but are still running out of memory, let me know (preferably with sample code) and we can dive into that problem more deeply.

If that ends up being too complicated and you'd rather keep trying to speed up the ttfont calls, it should be possible to reuse the time-consuming part of that object's creation. It may be as simple as calling $new_pdf->{'pdf'}->new_obj($font_object_from_old_pdf) instead of $new_pdf->ttfont(...). That definitely wouldn't qualify as intended/supported behavior, but it might work.

-- Steve

on Tue Sep 05 20:01:42 2017 The RT System itself - Status changed from 'new' to 'open'

PhilterPaper commented 7 years ago

on Tue Sep 12 06:52:45 2017 andy [...] andybev.com - Correspondence added

Hi Steve, thanks for the quick and comprehensive reply. I've spent a while trying your suggestions (comments below), but am unfortunately no further forward.

At this point I should say that this is more of a nice-to-have than an essential requirement, so if there are no quick wins for either of us then I will be happy for you to close the ticket. Have a look at the below if you get the time anyway, and let me know what you think.

> Take a look at my comments on ticket 113516. Currently, when PDF::API2 opens a file, it reads the whole thing into memory, but that wasn't always the case, and the code that PDF::API2 is built on top of doesn't require that everything be loaded in memory either.

Thanks. I don't think this particular information helps, as I am writing out, not reading.

> It's theoretically possible for you to create a number of pages, write those out to disk, free up the memory, and repeat, without closing and reopening the file. If you want to start down that trail, look at PDF::API2->finishobjects() and follow the path for details about writing out a file in chunks.
>
> Freeing the memory without closing the file may be trickier (I haven't looked into that yet). I'm guessing it'll involve the release_obj() call in PDF::API2::Basic::PDF::File -- if I'm reading the code correctly, that will remove it from the various caches, but without actually removing it from the PDF. The release() call will almost definitely free the memory, but I think that's only supposed to be called when you're done with the file.

I've spent a while playing around with the above. I seem to be able to write out a PDF in chunks, but whenever I try to do so along with calls to free the memory, I run into problems. Calling finishobjects() on its own doesn't seem to make any difference to memory use, and whenever I combine it with something like save() or release_obj() I get:

Can't call method "new_obj" on an undefined value at /usr/share/perl5/PDF/API2/Basic/PDF/Pages.pm line 92

> If you get to a point where you can call finishobjects() more than once and get a working file, but are still running out of memory, let me know (preferably with sample code) and we can dive into that problem more deeply.

I should have said before that I am using PDF::TextBlock. I don't think this affects the principle though, as I run into similar problems if I remove it and write lots of text using raw calls.

Anyway, FWIW, here is a MWE:

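use strict;
use warnings;
use PDF::API2;
use PDF::TextBlock;

# Open the PDF with a target file so that objects can be written out in chunks
my $pdf = PDF::API2->new(-file => 'mypdf.pdf');

for my $count (1..100) {
    my $page = $pdf->page;
    my $tb = PDF::TextBlock->new({
        pdf  => $pdf,
        page => $page,
        x    => 100,
        y    => 100,
    });
    for my $count2 (1..20) {
        $tb->text("Text $count2");
        $tb->apply;
    }
    # Write out the objects created so far (on its own this made no
    # noticeable difference to memory use, as described above)
    $pdf->finishobjects;
}
$pdf->save;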

> If that ends up being too complicated and you'd rather keep trying to speed up the ttfont calls, it should be possible to reuse the time-consuming part of that object's creation. It may be as simple as calling $new_pdf->{'pdf'}->new_obj($font_object_from_old_pdf) instead of $new_pdf->ttfont(...). That definitely wouldn't qualify as intended/supported behavior, but it might work.

Given the relatively modest potential gains, I've decided this is probably best avoided!

Thanks again, and please do feel free to close this ticket if it all looks like too much hassle.

Andy

PhilterPaper commented 6 years ago

Sat Jun 02 17:55:24 2018 steve [...] deefs.net - Status changed from 'open' to 'resolved'

PhilterPaper commented 4 years ago

Revisiting this ticket, it sounds like the basic problem is one of running out of memory while creating a multi-page PDF in one go. Note that common resources such as fonts should be loaded once, above page level, so that they can be shared by multiple pages. If the OP is opening the same font on each page, that could leave many duplicate font objects in memory at once. I would welcome some examples of this problem, and of what was tried to fix it.
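As a rough illustration of loading a font above page level, here is a minimal sketch. It assumes PDF::API2 is installed; the helper name build_pdf is hypothetical, and corefont('Helvetica') stands in for the OP's TTF font only so that no font file is needed. A ttfont() call would be hoisted out of the loop in the same way.

```perl
use strict;
use warnings;
use PDF::API2;

# Hypothetical helper: build an N-page PDF and return it as a string.
sub build_pdf {
    my ($page_count) = @_;
    my $pdf = PDF::API2->new();

    # Load the font ONCE, above page level, instead of once per page.
    # (Assumption: a core font is used here to avoid needing a .ttf file.)
    my $font = $pdf->corefont('Helvetica');

    for my $n (1 .. $page_count) {
        my $page = $pdf->page();
        my $text = $page->text();
        $text->font($font, 12);       # every page reuses the same font object
        $text->translate(100, 700);
        $text->text("Page $n");
    }

    return $pdf->stringify();
}

my $bytes = build_pdf(3);
printf "generated %d bytes\n", length $bytes;
```

This does not address the memory held by the page content itself; it only avoids creating one font object per page.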

Routinely combining multiple single-page PDFs should be avoided, as each will carry its own font objects, etc., leading to unnecessary duplication. That sounds like it was the origin of the request to reuse objects. I suppose a utility could be written to examine the objects in a PDF and consolidate duplicates into one (change all references to point to the same object, and erase the duplicates). The cross-reference (xref) table and its byte offsets would then have to be recalculated. If it works, would this be practical to incorporate into the save* methods?
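To make the consolidation idea concrete, here is a toy sketch in plain Perl. It is not PDF::Builder or PDF::API2 code: the function consolidate_duplicates and the data model (object number mapped to a serialized body, with references written as "N 0 R") are assumptions for illustration only. It repeats the merge until nothing changes, because merging two identical font descriptors can make the fonts that point at them identical in turn.

```perl
use strict;
use warnings;

# Toy model (hypothetical, not a real PDF API): object number => body text,
# with indirect references written as "N 0 R".
sub consolidate_duplicates {
    my (%objects) = @_;

    while (1) {
        my (%canonical, %remap);

        # The lowest-numbered object with a given body is the one we keep.
        for my $id (sort { $a <=> $b } keys %objects) {
            my $body = $objects{$id};
            if (exists $canonical{$body}) {
                $remap{$id} = $canonical{$body};
            }
            else {
                $canonical{$body} = $id;
            }
        }
        last unless %remap;    # no duplicates left: done

        # Erase the duplicates, then repoint references at the kept copies.
        delete @objects{ keys %remap };
        for my $id (keys %objects) {
            $objects{$id} =~
                s/\b(\d+) 0 R\b/(exists $remap{$1} ? $remap{$1} : $1) . ' 0 R'/ge;
        }
    }
    return %objects;
}

my %objects = (
    1 => '<< /Type /Page /Contents 7 0 R /Font 3 0 R >>',
    2 => '<< /Type /Page /Contents 8 0 R /Font 4 0 R >>',
    3 => '<< /Type /Font /FontDescriptor 5 0 R >>',
    4 => '<< /Type /Font /FontDescriptor 6 0 R >>',
    5 => '<< /Type /FontDescriptor /FontName /Demo >>',
    6 => '<< /Type /FontDescriptor /FontName /Demo >>',
    7 => 'page one content',
    8 => 'page two content',
);

my %result = consolidate_duplicates(%objects);
print scalar(keys %result), " objects remain\n";    # prints: 6 objects remain
```

A real utility would need much more than this: it would have to parse actual objects and streams, respect generation numbers, leave objects such as pages alone even when their bodies happen to match, and rebuild the xref table afterwards.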