PhilterPaper / Perl-PDF-Builder

Extended version of the popular PDF::API2 Perl-based PDF library for creating, reading, and modifying PDF documents
https://www.catskilltech.com/FreeSW/product/PDF%2DBuilder/title/PDF%3A%3ABuilder/freeSW_full

from_string() causes infinite loop/memory leak #212

Closed PhilterPaper closed 1 month ago

PhilterPaper commented 4 months ago

In ssimms/pdfapi2/issues/78, @neffets reported that an attempt to read in a PDF (into PDF::API2) appears to create an infinite loop and ever-growing memory usage:

PDF::API2->from_string causes for action "page1-to-thumbnail" an OOM (memory-leak, loop)

We have a normal PDF created with "Acrobat PDFMaker for Word". It has only 6 pages.

We are trying to generate a thumbnail from the PDF with PDF::API2:

my $file = '2024_Q1_-digitale-Veranstaltungen_de.pdf';
my $data = do {
    local $/ = undef;
    open my $fh, "<", $file or die "could not open $file: $!";
    <$fh>;
};

my $pdf = PDF::API2->from_string($data);

# never gets here

my $sp_pdf = PDF::API2->new;
eval {
    $sp_pdf->import_page($pdf, 1, 0);
};
if ($@) {
    warn $@;
}
my $image = Image::Magick->new;
my $error = $image->BlobToImage($sp_pdf->stringify);

It hangs forever on the first from_string() call, and resident memory keeps growing (about 1 GB more every 300 seconds).

The workaround is to wrap the from_string() call in a POSIX::sigaction handler with alarm(10).

[2024_Q1_-digitale-Veranstaltungen_de.pdf](https://github.com/PhilterPaper/Perl-PDF-Builder/files/14440700/2024_Q1_-digitale-Veranstaltungen_de.pdf)
[tixA78.pl.txt](https://github.com/PhilterPaper/Perl-PDF-Builder/files/14440703/tixA78.pl.txt)
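The report does not show the workaround itself. A minimal sketch of an alarm-based timeout built on POSIX::sigaction might look like the following; the helper name with_timeout and the "timeout" message are made up for illustration and are not taken from the reporter's script.

```perl
use strict;
use warnings;
use POSIX qw(SIGALRM);

# Hypothetical helper: run $code, but give up after $seconds.
# Returns the code's (scalar) return value, or dies with "timeout\n".
sub with_timeout {
    my ($seconds, $code) = @_;

    # A handler installed through POSIX::sigaction is not deferred by
    # Perl's "safe signals" unless ->safe(1) is set, so it can interrupt
    # long-running code. The trade-off is that interrupted code may be
    # left in an inconsistent state.
    my $action = POSIX::SigAction->new(
        sub { die "timeout\n" },
        POSIX::SigSet->new(SIGALRM),
    );
    my $old = POSIX::SigAction->new;
    POSIX::sigaction(SIGALRM, $action, $old);

    my $result = eval {
        alarm($seconds);
        my $r = $code->();
        alarm(0);
        $r;
    };
    my $err = $@;

    alarm(0);                           # make sure no alarm is left pending
    POSIX::sigaction(SIGALRM, $old);    # restore the previous handler
    die $err if $err;
    return $result;
}

# Assumed usage, mirroring the alarm(10) mentioned in the report:
#   my $pdf = with_timeout(10, sub { PDF::API2->from_string($data) });
```

This only bounds the time spent in the call; memory allocated before the timeout fires may not be returned to the system, so running the parse in a child process could be a more robust variant.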

==========================================================================

I would not be surprised if this doesn't work. The header claims to be PDF-1.6, which is beyond what PDF::Builder or PDF::API2 supports, and I see that the first object is an "object stream", which I know is not supported by either library. That alone could well be killing it.
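A quick way to spot both conditions before handing a file to the library is to look at the raw bytes. This is only a rough sketch: pdf_quick_check is a hypothetical helper, it relies on regexes rather than a real parser, and it ignores a /Version entry in the catalog, which can override the header.

```perl
use strict;
use warnings;

# Hypothetical helper: a rough, regex-based look at raw PDF bytes.
# Stream dictionaries are stored uncompressed, so /Type /ObjStm and
# /Type /XRef are visible even when the stream contents are not.
sub pdf_quick_check {
    my ($data) = @_;

    # The header should appear near the start of the file.
    my ($version) = $data =~ /\A.{0,1024}?%PDF-(\d+\.\d+)/s;

    return {
        version        => $version,                               # undef if no header found
        object_streams => ($data =~ m{/Type\s*/ObjStm\b} ? 1 : 0),
        xref_streams   => ($data =~ m{/Type\s*/XRef\b}   ? 1 : 0),
    };
}

# Assumed usage with the $data string read in the report:
#   my $info = pdf_quick_check($data);
#   warn "PDF $info->{version} uses object streams\n" if $info->{object_streams};
```

A hit on object_streams does not prove the library will hang; it just flags the feature named above as unsupported.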

Is there any chance of producing the original PDF at level 1.4? If you can, it would be interesting to see if it works then.

By the way, this seems a rather convoluted way to extract a page and put it into another PDF (or whatever you do with the page image). Consider PDF::API2->open(PDF file) instead of reading the file into a string and then calling from_string().
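As a sketch of that suggestion, the slurp-and-from_string() step can be replaced by a direct open(). The helper name first_page_as_pdf is hypothetical, and this is offered as a simplification only, not as a fix for the hang described above.

```perl
use strict;
use warnings;
use PDF::API2;

# Hypothetical helper: copy page 1 of the PDF at $path into a new
# document and return that document serialized as a string.
sub first_page_as_pdf {
    my ($path) = @_;

    my $source = PDF::API2->open($path);    # parse directly from disk
    my $target = PDF::API2->new;

    # Source page 1, target position 0 (append), as in the report.
    $target->import_page($source, 1, 0);

    # Same serialization call the report uses; $source must still be
    # alive here because the imported page refers back to it.
    return $target->stringify;
}

# Assumed usage:
#   my $blob = first_page_as_pdf('2024_Q1_-digitale-Veranstaltungen_de.pdf');
```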

PhilterPaper commented 3 months ago

...and if all you're doing is creating an image of the first page, it might be even easier to open the PDF in GIMP (Create > Open Webpage, crop to size, scale) and save as a JPEG or PNG image. You might be able to script it to do it for you (GIMP's Python-like scripting language), if you plan to do a lot of these.