Reading and modifying a read-only PDF creates corrupted output

PhilterPaper commented 1 month ago

In ssimms/pdfapi2/issues/86, @muehlenp reports:

If the input PDF file is read-only the output PDF is corrupt. E.g. KDE's okular shows "Some errors were found in the document, Okular might not be able to show the content correctly" and Ghostscript prints "Dereference of free object 22, next object number as offset failed (code = -18), returning NULL object." - anyway, both show the PDF. I was told, some windows PDF apps can't view the files. With the old version v2.038 of PDF::API2 this does not happen.

To reproduce I used this perl script and the attached input pdf:

#!/usr/bin/perl -w
use PDF::API2;

my $pdf=PDF::API2->open('test_input.pdf');
my $page=$pdf->openpage(1);
my $text=$page->text();
my $font=$pdf->font('Helvetica-Bold');
$text->font($font, 20);
$text->position(200, 700);
$text->text('Hello World!');
$pdf->saveas('test_output.pdf');

Sample input PDF: test_input.pdf Running the script with a read-write input file, the output is ok: test_output_from_rw-ok.pdf After making the input file read-only (e.g. chmod 444 test_input.pdf), the output is corrupt: test_output_from_ro-error.pdf

Tested with versions 2.043, 2.045 and 2.047 on Ubuntu and openSuse systems.

PhilterPaper: I have not yet tried reproducing this. On my Windows system it should be possible to use attrib to set the Read-Only flag.

PhilterPaper commented 1 month ago

Well, I did attrib +R test_input.pdf and it was apparently Read-Only (could not erase it). Ran the script (changed to use PDF::Builder) and the resulting test_output.pdf opened successfully on every Reader I tried (under Windows 10): Adobe Acrobat Reader, XpdfReader, Firefox, Chrome, and MS Edge. Either 'Read-Only' means something different between Windows and Linux, or the bug is not in PDF::Builder.

As I can not reproduce it at this time, I will just have to set it aside and see what happens on API2. If someone wants to try it with PDF::Builder 3.026 (or 3.027 when it comes out), especially on Linux, that would be appreciated.

muehlenp commented 1 month ago

Thanks for pointing out this module, I missed it until now. I tried with 3.026 on Ubuntu and can confirm PDF::Builder works properly with the test script. Will check further with my project where the issue came up.

PhilterPaper commented 1 month ago

OK, good to hear. My guess is that some change was made to PDF::API2 that broke something. I don't do bug fixes on PDF::API2, so I will make no attempt to fix it, but any fix someone provides for PDF::API2 I will check to see if it's needed in PDF::Builder.

I missed it until now.

PDF::Builder is intended to be very compatible with PDF::API2 (actually, a superset of its function), so go ahead and try it for your applications!

vadim-160102 commented 1 month ago

...some change was made to PDF::API2 that broke something ... I will check to see if it's needed in PDF::Builder.

Based on "Changes", I think it could happen relatively long ago, to 2.039, with input files no longer needlessly being slurped in full. Noble effort, which I don't see was ported to PDF::Builder, but maybe very large files are not as important for you. At 1st glance (but not thoroughly tested), a fix to PDF::API2 could be an injection of

$self->{' update'} = 1;

somewhere after this line: (https://github.com/ssimms/pdfapi2/blob/32af5b86539b92b8e4333168b6ce05244613e72e/lib/PDF/API2/Basic/PDF/File.pm#L509)

And then script in original ticket, as well as further benchmark, no longer produce corrupt output. As to "needed in PDF::Builder", compare (to be run in Windows; for lack of better test subject I'm creating a stupid 100+ MB file (which is not "very large", btw), using ImageMagick):

use strict;
use warnings;
use feature 'say';
use Time::HiRes 'time';
use PDF::API2;

my $fn = 'bigfile.pdf';
system "convert rose: -scale 10000% -compress none $fn"
    unless -e $fn;
system "attrib +R $fn";

sub group { $_[ 0 ] =~ s/(\d{1,3}?)(?=(\d{3})+$)/$1,/gr }
sub mem {
    group 
        qx{ typeperf "\\Process(perl)\\Working Set Peak" -sc 1 } 
            =~ /(\d+)\.\d+\"$/m
}

say "<<< Let's add a blank page! >>>";
printf "source file size : %s\n", group -s $fn; 

my $t = time;

my $pdf = PDF::API2-> open( $fn );
$pdf-> page;
$pdf-> save( $fn =~ s/(?=\.pdf$)/+/r );

printf "time             : %.3f\n", time - $t; 
printf "peak memory used : %s\n",   mem;

<<< Let's add a blank page! >>>
source file size : 122,405,882
time             : 0.331
peak memory used : 36,315,136

While with PDF::Builder:

<<< Let's add a blank page! >>>
source file size : 122,405,882
time             : 1.739
peak memory used : 531,079,168

But as I said, large file users may be niche audience considered unimportant :-)

PhilterPaper commented 1 month ago

I'm not convinced that it is necessary to read in a PDF a little at a time, rather than the entire thing at a time. These days we have good virtual memory support, and unless a PDF file is ridiculously large, it should fit in memory. If someone can show realistic PDF files "in the wild" that cause problems that the PDF::API2 RT 113516 patch was trying to fix, I may change my mind. Otherwise, I would like to keep things simple by not getting into complex schemes to manually manage memory use. See #34 too. I seem to recall that there are one or two tickets against PDF::API2 that appear to be a consequence of the RT 113516 patch.

I've been around long enough to remember even mainframe applications that had to manually allocate memory use via overlays and passes. Then VM became a real thing, and everyone could largely quit worrying about their memory footprint.

vadim-160102 commented 1 month ago

That is perfectly fine, no worries. But just to illustrate why I think PDF::API2's was a move in the right direction, though I'm getting completely off topic in this ticket (input will no longer be read-only), let's make the 1st system call unconditional and eliminate the 2nd, in my example script; and then either update or save to the same file name in the end. The timing now becomes

PDF::API2             0.005
PDF::API2 (2.038)     0.778
PDF::Builder          1.555

(I'm not saying this difference is only because the original file was read to RAM in full or not, but because of better handling of required modifications to PDF, in general. Which would suit my use cases -- mostly small incremental in-place tweaks/changes (or just inspect/report) of substantially larger original PDF files -- but I'm using modified CAM::PDF for that :-) )

PhilterPaper commented 1 month ago

Hmm. While that is a noticeable difference in run time, and may be due to not reading so much at once, I'm still leery of doing manual memory management. At least one earlier bug appeared to be caused by the changes, and then there's this one. It's a major change to the architecture of the system and I'm reluctant to make such changes if the only positive result is running a bit faster. I'll have to think about it some more, and may wait until more bugs have been shaken out of it.

Note that, despite the title, this is modifying a copy of a read-only PDF, not the original. It's odd that (I presume) you could take a read-only PDF file, copy it to a read-write file, and do such editing without a problem. A PDF can be encrypted too, but I suspect that's a separate issue from this one. Are there any other "global" attributes to be concerned about? Does the PDF "know" internally that it's read-only (or the surprise comes when you attempt to write the file back out)?

I recall a change to PDF::API2 a few years back that disallowed opening a file for update if the file was read-only. I'm not sure if that's what is in play here... I'm not sure I put it in PDF::Builder (it still may be on my to-do list).

PhilterPaper / Perl-PDF-Builder

Reading and modifying a read-only PDF creates corrupted output #220