Skip junk at end of file

PhilterPaper commented 2 months ago

In ssimms/pdfapi2/issues/80, @justinschoeman reported:

We are encountering a large number of files in the wild with junk at the end (usually html from buggy download pages).

The current open() function in Basic/PDF/File.pm gives up after searching only the last 1 kB of the file.

The change below keeps searching all the way back to the beginning of the file (in a horribly inefficient way, with a 1 kB sliding window), but it seems to work:

 #foreach my $offset (1..64) {
 #   $fh->seek($end - 16 * $offset, 0);
 #   $fh->read($buffer, 16 * $offset);
 #   last if $buffer =~ m/startxref($cr|\s*)\d+($cr|\s*)\%\%eof.*?/i;
 #}
 my $scan_length = 16;
 my $scan_start = $end - $scan_length;
 for(;;) {
     $fh->seek($scan_start, 0);
     $fh->read($buffer, $scan_length);
     last if $buffer =~ m/startxref($cr|\s*)\d+($cr|\s*)\%\%eof.*?/i;
     last if $scan_start < 16;
     $scan_start -= 16;
     if($scan_length < 1024) { $scan_length += 16; }
 }

===========================================

Actually, start with $scan_length = 32. The initial 16 is pointless.

my $scan_length = 32;

PhilterPaper commented 2 months ago

I will need to check if PDF::Builder also shows this behavior. It might be a consequence of no longer reading in the entire file, but only "as much as necessary" (see #34), which has not yet been put into Builder.

Can you tell me what the effects are of PDF::API2 (or Builder) trying to read a PDF with HTML junk on the end? Is it sometimes failing to find the trailer information and object index? I can imagine that happening. Of course, it's no longer a valid PDF file, but if something reasonable can be done to help someone with a corrupted PDF, it wouldn't hurt to do so. Also, Builder might be able to trim off the corruption.

Would simply editing a PDF to add 1kB+ of junk to the end do the job of reproducing such cases?

justinschoeman commented 2 months ago

You can simply append junk (in the files I have seen, it is HTML). With less than ~980 bytes of junk, open() should succeed. With much more than that, it fails, because the pattern falls outside the search window.
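For anyone wanting to reproduce this, a minimal sketch of a test-file generator follows. The helper name make_junk_pdf and the filler text are made up for illustration; it uses only core Perl modules and is not part of PDF::API2 or PDF::Builder.

```perl
use strict;
use warnings;
use File::Copy qw(copy);

# Hypothetical helper (not part of PDF::API2 or PDF::Builder): copy a PDF
# and append exactly $junk_bytes bytes of HTML-like filler after its end.
# Returns the size of the resulting file in bytes.
sub make_junk_pdf {
    my ($src, $dst, $junk_bytes) = @_;

    copy($src, $dst) or die "copy $src -> $dst failed: $!";

    # Repeat one filler line often enough, then cut to the requested length.
    my $line = "<p>download page residue</p>\n";
    my $reps = int($junk_bytes / length($line)) + 1;
    my $junk = substr($line x $reps, 0, $junk_bytes);

    open(my $fh, '>>:raw', $dst) or die "cannot append to $dst: $!";
    print {$fh} $junk;
    close($fh) or die "close $dst failed: $!";

    return -s $dst;
}
```

Calling make_junk_pdf('good.pdf', 'junk_2k.pdf', 2048) and then opening junk_2k.pdf should, going by the numbers above, trigger the failure, while a value around 900 should not.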

If open() succeeds, then all other operations work fine, regardless of any junk at the end.

When open() fails, it gives:

Malformed PDF file pdf_CBXKtLGd at /usr/local/share/perl5/5.32/PDF/API2/Basic/PDF/File.pm line 257.

Basically, it just does not find the trailer.

The current algorithm only searches back 1024 bytes, but in this case, the trailer is 500,000 bytes back...

The above change was proposed in order to match, then extend the existing behaviour, but a more efficient option would probably be:

 my $scan_length = 32;
 my $scan_start = $end - $scan_length;
 for(;;) {
     $fh->seek($scan_start, 0);
     $fh->read($buffer, $scan_length);
     last if $buffer =~ m/startxref($cr|\s*)\d+($cr|\s*)\%\%eof.*?/i;
     last if $scan_start < 16;
     $scan_start -= 16;
     if($scan_length < 256) { $scan_length += 16; }
 }

1) Start with at least the last 32 bytes - the first 16-byte read will never match anything.
2) Limit the window to 256 bytes - I have never seen anything pack enough whitespace/comments between startxref and %%eof to warrant more than that.
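If the 16-byte step turns out to be too slow on files with a lot of junk, one alternative is to walk backwards in larger chunks and keep a small overlap so a marker straddling a chunk boundary is still seen. This is only a sketch under assumptions: the sub name find_last_startxref, the 4096-byte chunk size, and the plain \s* in the pattern (instead of $cr from File.pm) are all illustrative, and the chunk size is assumed to be at least as large as the overlap.

```perl
use strict;
use warnings;

# Hypothetical helper, not the code in Basic/PDF/File.pm. Walks backwards
# from the end of a binary-mode filehandle in fixed-size chunks and returns
# the byte offset of the last "startxref <n> %%EOF" marker, or -1 if none.
sub find_last_startxref {
    my ($fh, $chunk, $overlap) = @_;
    $chunk   ||= 4096;   # bytes read per step (assumed value)
    $overlap ||= 256;    # longest marker expected (the 256-byte limit above)

    seek($fh, 0, 2) or die "seek to end failed: $!";
    my $pos  = tell($fh);
    my $tail = '';       # the chunk read in the previous (later) step

    while ($pos > 0) {
        my $len = $pos < $chunk ? $pos : $chunk;
        $pos -= $len;

        seek($fh, $pos, 0) or die "seek to $pos failed: $!";
        my $got = read($fh, my $buf, $len);
        die "short read at $pos" unless defined $got && $got == $len;

        # Append the start of the later chunk so that a marker beginning
        # in this chunk and ending in the next one still matches.
        my $window = $buf . substr($tail, 0, $overlap);

        # Remember the start of the last match in this window.
        my $last = -1;
        while ($window =~ /startxref\s*\d+\s*%%eof/gi) { $last = $-[0]; }
        return $pos + $last if $last >= 0;

        $tail = $buf;
    }
    return -1;
}
```

Each byte is read once, and the regex runs over at most chunk + overlap bytes per step, so the cost grows linearly with the amount of junk rather than with the number of 16-byte steps.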

PhilterPaper commented 2 months ago

Thanks, Justin, for the additional information. At the moment I'm a bit busy (filing tax returns, preparing for surgery next week, and other stuff*), but soon I hope to make some test PDFs to confirm that PDF::Builder needs this fix too (as I suspect it will). Your fixes look reasonable, but I need to make sure they work for both a cross reference table and a cross reference stream, and when changes have been appended to a PDF (multiple %%eof and cross references). I should be able to get this into the next Builder release, probably early summer.

I'll have to think about whether Builder should try stripping off this "junk" if it's rewriting the file ("Don't touch my junk!" :-) ). As I said above, technically it's a corrupted PDF file, but if it's easily fixable, and this sort of problem has been seen "in the wild", it may be worth fixing such a file. It will probably have to be done anyway, if updates are to be done to the PDF.

* including organizing my eclipse photos and video at https://www.catskilltech.com/Eclipse2024/

PhilterPaper commented 1 month ago

Note to self: combine with fixes for ssimms/pdfapi2/issues/41 and any other ticket (either package) dealing with opening for update, read-only PDFs, missing PDFs, etc. If a PDF is due to be updated, it must be writable anyway, and we can go ahead and chop off any trailing junk. All this stuff should be done smoothly in a consistent manner. I don't know if I'll get it into the upcoming release (3.027), but if not, it will certainly be in the next one.

Need to decide whether, if a PDF is writable, we should just go ahead and truncate any "junk" (with an informational message). That way, we can still find the trailer easily. However, it is possible that a user may not want a PDF truncated for some reason. Also, if a PDF is read-only, we'll still need to do an extended search for the trailer.

Add: We probably shouldn't unilaterally truncate a PDF to remove "junk", unless the file is writable and is to be updated (and even then, give a message to the user). A new option remove_junk for any method that opens an existing PDF would truncate a PDF's junk (if writable), regardless of whether it's to be updated, constituting permission to do this. Otherwise, just search for the trailer a lot further back (possibly all the way to the beginning of the PDF file), per the OP's suggestion.
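A rough sketch of how that decision could look, for discussion only. The option name remove_junk is the one proposed above; the sub name handle_trailing_junk, its return values, and the choice to slurp the whole file are assumptions made to keep the example short, not how Builder does or will do it.

```perl
use strict;
use warnings;

# Hypothetical sketch. Only the option name remove_junk comes from the
# proposal above; the sub name and return values are assumed.
# Returns 'no_trailer', 'clean', 'kept', or 'removed'.
sub handle_trailing_junk {
    my ($path, %opt) = @_;

    open(my $fh, '<:raw', $path) or die "cannot read $path: $!";
    my $data = do { local $/; <$fh> } // '';   # slurp: acceptable for a sketch only
    close($fh);

    # Find the end of the last "startxref <n> %%EOF" marker,
    # including one trailing end-of-line sequence if present.
    my $end = -1;
    while ($data =~ /startxref\s*\d+\s*%%EOF(?:\r\n|\r|\n)?/gi) { $end = $+[0]; }
    return 'no_trailer' if $end < 0;

    # Trailing whitespace alone is not treated as junk.
    return 'clean' if substr($data, $end) =~ /\A\s*\z/;

    # Without explicit permission, or on a read-only file, leave it alone;
    # the caller then relies on the extended backward search instead.
    return 'kept' unless $opt{remove_junk} && -w $path;

    my $junk = length($data) - $end;
    truncate($path, $end) or die "cannot truncate $path: $!";
    warn "removed $junk bytes of trailing junk from $path\n";
    return 'removed';
}
```

With this shape, a read-only file or a call without remove_junk never modifies anything, and the informational message is only issued when bytes are actually removed.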

PhilterPaper / Perl-PDF-Builder

Skip junk at end of file #216