[CTS 25] Problem extracting pages from PDF v. 1.6 documents

PhilterPaper / Perl-PDF-Builder

Extended version of the popular PDF::API2 Perl-based PDF library for creating, reading, and modifying PDF documents

https://www.catskilltech.com/FreeSW/product/PDF%2DBuilder/title/PDF%3A%3ABuilder/freeSW_full

[CTS 25] Problem extracting pages from PDF v. 1.6 documents #90

Open carylewis opened 6 years ago

carylewis commented 6 years ago

I am trying to import pages from a set of PDFs generated by a third party. I had been using PDF::API2 but have encountered issues where the extracted pages result in PDFs that do not display correctly.

I am encountering the same issues with PDF::Builder.

After extracting one page and saving the document, and verifying the document with ghostscript, I see these errors:

gs -dNOPAUSE -dBATCH -sDEVICE=nullpage new.pdf
GPL Ghostscript 9.23 (2018-03-21)
Copyright (C) 2018 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 1.
Page 1
**** Error reading a content stream. The page may be incomplete.
Output may be incorrect.
Error: File has unbalanced q/Q operators (too many Q's)
Output may be incorrect.
Error: Form stream has unbalanced q/Q operators (too many q's)
Output may be incorrect.
Error reading a content stream. The page may be incomplete.
Output may be incorrect.
Error: File did not complete the page properly and may be damaged.
Output may be incorrect.

This file had errors that were repaired or ignored.
The file was produced by:
>>>> PDF::Builder 3.009 [see https://github.com/PhilterPaper/Perl-PDF-Builder/blob/master/SUPPORT] <<<<
Please notify the author of the software that produced this file
that it does not conform to Adobe's published PDF specification.
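For context, the q/Q messages refer to the graphics-state save (q) and restore (Q) operators, which are expected to pair up within each content stream. The following is a minimal sketch for checking that balance by hand, assuming the content stream has already been decompressed to text. The function name `qQ_balance` is hypothetical and not part of PDF::Builder, and the tokenizing is deliberately naive (it does not handle inline image data).

```perl
use strict;
use warnings;

# Hypothetical helper: report q/Q nesting for a decoded content stream.
# Returns (final_depth, lowest_depth):
#   final_depth  > 0 means unmatched q operators (too many q's)
#   lowest_depth < 0 means a Q appeared with no matching q (too many Q's)
sub qQ_balance {
    my ($content) = @_;

    $content =~ s/\\.//gs;                    # drop escaped characters such as \( and \)
    1 while $content =~ s/\([^()]*\)//g;      # strip literal strings, innermost first
    $content =~ s/<[0-9A-Fa-f\s]*>//g;        # strip hex strings
    $content =~ s/%[^\r\n]*//g;               # strip comments

    my ($depth, $lowest) = (0, 0);
    # Only count q and Q that stand alone as whitespace-delimited tokens.
    for my $op ($content =~ /(?<!\S)([qQ])(?!\S)/g) {
        $depth += ($op eq 'q') ? 1 : -1;
        $lowest = $depth if $depth < $lowest;
    }
    return ($depth, $lowest);
}
```

Running this on the page stream and on each form XObject stream separately would show which one carries the extra operators.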

This is the perl script:

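use strict;
use warnings;
use PDF::Builder;

# Create an empty target document and open the third-party source PDF.
my $pdf = PDF::Builder->new();
my $old = PDF::Builder->open('orig.pdf');

# Copy page 2 of the source into the new document, then write it out.
my $page = $pdf->import_page($old, 2);
$pdf->saveas('new.pdf');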

I have attached the orig.pdf.

Thanks for any help or insights you can provide.

I also tried PDF::Extract, which successfully extracted the two pages into separate documents. Those documents were displayable, but PDF::Builder could not extract pages from them either.

Converting orig.pdf to PDF v. 1.4 allows PDF::Builder to work, but using Ghostscript to convert every file to 1.4 does not scale well.
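For reference, the conversion workaround can be sketched as follows. This is a minimal sketch, assuming `gs` is on the PATH; the helper name `gs_downgrade_cmd` and the output file name are hypothetical. It uses Ghostscript's `pdfwrite` device with `-dCompatibilityLevel=1.4`.

```perl
use strict;
use warnings;

# Hypothetical helper: build the Ghostscript argument list that rewrites
# a PDF at compatibility level 1.4.
sub gs_downgrade_cmd {
    my ($in, $out) = @_;
    return (
        'gs', '-q', '-dNOPAUSE', '-dBATCH',
        '-sDEVICE=pdfwrite',
        '-dCompatibilityLevel=1.4',
        "-sOutputFile=$out",
        $in,
    );
}

# Run the conversion only when an input and an output file are given,
# e.g.: perl downgrade.pl orig.pdf orig-1.4.pdf
if (@ARGV == 2) {
    system(gs_downgrade_cmd(@ARGV)) == 0
        or die "Ghostscript failed with status $?\n";
}
```

Passing the command as a list to `system` avoids shell quoting problems with file names. Each file still costs a full Ghostscript run, which is the scaling problem described above.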

orig.pdf

PhilterPaper commented 6 years ago

PDF::Builder (as well as PDF::API2) is known to have problems with PDFs of version 1.5 and up. I tried splitting all run-together lines (at ^M), but it didn't seem to work, so there may be something else. You say it works OK as version 1.4. I take it you can't create it originally as PDF 1.4?

If you (or someone) can isolate the PDF 1.5+ statements that are causing the trouble, we could consider adding code to support these statements. I will mark this "help wanted" in case someone can offer help.
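As a possible first step toward isolating such statements, here is a minimal sketch that scans the raw bytes of a PDF for the header version and for dictionary keys tied to features introduced in PDF 1.5 (object streams, cross-reference streams, and the JPXDecode filter). The function name `scan_pdf_features` is hypothetical. It is a plain text scan and not a parser, so it can miss anything stored inside compressed streams, and it ignores a `/Version` entry in the catalog.

```perl
use strict;
use warnings;

# Hypothetical helper: count markers of PDF 1.5+ structures in raw PDF bytes.
sub scan_pdf_features {
    my ($bytes) = @_;

    # The header is expected near the start of the file.
    my ($version) = substr($bytes, 0, 1024) =~ /%PDF-(\d+\.\d+)/;

    my %found = (
        version        => $version,
        object_streams => scalar(() = $bytes =~ m{/Type\s*/ObjStm\b}g),
        xref_streams   => scalar(() = $bytes =~ m{/Type\s*/XRef\b}g),
        jpx_images     => scalar(() = $bytes =~ m{/JPXDecode\b}g),
    );
    return \%found;
}

# Print a short report when a file name is given on the command line.
if (@ARGV) {
    open my $fh, '<:raw', $ARGV[0] or die "Cannot open $ARGV[0]: $!\n";
    my $bytes = do { local $/; <$fh> };
    my $report = scan_pdf_features($bytes);
    printf "%s: %s\n", $_, $report->{$_} // 'unknown' for sort keys %$report;
}
```

Comparing the counts for a failing file against its working 1.4 conversion could narrow down which structure is involved.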

carylewis commented 6 years ago

I don’t know how to isolate the offending bits. I suspect it has something to do with the metadata. The copied page is somewhat visible but with lots of weird repeating rectangles, so maybe there is something not being copied correctly, like image size?

Could it be a character encoding issue?

PhilterPaper commented 6 years ago

If it worked when converted to PDF 1.4, I doubt it's a character encoding issue. I suspect there is something at PDF 1.5 or 1.6 that is not being processed correctly. It might very well be in the metadata. I hope to get some time soon to examine it more deeply.

carylewis commented 6 years ago

Thanks for the replies, by the way, it is appreciated.

I agree with you that it's not an encoding issue.

I did some more digging, using iText RUPS, and it appears that PDF::API2 and PDF::Builder cannot handle the newer PDF v. 1.6 technique of indirect objects.

But the structure of the PDF I uploaded is quite complex, and I can't say exactly what is wrong.

Ghostscript version 9.23 can convert these documents to v. 1.4, but the way it does it seems very different than how the perl libraries do it.

I am willing to help of course, if you come across anything and need someone to do some coding, please let me know.

Does PDF::Builder use PDF::API2?

PhilterPaper commented 6 years ago

Does PDF::Builder use PDF::API2?

PDF::Builder is a fork of PDF::API2. It is built on the PDF::API2 2.029 code base (with updates) and is still largely compatible with PDF::API2. I'm trying to keep existing interfaces as compatible as possible as I fix bugs and add new function.

The direct answer to your question is "no". It does not pull in or use the PDF::API2 library.

carylewis commented 5 years ago

Are there any plans on changing the format of the produced pdf to 1.5 or above?

I’m seeing more and more jp2 files, and PDF 1.4 doesn’t support that file type. This necessitates converting files to 1.4, which means converting the JPEG2000 images to TIFF, and that is slow.

PhilterPaper commented 5 years ago

Yes, I need to do something to properly handle read-in PDFs of 1.5+, and allow production of PDF 1.5+. I'm still pondering what the best way to do this would be. I'm not sure that PDF::Builder (or its predecessor, PDF::API2) even fully implements PDF 1.0, much less 1.4, not to mention higher levels. See #93, and contributions and thoughts are welcome.

PhilterPaper commented 5 years ago

A Standing Invitation for Contributors

PDF::Builder is known to be incompatible with many PDFs of level 1.5 and up, when they are read in. I know of only one PDF 1.5+ feature that is implemented (cross reference streams) -- it may be assumed that many features first appearing in 1.5 will cause problems. Hell, not even all of PDF 1.0-1.4 is implemented, so other problems could be encountered.

You are invited to isolate PDF incompatibilities (whether found in 1.5 and up, or in earlier versions), and specifically report them (in a new bug thread). With enough detailed information, I can consider implementing them (code contributions are, of course, welcome!). PDF::Builder won't become PDF-1.7 compatible overnight, but at least we can keep chipping away at it.

By the way, does anyone know of a good tool to "dump" a PDF into XML or some other human-readable format? That could make diagnosing and understanding a problem much easier. Even better, the tool can allow hand-editing of the content and convert it back into PDF (binary conversion and compression). There are a few tools that more or less do this (at least, the dump), but they're either expensive or require that the PDF be uploaded to another site. If anyone's looking for a new Perl-based CPAN project, possibly using PDF::Builder as a library, this could be something good!

Incidentally, I have implemented the automatic "bump" of PDF version level for input PDFs and output features (none yet) mentioned in the previous post, so we're ready on that front.
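To illustrate the idea of such a version "bump", here is a minimal sketch, not PDF::Builder's actual implementation. The function name `max_pdf_version` is hypothetical, and the 1.4 baseline in the usage note is an assumption. The output level is taken as the highest of the baseline, the versions of any input PDFs, and the minimum version each feature in use requires.

```perl
use strict;
use warnings;

# Hypothetical helper: return the highest PDF version from a list of
# "major.minor" strings, comparing major and minor parts numerically.
sub max_pdf_version {
    my @versions = @_;
    my ($highest) = sort {
        my ($a_major, $a_minor) = split /\./, $a;
        my ($b_major, $b_minor) = split /\./, $b;
        $b_major <=> $a_major || $b_minor <=> $a_minor;   # descending order
    } @versions;
    return $highest;
}
```

For example, `max_pdf_version('1.4', '1.6', '1.5')` picks the output level for a 1.4 baseline, a 1.6 input file, and a feature that needs 1.5.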

PhilterPaper commented 5 years ago

A note that I have split out the request for JPEG2000 support into a new thread.

The first PDF > 1.4 item released is PNG support for 16 bit samples and interlacing.

Cary, if you (or anyone else) care to chip in with code, or at least algorithms (or even just pointers to solid documentation or libraries) for handling some of these more advanced features, I'd appreciate it. It sounds like Indirect Objects (1.4?) might be a good priority item to work on, followed by JPEG2000 (1.5). Object /Length fields using X 0 R format are considered indirect objects, and should certainly be supported if they aren't already. Object streams in general (1.5) ought to be looked at, too.
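To make the indirect `/Length` case concrete, here is a minimal sketch of resolving a `/Length X 0 R` reference by looking up the referenced object in the uncompressed text of a PDF. The function name `resolve_length` is hypothetical and this is not how PDF::Builder does it. It uses regular expressions in place of a real parser, so it would not find objects stored inside compressed object streams.

```perl
use strict;
use warnings;

# Hypothetical helper: return the stream length named by a stream dictionary,
# following an indirect reference (e.g. "/Length 12 0 R") when present.
# $pdf  is the raw PDF text, $dict is the text of one stream dictionary.
sub resolve_length {
    my ($pdf, $dict) = @_;

    # Indirect form: /Length <object number> <generation> R
    if ($dict =~ m{/Length\s+(\d+)\s+(\d+)\s+R\b}) {
        my ($num, $gen) = ($1, $2);
        # The referenced object is expected to hold a bare integer.
        my ($value) = $pdf =~ /(?<!\d)$num\s+$gen\s+obj\s+(\d+)\s+endobj/;
        return $value;    # undef if the object was not found
    }

    # Direct form: /Length <integer>
    my ($direct) = $dict =~ m{/Length\s+(\d+)};
    return $direct;
}
```

A reader that only handles the direct form would misread the stream boundaries whenever the indirect form appears, so this is a cheap case to test against real-life files.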

Real-life PDF files that show some of these features would be very useful for development and testing, if anyone can supply them (please be careful about proprietary or legally-protected information!).