[CTS 19] Bringing PDF::Builder into the 21st century

was: Bringing PDF::API2 into... « October 21, 2016, 04:58:56 PM »

PDF::API2 was originally written against a fairly early version of PDF, somewhere around 1.4, I believe. The official version is up to 1.7 now, and I wouldn't be surprised if a 1.8 comes out at some point.

Users of PDF::API2 have reported a number of bugs when they import or read PDFs produced on 1.5 and higher systems. While there is probably no harm in outputting PDF 1.4 documents (all later readers and tools should be able to handle them), it is a real problem when PDFs produced on other systems (1.5 and up) need to be handled by PDF::API2-based tools. Sometimes bugs can be very subtle, leading to corrupted PDFs being generated. This is not good.

The question is, should PDF::API2 output at a specific level of PDF, selected by the user, or should it simply output (as well as read or import) the latest and greatest (1.7 at this writing)? Is there any point to outputting PDF 1.4, 1.5, or 1.6 level documents? I suppose that some parties may be using paid older tools and generators that can only handle older documents, but almost all end-users (free readers) will be able to handle 1.7 or higher. How about defaulting to 1.7 (or up) output, unless the user of the PDF::API2 tool explicitly requests lower level (1.4, 1.5, or 1.6) output? If the processing involves reading or importing an existing document whose level is higher than the requested output level, the tool would need to either bump up the output level or fail with an error.
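To make the proposal concrete, here is a minimal, language-neutral sketch (in Python, not PDF::API2's Perl API) of the policy described above. The function name, the `allow_bump` switch, and the use of `(major, minor)` tuples are all assumptions made for illustration only.

```python
# Hypothetical policy sketch: versions are (major, minor) tuples, e.g. (1, 7).
DEFAULT_OUTPUT = (1, 7)  # assumed default, per the proposal above


def resolve_output_version(requested=None, input_version=None, allow_bump=True):
    """Return the PDF version to declare on output.

    requested     -- level explicitly asked for by the user, or None
    input_version -- level of a document being read or imported, or None
    allow_bump    -- if False, fail instead of raising the output level
    """
    target = requested if requested is not None else DEFAULT_OUTPUT
    if input_version is None or input_version <= target:
        return target
    if allow_bump:
        # The imported document is newer than what was requested.
        return input_version
    raise ValueError(
        "input is PDF %d.%d but output was limited to %d.%d"
        % (input_version + target)
    )
```

With this shape, a tool that asked for 1.4 output but imported a 1.6 file would either end up declaring 1.6 or stop with an error, depending on `allow_bump`.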

If PDF::API2 is to allow a specific output level (i.e., using no features found in higher levels), we will need to comb through the code and locate any PDF features found only in higher levels, and guard them with a PDF-level flag. Before doing this, the decision needs to be made whether PDF 1.4 is the lowest supported, or if 1.3 and perhaps lower levels will be supported. At some point, a cut needs to be made -- we don't want to support ancient versions at the expense of complicating the code. It's like writing a web page so it will work well with Internet Explorer 6 -- that's just stupid in this day and age!

PDF import functions (readers) will need to flag higher level PDFs and either fail or bump up the output level, unless some way can be found to monitor all PDF being output. For example, if the tool's function is to remove selected pages, and only those pages (being removed) contain PDF 1.7 constructs, and the remaining pages are PDF 1.4, should such a tool declare the output PDF to be 1.4, or just leave it at 1.7? Would the entire PDF have to be examined (in memory) to determine its PDF level, before any output is written?

There are many issues to be dealt with in making PDF::API2 compatible with current PDF levels, so it should be discussed carefully before we go and invest a lot of time and effort into modifying the code. Let's talk about it!

« May 09, 2017, 11:13:58 AM »

While working in the code, I see a number of references to the PDF 1.7 specification to explain something. This leaves me a bit worried that perhaps a maintainer has unwittingly slipped in some code that produces a PDF using features beyond 1.4 (the stated output version). Anyway, maintainers need to be careful to put in code that only produces PDF 1.4, until such time as PDF version controls are implemented.

If a maintainer wants to show some code that would improve the PDF by using post-1.4 features, that can be done as a comment or disabled code (if (0), etc.) for now. Eventually when PDF versions are implemented, it can be properly integrated into the system.

If a maintainer spots something which is definitely post-1.4, that's a bug and should be reported so it can be disabled for now. If there is no workaround for this in PDF 1.4, that might be a good impetus to go ahead and implement PDF versions. I'm particularly worried about annotation changes, and whether they are PDF 1.4 or later.

« July 22, 2017, 08:18:17 PM »

See https://en.wikipedia.org/wiki/Portable_Document_Format for an overview of PDF and its history (including its PostScript predecessor). Anything there that is not PDF-1.4 would be a nice add, but first requires versioning (or we just say, "to hell with it, just make all PDFs 1.7"). See also the requests for Archival modes and Handicapped Accessibility.

Do we have a complete implementation of AcroForms? It's been around since PDF-1.2. Starting with PDF-1.5, there is Adobe XML Forms Architecture (XFA), which we may want to consider. There are also a number of other features which may not currently be fully implemented, or are post-1.4. Finally, the discussion on PostScript, with its programming control logic, leads to an interesting thought: enhanced PS (or a similar language) input to produce normal PDF pages. Perhaps something like this should be a separate wrapper around PDF::Builder, as could be other preprocessors for HTML, Markdown, and even LaTeX input (for the purpose of producing PDF output).

« November 17, 2017, 12:22:10 PM »

By the way, there is now a PDF 2.0 standard out (https://www.iso.org/standard/63534.html), although I'm not sure how firmly set in stone it is (whether or not it's considered final). It also seems to only be in hardcopy ($$) right now, with no free softcopies yet (e.g., from Adobe).

PDF::Builder still has to come up to speed with 1.7 features before tackling 2.0, but 2.0 should at least be kept in mind as something to allow for when making any major changes to PDF::Builder.

If nothing else, don't store the "version number" as just "6" for "1.6". We would need the full "1.6", since with 2.0 the major number is no longer always 1.
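As an illustration of why the full number matters, here is a small Python sketch (the helper names are hypothetical) that keeps a version as a `(major, minor)` pair. Comparing pairs orders 1.7 before 2.0 correctly, which a bare minor number cannot express.

```python
import re

VERSION_RE = re.compile(r"^(\d+)\.(\d+)$")


def parse_pdf_version(text):
    """Turn a string such as "1.6" into the pair (1, 6)."""
    match = VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError("not a PDF version: %r" % text)
    return int(match.group(1)), int(match.group(2))


def format_pdf_version(version):
    """Turn the pair (1, 6) back into the string "1.6"."""
    return "%d.%d" % version


# Tuples compare element by element, so the major number is checked first.
assert parse_pdf_version("1.7") < parse_pdf_version("2.0")
# Keeping only the minor digit would wrongly rank 2.0 (minor 0) below 1.4.
assert not (0 > 4)
```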

« December 17, 2017, 05:28:48 PM »

I've been mulling over the issue of PDF level support. One complication is that we probably don't even 100% support any given level of PDF, including 1.0! That is, there are probably a few missing features even at that level. Currently, we output with the claim of PDF 1.4, and so far, that looks valid (I have not seen any features beyond PDF 1.4). It is quite possible that a simple PDF may include no features beyond PDF 1.0, but we'll still call it 1.4, rather than going back through the code and marking every feature that is 1.1, 1.2, 1.3, or 1.4.

As we add features beyond 1.4, such as 1.5's cross-reference streams, the output level will have to be bumped up to 1.5 or higher, or else the feature is barred (fatal error?) if we select a 1.4 (or lower) output level. If there is no output until the document is complete, we could allow the output level (in the first line of the PDF) to "float" to whatever we use. If the output level has to be written before processing is complete, we would either have to make a preliminary pass to see what the highest feature level is going to be, or kill the run if a feature (e.g., 1.5 level) is requested beyond the hard limit (e.g., 1.4 level output). Neither is particularly palatable, but we don't want to knowingly output a 1.5 level feature while claiming that the output is 1.4. That could break some PDF processors, readers, and tools.
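The "float, up to a hard limit" idea could look roughly like the following Python sketch. It assumes the header line is only written once the document is complete; the class and method names are invented for illustration and are not part of PDF::Builder.

```python
class OutputLevel:
    """Track the PDF level the finished document will have to declare."""

    def __init__(self, start=(1, 4), hard_limit=None):
        self.level = start            # current (major, minor) output level
        self.hard_limit = hard_limit  # None means "float without limit"

    def require(self, feature_level, feature_name):
        """Record that a feature of the given level is being emitted."""
        if feature_level <= self.level:
            return self.level  # already covered by the declared level
        if self.hard_limit is not None and feature_level > self.hard_limit:
            # Refuse to emit the feature rather than mislabel the file.
            raise RuntimeError(
                "%s needs PDF %d.%d, above the limit of %d.%d"
                % ((feature_name,) + feature_level + self.hard_limit)
            )
        self.level = feature_level  # float the output level upward
        return self.level

    def header(self):
        """First line of the file, written after all content is known."""
        return "%%PDF-%d.%d" % self.level
```

For example, `OutputLevel(hard_limit=(1, 5))` would accept a request for cross-reference streams (1.5) and report `%PDF-1.5` in `header()`, but raise on a 1.7 feature.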

It gets even worse with reading in an existing PDF. Right now, no check is made whether the input is claiming a PDF level above 1.4, nor whether any higher level features are found in the file. For example, PDF::Builder supposedly reads cross-reference streams (a 1.5 feature), but does not write them. Something we could do is to put a limit on the level an input PDF file may claim (default 1.4), and issue a warning if the file claims to be of a higher level than that (a warning that some features of the input PDF may not be supported, and may cause problems). Or, we could assume that the PDF file may actually be of a lower level than claimed, and only flag on a feature-by-feature basis. It does not appear that a read-in PDF is necessarily broken down into individual features (that could be checked for levels), so you might have PDF 1.7 level features hiding in a PDF that claims to be 1.4!
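The "claimed level" check is the simple half of this, and could be sketched as below (Python, hypothetical function names). It only looks at the `%PDF-x.y` header and assumes that header sits at the very start of the data; it says nothing about features actually used inside the file.

```python
import re
import warnings

HEADER_RE = re.compile(rb"%PDF-(\d+)\.(\d+)")


def claimed_version(data):
    """Return the (major, minor) level claimed in the file header."""
    match = HEADER_RE.match(data)  # assumption: header is the first bytes
    if match is None:
        raise ValueError("no %PDF-x.y header at start of data")
    return int(match.group(1)), int(match.group(2))


def check_input_level(data, limit=(1, 4)):
    """Warn when the input claims a level above the supported limit."""
    version = claimed_version(data)
    if version > limit:
        warnings.warn(
            "input claims PDF %d.%d, above the supported %d.%d; "
            "some features may not be handled" % (version + limit)
        )
    return version
```

Calling `check_input_level(b"%PDF-1.7\n...")` with the default limit returns `(1, 7)` and emits one warning; a 1.3 file returns `(1, 3)` silently.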

At a minimum, the output level should be at least the highest level of any PDF read in, which is what the PDF spec recommends. Beyond that, we could issue a warning for input files above some level, and could float the output level to any higher output features, up to some limit (beyond which the run is killed). So, there may be separate input and output PDF level limits (hard or soft), or just one (for output).

The following changes have been made for PDF::Builder 3.011 (out by the end of 2018):

new(): added option -outver (default 1.4) to set the starting PDF version number.
new(): added option -msgver (default 1) to control whether a warning message is issued when the output PDF version (see -outver) is changed by verCheckInput() or verCheckOutput(). 0 = suppress message, 1 = output to STDOUT.
Added method verCheckInput() to bump up the output PDF version upon reading a PDF file with a higher version number than the current one. Note that this does NOT guarantee that higher level PDF constructs will be properly handled, only that the output version number is bumped up, which the code was sort of doing before.
Added method verCheckOutput() to bump up the output PDF version (if necessary) for any PDF feature greater than 1.4. Currently there are no such features implemented (there are cross-reference streams (1.5), but only upon being read in from a PDF of version 1.5 or higher). If I get the new PNG image code working (which uses libpng.a), that will be the first (PDF 1.5 if 16-bit samples).
The version() method now warns you if you attempt to decrease the output PDF version.

Two t-tests have been updated to use -outver to output PDF 1.5, to avoid a warning message when their PDF 1.5 sample files are read in. The alternative would have been to use -msgver to suppress the warning message, so that the t-test won't have extra output.

I will leave this thread open for a while, to see if any further work needs to be done with this work item.

According to PDF spec 7.5.2,

Beginning with PDF 1.4, the Version entry in the document’s catalog dictionary (located via the Root entry in the file’s trailer, as described in 7.5.5, "File Trailer"), if present, shall be used instead of the version specified in the Header.

NOTE This allows a conforming writer to update the version using an incremental update (see 7.5.6, "Incremental Updates").

Under some conditions, a conforming reader may be able to process PDF files conforming to a later version than it was designed to accept. New PDF features are often introduced in such a way that they can safely be ignored by a conforming reader that does not understand them (see I.2, "PDF Version Numbers").

What can one make of that? (no Stephen Stucker Airplane jokes, please!) Does it mean that if the PDF version in the header conflicts with any version given in the Root, the latter wins? If the header version is updated, should any Root version also be updated?

Addendum: at some point in the last year or two I added code to recognize when the Root PDF version overrides the header version, and earlier than that I added code to update the header version when necessary (e.g., when using a PDF 1.5 feature). However, nothing is done to coordinate the two when writing out or editing an existing PDF.
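One possible way to coordinate the two is sketched below in Python. It rests on an assumption that should be checked against the catalog dictionary description in the spec: that the catalog's Version entry only counts when it is later than the header version. The function names and the "keep both in agreement on a full rewrite" policy are hypothetical, not what PDF::Builder does.

```python
def effective_version(header, catalog=None):
    """Version a reader would use, given header and optional catalog entry.

    Assumption: the catalog entry wins only if it is later than the header.
    Versions are (major, minor) tuples.
    """
    if catalog is not None and catalog > header:
        return catalog
    return header


def coordinate_on_write(target, header, catalog=None):
    """Pick (header, catalog) values for a full rewrite of the file.

    The output never drops below what the input effectively claimed, and a
    catalog entry, if one existed, is kept equal to the new header so the
    two cannot disagree.
    """
    new_header = max(target, effective_version(header, catalog))
    new_catalog = new_header if catalog is not None else None
    return new_header, new_catalog
```

For an incremental update the header cannot be rewritten, so only the catalog entry could be raised; that case is left out of this sketch.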

As discussed in #197, we may need to add Object Stream tolerance to PDF::Builder (it's a 1.5 level feature). At the very least, Builder should recognize if the claimed PDF version is higher than 1.4, and ease off on the integrity checks. At best, Object Streams might be properly handled (or converted to 1.4-level equivalent).

As discussed in #198, some integers with leading 0's and containing only digits 0..7 may be interpreted as octal values. This would not be a PDF level-specific issue, but a matter of how a Reader is implemented (also, Perl seems to do this, so Builder may be affected).
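To show the kind of discrepancy involved, here is a small Python sketch (the helper name is hypothetical) that always reads an integer token as decimal, next to what an octal interpretation of the same digits would give.

```python
import re

INTEGER_RE = re.compile(r"^[+-]?\d+$")


def parse_pdf_integer(token):
    """Read an integer token as base 10, ignoring any leading zeros."""
    if INTEGER_RE.match(token) is None:
        raise ValueError("not an integer token: %r" % token)
    return int(token, 10)  # explicit base: leading zeros never mean octal


decimal_value = parse_pdf_integer("0017")  # 17
octal_value = int("0017", 8)               # 15: same digits read as octal
assert decimal_value != octal_value
```

Forcing the base explicitly at the point where tokens become numbers is one way to keep a parser from depending on the host language's treatment of leading zeros.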

As discussed in several other issues, including #167, some Readers apparently cannot handle certain compression methods and image layouts. I don't know if this is something Builder can detect and warn about, or if we should simply document such things and let the user beware.

PhilterPaper / Perl-PDF-Builder

[CTS 19] Bringing PDF::Builder into the 21st century #93