digital-preservation / droid

DROID (Digital Record and Object Identification)

PDF version numbers based on deprecated mechanism #114

Open bitsgalore opened 8 years ago

bitsgalore commented 8 years ago

The other week a colleague sent me an unusual PDF that starts with the following header bytes:

%PDF-1.8 

Needless to say there is no such thing as "PDF 1.8"; closer inspection showed that, apart from the erroneous version number in the header, it was just an ordinary PDF 1.7. I threw this file at the latest version of DROID; as it turns out, DROID fails to identify it at all: it won't even say the file is a PDF.

As a test I changed the header line in a hex editor to this:

%PDF-1.7

After this change the file was correctly identified.

I also ran the faulty file through Unix File and Apache Tika. Both tools correctly identified it as application/pdf.

A glance at the PRONOM signature for PDF 1.7 (link here: http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1016&strPageToDisplay=signatures) shows that PRONOM/DROID uses the header to identify the PDF version. However, use of the header for defining the version has been deprecated since PDF 1.4! See e.g. the spec of PDF 1.7 at the link below:

http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf

7.5.2 File Header

The first line of a PDF file shall be a header consisting of the 5 characters %PDF- followed by a version number of the form 1.N, where N is a digit between 0 and 7. A conforming reader shall accept files with any of the following headers:

%PDF-1.0
%PDF-1.1
%PDF-1.2
%PDF-1.3
%PDF-1.4
%PDF-1.5
%PDF-1.6
%PDF-1.7

Beginning with PDF 1.4, the Version entry in the document’s catalog dictionary (located via the Root entry in the file’s trailer, as described in 7.5.5, "File Trailer"), if present, shall be used instead of the version specified in the Header.

So as per the spec the version number in the header does not necessarily correspond to the actual version. To reliably establish the version number of a PDF, the Version entry in the document catalog (reached via the trailer) should be used, if present. This means that the way PRONOM/DROID currently identifies specific PDF versions gives no guarantee whatsoever of returning the actual version!

I don't see an easy solution for this, since to read the Version entry one needs to actually parse the PDF (find the trailer, then follow Root to the catalog), which I think is way beyond what a tool like DROID is (or should be) capable of.
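To make the header-versus-catalog distinction concrete, here is a rough Python sketch. It is not what DROID or PRONOM does, and it is not a real PDF parser: it reads the header version and then naively scans the raw bytes for a `/Version` name entry. The function name `pdf_versions` and both regular expressions are made up for illustration, and the scan will miss catalogs stored in compressed object streams and may pick up a `/Version` key from an unrelated dictionary.

```python
import re

# Header as described in 7.5.2: "%PDF-" followed by a version of the form D.D.
HEADER_RE = re.compile(rb"%PDF-(\d\.\d)")
# Naive assumption: the catalog is stored uncompressed, so its Version entry
# shows up in the raw bytes as "/Version /1.N".
CATALOG_VERSION_RE = re.compile(rb"/Version\s*/(\d\.\d)")


def pdf_versions(data: bytes):
    """Return (header_version, catalog_version); either may be None."""
    header = HEADER_RE.match(data)  # only looks at the very start of the file
    # After incremental updates several catalogs may be present;
    # the one written last is the one a reader would end up using.
    catalog = CATALOG_VERSION_RE.findall(data)
    return (
        header.group(1).decode("ascii") if header else None,
        catalog[-1].decode("ascii") if catalog else None,
    )


# Schematic fragment, not a complete PDF.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Version /1.6 /Pages 2 0 R >>\nendobj\n"
)
header_version, catalog_version = pdf_versions(sample)
```

For the fragment above the two values differ, which is exactly the case where a header-only signature reports the wrong version.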

As for the faulty "PDF 1.8" file: even though the version number in the header is beyond the range that is allowed by the PDF spec, it's still a bit worrying that it isn't detected as PDF at all! A possible solution would be to define a generic PDF entry + corresponding signature, where the first byte sequence omits the character that is used for the version number. E.g.:

25 50 44 46 2D 31 2E

This would then need to be given lower priority than the more specific PDF PUIDs (for what they're worth, see above comments).
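The idea of a generic fallback with lower priority can be sketched in a few lines of Python. This is only an illustration: the `SIGNATURES` table, the priority numbers and the `identify` function are hypothetical and do not reflect how PRONOM signatures or DROID's matching engine are actually defined.

```python
# The generic byte sequence proposed above: 25 50 44 46 2D 31 2E == b"%PDF-1."
GENERIC_PREFIX = bytes.fromhex("25 50 44 46 2D 31 2E")

# Hypothetical signature table: (label, prefix bytes, priority).
# A lower priority number wins when several signatures match.
SIGNATURES = [
    (f"PDF 1.{n}", GENERIC_PREFIX + str(n).encode("ascii"), 0)
    for n in range(8)  # the header versions 1.0 to 1.7 listed in the spec
]
SIGNATURES.append(("PDF (generic)", GENERIC_PREFIX, 1))


def identify(data: bytes):
    """Return the label of the best-priority matching signature, or None."""
    matches = [
        (priority, label)
        for label, prefix, priority in SIGNATURES
        if data.startswith(prefix)
    ]
    return min(matches)[1] if matches else None
```

With this table a file starting with `%PDF-1.7` still gets the specific label, while a `%PDF-1.8` file falls through to the generic entry instead of going unidentified.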

As I cannot share the original PDF, I created a file that replicates the problem, see attachment:

gmcgath commented 8 years ago

The header isn't deprecated. In fact, it's required. The spec allows the document catalog dictionary to override the version number. The header still can specify the version number, and it would be a strange case that disagreed with the dictionary. A strange case, but a permitted one.

So we still have the problem of identifying the version number reliably without parsing a dictionary.

The spec enumerates the values "%PDF-1.0" through "%PDF-1.7" as the ones a conforming reader should accept. By a strict reading, "%PDF-1.8" isn't supposed to be recognized as PDF. That would make the code reject any future versions of PDF with new numbers, though, so that kind of strictness is probably a bad idea.

I just checked the JHOVE source code, and it looks just for "%PDF-1.".

Gary McGath, Freelance Writer and Software Developer http://www.garymcgath.com

bitsgalore commented 8 years ago

Perhaps "deprecated" isn't the correct word here (as it's still required), but the fact remains that on its own the value in the header cannot be relied upon to reflect the true version of a PDF.

The header still can specify the version number, and it would be a strange case that disagreed with the dictionary.

The most obvious case I can think of is a PDF that was incrementally updated. E.g. it is possible that a PDF started its life as PDF 1.5, and was then updated in a more recent version of Acrobat to 1.6. An incremental update appends to the file and leaves the original header untouched, which was also the reason for the change starting with PDF 1.4. There's a pretty good explanation of incremental updates here:

https://blog.didierstevens.com/2008/05/07/solving-a-little-pdf-puzzle/
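As a rough illustration of the incremental-update case, the Python sketch below builds two schematic byte strings (not valid PDFs: cross-reference tables are left out and the offsets are invented) and counts end-of-file markers. Counting `%%EOF` is only a heuristic, and `count_revisions` is a made-up helper name; linearized files, for instance, can also contain more than one marker.

```python
def count_revisions(data: bytes) -> int:
    """Count %%EOF markers; each saved revision normally ends with one."""
    return data.count(b"%%EOF")


# Revision 1: written as PDF 1.5, no Version entry in the catalog.
original = (
    b"%PDF-1.5\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"trailer\n<< /Root 1 0 R >>\nstartxref\n9\n%%EOF\n"
)

# Revision 2: appended later; it redefines the catalog with /Version /1.6.
update = (
    b"1 0 obj\n<< /Type /Catalog /Version /1.6 /Pages 2 0 R >>\nendobj\n"
    b"trailer\n<< /Root 1 0 R /Prev 9 >>\nstartxref\n90\n%%EOF\n"
)

# An incremental update only appends, so the first bytes never change.
updated = original + update
```

After the update the file still begins with `%PDF-1.5`, even though the catalog written last declares 1.6.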

I suppose such files are pretty rare in most archive/library settings, but I've never seen any data on this. A more serious side effect might be that some software vendors may simply not bother to update the version that is written to the header to the actual version, since the spec says anything goes as long as it is in the 0-7 range ...

I just checked the JHOVE source code, and it looks just for "%PDF-1.".

Yep, that looks like the most sensible option to me as well.

Dclipsham commented 8 years ago

Thanks both,

We missed the deadline for this signature release (out today), but we're happy to add a generic entry in the next release (probably late November) and to consider how to go about adjusting what PRONOM assumes for ID.

David