digital-preservation / pronom-research-week

A persistent repository for PRONOM Research Week activities

PDF Portfolio 1.7 Files #12

Open jackdos opened 3 years ago

jackdos commented 3 years ago

See https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf (12.3.5) for the following:

"Beginning with PDF 1.7, PDF documents may specify how a conforming reader's user interface presents collections of file attachments, where the attachments are related in structure or content. Such a presentation is called a portable collection.

The intent of portable collections is to present, sort and search collections of related documents embedded in the containing PDF document, such as email archives, photo collections, and engineering bid sets. There is no requirement that documents in a collection have an implicit relationship or even a similarity; however, showing differentiating characteristics of related documents can be helpful for document navigation."

The effect of this feature is to use a file structured as a PDF as a ZIP-like container of other content. The intent as expressed in the spec implies that this usage should be treated as distinct from a "standard" PDF file, because the aim is not to present a single coherent document but to contain a series of related files. This makes files using this feature much closer to traditional containers such as ZIP, TAR and WARC than to traditional documents such as Word and PDF.

Given this significant change in intent, any software consuming these files has a different purpose, and may have to act in very different ways, from software consuming "standard" PDF documents. Converting the document to HTML markup in the way PDF.js does, for example, will not be sufficient to render the file in line with the intention of the creator. For this reason, treating it as a separate format and assigning it a separate PUID makes sense, so that separate Preservation Actions can be taken on files of this type.

The reason for naming the format "PDF Portfolio", rather than anything to do with "collections", is that "Portfolio" is the term Adobe Acrobat uses for files using this feature.

I have a candidate signature and examples for the 1.7 version. I'm assuming a similar signature will work for 2.0 as well, although I haven't got any examples to test that with.

thorsted commented 3 years ago

Jack,

Do you suggest a simple signature for identification as a PDF Portfolio, or should we discuss adding it as an Archive/Container format so the contents may get some identification as well?

jackdos commented 3 years ago

I think a simple signature is a priority so that we can at least recognise these different files. I can see the use of marking it as an archive/container format, but I suspect that getting anything useful from that would require a decent amount of coding work. Adding an issue for that might be worthwhile, though I'm not sure I'd get time to look at it in this window!

thorsted commented 3 years ago

Jack,

Having a little difficulty getting your signature to work for me. I am not seeing any offsets for /Collection or the other strings. Also, the mimetype is set to json and the format priority should be set to 289 instead of the PUID fmt/276.

I like the use of the Collection dictionary; it should be a good identifier.

asciim0 commented 3 years ago

I kind of disagree with a specific PUID for portfolio files. My reasons:

I believe that the fact that a PDF can be viewed as a portfolio is something that should be determined at characteristic or risk level, so something that e.g. JHOVE should pick up and not the file format identification chain. If portfolios receive their own PUID, the same argument could be made for PDFs containing attachments, PDFs containing embedded AV streams, etc.
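To make the characteristic-level alternative concrete, here is a rough sketch in Python. It does not reflect how JHOVE or any PRONOM tool actually works: the function name and output shape are hypothetical, fmt/276 is assumed to be the plain PDF 1.7 PUID, and the naive byte search would miss keys held in compressed object streams.

```python
# Sketch only: one PUID for every PDF 1.7 file, with portfolio and
# attachment use reported as features rather than as separate formats.
BASE_PUID = "fmt/276"  # assumption: the plain PDF 1.7 PUID mentioned in this thread

# Hypothetical feature table: feature name -> PDF key searched for.
FEATURE_KEYS = {
    "portfolio": b"/Collection",
    "attachments": b"/EmbeddedFiles",
}


def characterise(data: bytes) -> dict:
    """Return the base PUID plus a flag per feature found anywhere in data."""
    features = {name: key in data for name, key in FEATURE_KEYS.items()}
    return {"puid": BASE_PUID, "features": features}
```

Under this shape, new features (embedded AV streams, for instance) would become extra flags rather than extra PUIDs.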

marhop commented 3 years ago

Good point!

"If portfolios receive their own PUID, the same argument could be made for PDFs containing attachments, PDFs containing embedded AV streams, etc."

... and for generic WAVE files, WAVE files with PCM encoding, WAVE files with the WAVE_FORMAT_EXTENSIBLE extension, WAVE files with the Broadcast WAVE extension (further split into generic, PCM and MPEG encodings), WAVE files with Exif metadata ... SCNR ;-)

It's hard to draw a line between format identification and characterization (or whatever you'd like to call it) and I really wish PRONOM had clear criteria for that.

PS: To be fair, both Broadcast WAVE and Exif audio are commonly advertised as file formats of their own. Personally, I think they are just extended (mainly added metadata) WAVE files, so I added them to the above list.

PPS: Sorry for hijacking this thread with generic rants! Now, back to business.

asciim0 commented 3 years ago

good point - but i think the connection between WAVE / BWF / AV-container of your choice vs. PDF is different. A/V containers have to have the payload in a specific encoding by design. it's an expected behavior of the format in every case. PDF attachments or portfolios are optional as per the standard. it's a feature, not a bug ;-D

P.S. and, again, we might need portfolio PUIDs for every sub-profile of PDF, depending on whether they allow portfolios by design. so extra PUIDs for every PDF/A, PDF/UA, PDF/X, PDF/VT, etc. that is presented as a portfolio, in addition to the PUIDs already in existence.

P.P.S. I would hope for smart PDF readers to turn the portfolio rendering off anyway and just present them as attachments, as it's a pain to load, even in Acrobat.

jackdos commented 3 years ago

@thorsted - good spot on the mime type. I recycled files from previous signature work, oops; will fix and update the PR. There are deliberately no offsets for the /Collection or <</CI<< strings, as the structure of PDF means that they can be almost anywhere in the file (with the exceptions of the protected BOF and EOF areas, which the overall PDF byte sequences would protect against anyway). I used the PUID rather than the ID specifically because PUIDs are persistent, whereas I'm not sure how constant those IDs actually are.
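As a rough illustration of the unanchored matching described above, here is a minimal Python sketch. It is not the PRONOM signature itself: the function name, the 1024-byte windows and the exact header and trailer bytes are assumptions made for illustration; only the /Collection and <</CI<< strings come from this thread.

```python
HEADER = b"%PDF-1.7"   # assumed BOF marker for a PDF 1.7 file
TRAILER = b"%%EOF"     # assumed EOF marker
MARKERS = (b"/Collection", b"<</CI<<")  # strings discussed in this thread


def looks_like_portfolio(data: bytes, window: int = 1024) -> bool:
    """Return True if data has the PDF BOF/EOF markers and both portfolio
    strings at any offset. The window size is an arbitrary choice."""
    if HEADER not in data[:window]:
        return False
    if TRAILER not in data[-window:]:
        return False
    # No offsets: the strings may sit almost anywhere in the file body.
    return all(marker in data for marker in MARKERS)
```

The BOF and EOF checks stand in for the overall PDF byte sequences; the two marker strings are searched across the whole file because PDF objects have no fixed position.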

@asciim0 - I agree there is a risk of format explosion, and I can see the argument that it's not hugely different from a PDF with an attachment, or an embedded video, but I do think Collections are an inherently different use case.

PDF is generically a document. That document can have attachments, but the basic use case is still a single primary document, and some supporting material. I agree that finding those attachments and dealing with them is more about characterising them and dealing with the associated risks than treating a PDF with an embedded word doc as fundamentally a different format from a PDF with an embedded video, from a PDF with no other content.

Portfolios change the nature of the file from essentially a document to essentially a container. There is no "primary document" with attachments hanging off it; there is just a set of equally primary files. This changes the entire purpose of the file, which to me makes it a different enough format to merit its own entry in PRONOM.

thorsted commented 2 years ago

@jackdos Playing with the new signature released today and it looks like some of my Portfolio samples do not have the "<</CI<<" string. Is this constant in all your samples?

jackdos commented 2 years ago

Hi @thorsted, apologies, only just saw this.

All of the examples I have would have had that string. IIRC that's the specifier for a CollectionItem, and I was assuming that a collection wouldn't really exist without items. Are you seeing additional characters between your object delimiters (<<) and the label (/CI)? Or just not seeing those objects at all?
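One way to check a sample for the cases asked about here is sketched below in Python. The helper name and result keys are hypothetical, and it only inspects raw bytes: a /CI value written as an indirect reference, or one held inside a compressed object stream, would be missed by both patterns.

```python
import re

# /Collection as a whole name; the \b stops it matching /CollectionItem.
COLLECTION = re.compile(rb"/Collection\b")
# /CI with optional whitespace between the delimiters and the label.
CI_LOOSE = re.compile(rb"<<\s*/CI\s*<<")
CI_STRICT = b"<</CI<<"  # the exact string quoted in this thread


def report_markers(data: bytes) -> dict:
    """Report which portfolio markers appear in the raw bytes of a file."""
    return {
        "collection": bool(COLLECTION.search(data)),
        "ci_strict": CI_STRICT in data,
        "ci_loose": bool(CI_LOOSE.search(data)),
    }
```

A result with "ci_loose" true but "ci_strict" false would point to extra whitespace between the delimiters; "collection" true with both /CI flags false would mean the objects are not present in the raw bytes at all.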

thorsted commented 2 years ago

@jackdos Not seeing the /CI at all, only the /Collection tag. I'll do some more digging.