digital-preservation / pronom-research-week

A persistent repository for PRONOM Research Week activities

fmt/483 EPUB is a stub, does not differentiate between EPUB formats #5

Open archivist-liz opened 3 years ago

archivist-liz commented 3 years ago

See Library of Congress format descriptions: EPUB family: https://www.loc.gov/preservation/digital/formats/fdd/fdd000310.shtml and EPUB 3.0.1: https://www.loc.gov/preservation/digital/formats/fdd/fdd000311.shtml

tnafrancesca commented 3 years ago

So this one (I think) would be tricky to do. The bytes giving the EPUB version number appear to be located in OPS/[name].opf within the zip container. As the .opf file has a variable name, DROID would not be able to identify it via the container signature method. I can't see a clear binary signature differentiating the versions either. Would anyone have any other solutions?

thorsted commented 3 years ago

Might be good to add use case here: https://github.com/digital-preservation/pronom/issues/10

gewappnet commented 3 years ago

EPUB is an essential format for national libraries. As there are many fundamental differences between EPUB 2 and 3 for preservation purposes, it is very important to record the version number in the technical metadata of the SIP. It is really urgent that there are different PRONOM IDs for the different versions.

Is it really necessary to update DROID first, before adding these new entries to PRONOM? We need to have the new PRONOM IDs as soon as possible.

gewappnet commented 3 years ago

The EPUB version is recorded in the root XML file (ending in .opf) within the container, as the version attribute of the package element. EPUB 2:

<package xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="bookid"

EPUB 3:

<package xmlns="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" version="3.0" xml:lang="en"
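As a minimal sketch of reading that attribute, the following uses Python's standard xml.etree.ElementTree. The function name opf_version is hypothetical, and the two sample strings are the snippets above with a closing `>` and `</package>` added purely so that they parse; a real .opf file contains much more.

```python
import xml.etree.ElementTree as ET

# Namespace of the <package> element, as shown in both snippets above.
OPF_NS = "http://www.idpf.org/2007/opf"


def opf_version(opf_xml):
    """Return the version attribute of the root <package> element, or None.

    Hypothetical helper for illustration; accepts the .opf content as
    bytes or str.
    """
    root = ET.fromstring(opf_xml)
    # ElementTree spells namespaced tags as "{namespace}localname".
    if root.tag != f"{{{OPF_NS}}}package":
        return None
    return root.get("version")


# Sample roots based on the snippets above, closed here only so they parse.
EPUB2_OPF = (
    '<package xmlns:dc="http://purl.org/dc/elements/1.1/" '
    'xmlns="http://www.idpf.org/2007/opf" version="2.0" '
    'unique-identifier="bookid"></package>'
)
EPUB3_OPF = (
    '<package xmlns="http://www.idpf.org/2007/opf" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/" '
    'xmlns:dcterms="http://purl.org/dc/terms/" version="3.0" '
    'xml:lang="en"></package>'
)

if __name__ == "__main__":
    print(opf_version(EPUB2_OPF))
    print(opf_version(EPUB3_OPF))
```

Parsing the XML rather than searching for a fixed string means the result does not depend on attribute order, which differs between the two examples above.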

Dclipsham commented 3 years ago

Hi, sorry it took a while to respond.

EPub falls into the category of what we call 'container signatures': an EPub is a zip containing certain files and folders, and we rely on certain elements being consistently present in order to identify any given format accurately with DROID.

Within a container signature, we typically look for one or more specifically named files within the zip, and then usually for specific sequences within those files.

That the .opf file contains the namespace and version details for the package is great, but the difficulty for DROID is that the .opf file itself is not consistent in terms of naming or location.

I have a relatively small sample set of 36 EPub files across a range of versions. Within this set I found the following variants for the .opf file:

version 1:
- content.opf, at the root of the zip

version 2.0:
- content.opf, within a directory called OPS
- content.opf, within a directory called OEBPS

version 3.0:
- package.opf, within a directory called EPUB
- content.opf, within a directory called EPUB
- package.opf, within a directory called OPS
- content.opf, within a directory called OPS
- content.opf, within a directory called OEBPS

No doubt different epub vendors/software follow different packaging conventions.

The correct way to determine the location and name of the .opf file is by reference to the META-INF/container.xml file at the root of the zip. Unfortunately, DROID does not currently have the capability to parse an internal file and then seek other specific files based on that information. This is what we mean when we suggest that further development would be required to deal with EPub versions, at least optimally.
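For illustration only, the lookup via META-INF/container.xml can be sketched in Python with the standard zipfile and xml.etree.ElementTree modules. This is not how DROID works; the function name find_opf_path and the in-memory demo archive are hypothetical, and the sketch only considers the first rootfile entry.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by META-INF/container.xml in the EPUB container format.
CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"


def find_opf_path(epub):
    """Return the full-path of the first rootfile in container.xml, or None.

    `epub` may be a filesystem path or a binary file-like object.
    Hypothetical helper for illustration.
    """
    with zipfile.ZipFile(epub) as zf:
        container = ET.fromstring(zf.read("META-INF/container.xml"))
    rootfile = container.find(f".//{{{CONTAINER_NS}}}rootfile")
    return None if rootfile is None else rootfile.get("full-path")


def demo_epub():
    """Build a tiny in-memory archive so the sketch runs without a real EPUB."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr(
            "META-INF/container.xml",
            '<container version="1.0" xmlns="' + CONTAINER_NS + '">'
            '<rootfiles><rootfile full-path="OEBPS/content.opf" '
            'media-type="application/oebps-package+xml"/></rootfiles>'
            "</container>",
        )
        zf.writestr("OEBPS/content.opf", "<package/>")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(find_opf_path(demo_epub()))
```

Because the path comes from the container itself, this works regardless of whether the packager chose OPS, OEBPS, EPUB or some other directory name.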

However, given the finite number of scenarios above, it is possible to create a signature set that covers them, and I attach it below. These are test DROID signature files and the contents may not represent the patterns that make it into an official PRONOM release, so please use them with caution, and definitely not within a full production environment.

epub_signatures.zip
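The logic of covering a finite set of known locations can be sketched in Python as below. This is only an illustration of the idea, not the DROID container signature syntax and not the contents of the attached files. The names CANDIDATES, identify and build_zip are hypothetical, the path lists are the 2.0 and 3.0 variants listed above, and the version 1 variant is left out because the thread does not show what its version attribute looks like.

```python
import io
import zipfile

# Candidate .opf locations per version, from the variants listed in the thread.
CANDIDATES = {
    "2.0": ["OPS/content.opf", "OEBPS/content.opf"],
    "3.0": [
        "EPUB/package.opf",
        "EPUB/content.opf",
        "OPS/package.opf",
        "OPS/content.opf",
        "OEBPS/content.opf",
    ],
}


def identify(epub):
    """Return (label, matched_path) using fixed paths plus a byte search.

    Assumes the attribute is written with double quotes, as in the
    examples in this thread; single-quoted attributes would be missed.
    """
    with zipfile.ZipFile(epub) as zf:
        names = set(zf.namelist())
        for version, paths in CANDIDATES.items():
            needle = f'version="{version}"'.encode("ascii")
            for path in paths:
                if path in names and needle in zf.read(path):
                    return f"EPUB {version}", path
    # Nothing matched: fall back to the generic identification.
    return "Epub Non-specific", None


def build_zip(files):
    """Build an in-memory zip from a {name: text} dict, for demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, text in files.items():
            zf.writestr(name, text)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    sample = build_zip({
        "mimetype": "application/epub+zip",
        "EPUB/package.opf":
            '<package xmlns="http://www.idpf.org/2007/opf" version="3.0"/>',
    })
    print(identify(sample))
```

A file whose .opf sits at any path outside the table falls through to the non-specific result, which is the long-tail behaviour to expect from a fixed list of locations.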

Anything identified as Epub Non-specific is a file that doesn't conform to the patterns above and has been identified via the existing, generic mechanism. Anybody is welcome to suggest further variants below.

If this approach is useful, then I am happy to include it in the next official PRONOM release. I suspect that there will be a long tail of variants that do not conform to these patterns, but if we include what we can as we encounter them then this may go some way to meeting the need you describe in the near term.

gewappnet commented 3 years ago

Test_Droid_epubonly.xlsx

Thanks for the test version. We ran it against several of our EPUBs and put the results in this Excel sheet. We also used the W3C tool epubcheck (https://github.com/w3c/epubcheck) as a reference in the tests. This tool is able to determine the version of every EPUB, so it might be helpful to look at how it extracts it.

Is this helpful for the DROID development? Should we further investigate our files?