digital-preservation / pronom-research-week

A persistent repository for PRONOM Research Week activities

Microsoft Publisher signature issues #11

Open thorsted opened 3 years ago

thorsted commented 3 years ago

Three of the formats in PRONOM which are missing signatures are:

x-fmt/254 | Microsoft Publisher | 97 | 0 | pub
x-fmt/255 | Microsoft Publisher | 98 | 0 | pub
x-fmt/256 | Microsoft Publisher | 2000 | 0 | pub

Files from versions 97, 98 and 2000 match the signature of x-fmt/257 (the 2002 version): they all contain the string "MSPublisher.3".

Within the Microsoft Publisher software, when a file is saved, you have the option to save into the Publisher 2000 or Publisher 98 formats, which suggests that the formats differ. I will upload samples to GitHub for reference.

Should we investigate further to distinguish the individual versions, or should we broaden x-fmt/257 to cover "97 onward"?

jackdos commented 3 years ago

I think it's worth at least some time to identify actual differences, particularly between versions that Publisher itself seems to differentiate.

thorsted commented 3 years ago

I believe I have found a common pattern between versions. I am not sure what the hex values refer to, but they are consistent among all the samples I have been able to gather.

The current PRONOM signatures for Microsoft Publisher files check the "CompObj" file within the OLE2 container. The three signatures look for the patterns:

I assume these are major versions of the Publisher format. Everything since Publisher 97 uses the "MSPublisher.3" format.
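To make the CompObj check concrete, here is a minimal sketch of looking for the "MSPublisher.3" marker. It assumes the marker is stored as plain ASCII bytes somewhere in the stream, and that the raw bytes of the "CompObj" file have already been extracted from the OLE2 container. The function name is hypothetical, and this is not how DROID evaluates PRONOM signatures.

```python
# Marker string reported in this thread.
# Assumption: it appears as plain ASCII bytes inside the CompObj data.
PUBLISHER3_MARKER = b"MSPublisher.3"


def has_publisher3_marker(compobj_bytes: bytes) -> bool:
    """Return True if the raw CompObj bytes contain the marker string.

    This is a plain substring search over the whole stream; it does not
    parse the CompObj structure or anchor the match to an offset.
    """
    return PUBLISHER3_MARKER in compobj_bytes
```

A match only says the file belongs to the "MSPublisher.3" family. It cannot separate 97, 98, 2000 and 2002, which is why a second check is needed.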

Within the "Contents" file at the root of the OLE2 container I have found the following pattern: all samples seem to start with "E8AC2200" at offset 0, or with "E8AC2C00" for 2003 and later.
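The offset 0 check can be sketched as follows. This is an illustration only: the function names and group labels are made up, and the only byte values used are the two reported above. Reading the stream relies on the third-party olefile package, imported lazily so that the classifier itself has no dependencies.

```python
from typing import Optional

# The two 4-byte values observed at offset 0 of the "Contents" file.
# The labels are descriptive placeholders, not PRONOM identifiers.
CONTENTS_MAGIC = {
    bytes.fromhex("E8AC2200"): "E8AC2200 (pre-2003 samples)",
    bytes.fromhex("E8AC2C00"): "E8AC2C00 (2003 and later samples)",
}


def classify_contents_header(data: bytes) -> Optional[str]:
    """Map the first four bytes of the Contents data to a group label.

    Returns None when the bytes match neither observed value, or when
    fewer than four bytes are supplied.
    """
    return CONTENTS_MAGIC.get(data[:4])


def read_contents_header(path: str) -> Optional[bytes]:
    """Read the first four bytes of the root-level "Contents" stream.

    Requires the third-party olefile package. Returns None if the
    container has no "Contents" stream.
    """
    import olefile  # lazy import: only needed when reading real files

    ole = olefile.OleFileIO(path)
    try:
        if not ole.exists("Contents"):
            return None
        return ole.openstream("Contents").read(4)
    finally:
        ole.close()
```

For a sample file, `classify_contents_header(read_contents_header(path) or b"")` gives the group label, or None if the header is not one of the two values seen so far.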

At Offset 12:

Publisher 2013 and later versions have the same pattern as the 365 format.

Looking for validation on this method of identification.

thorsted commented 3 years ago

Feel free to test out this new signature pattern here: https://github.com/thorsted/pronom-research-week/tree/master/MicrosoftPublisher