Open thorsted opened 3 years ago
I think it's worth at least some time to identify actual differences, particularly between versions that Publisher itself seems to differentiate.
I believe I have found a common pattern between versions. I am not sure what the hex values refer to, but they are consistent among all the samples I have been able to gather.
The current PRONOM signatures for Microsoft Publisher files check the "CompObj" stream within the OLE2 container. The three signatures look for the patterns:
I assume these are major versions of the Publisher format. Everything since Publisher 97 uses the "MSPublisher.3" format.
Within the "Contents" file at the root of the OLE2 container I have found the following pattern: all samples seem to have "E8AC2200" at offset 0, or "E8AC2C00" for files from 2003 onward.
At Offset 12:
Files from Publisher 2013 onward have the same pattern as the 365 format.
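For anyone who wants to try the offset 0 check outside of a PRONOM signature, here is a minimal Python sketch. It only covers the two leading byte patterns quoted above; the offset 12 values are not included. The function names and the "pre-2003" / "2003+" labels are my own placeholders, not PRONOM terms, and reading the stream assumes the third-party `olefile` package is installed.

```python
from typing import Optional

# Leading bytes reported in this thread for the root-level "Contents" stream.
HEADER_EARLIER = bytes.fromhex("E8AC2200")
HEADER_2003_PLUS = bytes.fromhex("E8AC2C00")


def classify_contents_header(data: bytes) -> Optional[str]:
    """Return a coarse label from the first four bytes, or None if unknown.

    The labels are hypothetical names used only for this sketch.
    """
    head = data[:4]
    if head == HEADER_EARLIER:
        return "pre-2003"
    if head == HEADER_2003_PLUS:
        return "2003+"
    return None


def read_contents_stream(path: str) -> bytes:
    """Read the "Contents" stream from an OLE2 file (assumes olefile is installed)."""
    import olefile  # third-party; imported lazily so the classifier works without it

    ole = olefile.OleFileIO(path)
    try:
        return ole.openstream("Contents").read()
    finally:
        ole.close()
```

Usage would be something like `classify_contents_header(read_contents_stream("sample.pub"))`, where `sample.pub` stands in for any of the test files. A result of `None` means the header matched neither pattern.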
Looking for validation on this method of identification.
Feel free to test out this new signature pattern here: https://github.com/thorsted/pronom-research-week/tree/master/MicrosoftPublisher
Three of the formats in PRONOM which are missing signatures are:
x-fmt/254 | Microsoft Publisher | 97 | 0 | pub
x-fmt/255 | Microsoft Publisher | 98 | 0 | pub
x-fmt/256 | Microsoft Publisher | 2000 | 0 | pub
Version 97, 98 and 2000 files all match the signature of x-fmt/257 (the 2002 version), since they all contain the string "MSPublisher.3".
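To reproduce that observation on a set of samples, a rough Python sketch along these lines could be used. It assumes the string is stored as plain ASCII somewhere in the "CompObj" stream (named `\x01CompObj` inside the container) and does a simple substring search rather than matching at a fixed offset the way a PRONOM signature would. The helper names are hypothetical, and reading the stream assumes the third-party `olefile` package.

```python
PROGID = b"MSPublisher.3"


def compobj_has_progid(compobj: bytes, progid: bytes = PROGID) -> bool:
    """True if the given bytes appear anywhere in the CompObj stream data."""
    return progid in compobj


def read_compobj_stream(path: str) -> bytes:
    """Read the "\\x01CompObj" stream from an OLE2 file (assumes olefile is installed)."""
    import olefile  # third-party; imported lazily so the check above works without it

    ole = olefile.OleFileIO(path)
    try:
        return ole.openstream("\x01CompObj").read()
    finally:
        ole.close()
```

Running `compobj_has_progid(read_compobj_stream(path))` over 97, 98, 2000 and 2002 samples should return `True` for all of them if the observation holds, which is exactly why this string alone cannot separate those versions.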
Within the Microsoft Publisher software, when a file is saved, you have the option to save into the Publisher 2000 or Publisher 98 formats, which seems to indicate a difference. I will upload samples to GitHub for reference.
Should we explore further to distinguish the individual versions, or should we redefine x-fmt/257 as "97 onward"?