digital-preservation / pronom-research-week

A persistent repository for PRONOM Research Week activities

WordPerfect 4.2 (fmt/949) signature + sample file #10

Open bitsgalore opened 3 years ago

bitsgalore commented 3 years ago

The format database of the TrID tool has a signature for WordPerfect 4.2 files:

https://file-extension.net/seeker/file_extension_wp

Here's the signature:

0xCB 0A 01

Signature author is Philip Storry. License appears to be AGPL, based on what I found here.

A few years ago I submitted a derived version of the sig to Apache Tika, see commit here.

This adds a 0xCB byte at offset 5. I don't remember why I did that (possibly to avoid a collision with another format?) or what exactly it was based on, so proceed with caution!
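A minimal sketch of the two checks described above, for anyone who wants to try them against sample files. It assumes the TrID bytes are anchored at the very start of the file and that "offset 5" is zero-based; the function names are made up for illustration and are not taken from TrID or Tika.

```python
# TrID signature bytes as quoted in this comment (assumed to sit at offset 0).
TRID_PREFIX = bytes([0xCB, 0x0A, 0x01])


def matches_trid_signature(data: bytes) -> bool:
    """True if the data starts with CB 0A 01."""
    return data.startswith(TRID_PREFIX)


def matches_tika_variant(data: bytes) -> bool:
    """True if the TrID prefix matches and byte 5 (zero-based, assumed) is 0xCB."""
    return matches_trid_signature(data) and len(data) > 5 and data[5] == 0xCB


if __name__ == "__main__":
    sample = bytes([0xCB, 0x0A, 0x01, 0x0A, 0x01, 0xCB]) + b"Some text"
    print(matches_trid_signature(sample), matches_tika_variant(sample))
```

A file that contains no formatting code at its start fails both checks, even if it was written by WordPerfect 4.2.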

Test file here (I created this with WordPerfect 6.1 for Windows, running in a VM with Windows 3.11):

https://github.com/bitsgalore/tika/raw/8c7c760319b85cfa87c1a8dc3f7cf64278df8710/tika-parsers/src/test/resources/test-documents/testWordPerfect_42.doc

Note that I'm not 100% sure that WordPerfect 4.0 and 4.1 (which are both under the same PUID) have the same signature!

thorsted commented 3 years ago

Johan,

I can confirm the pattern from samples saved from later versions of WordPerfect as well, but I worry it does not represent all files from that time period.

From the many samples I have found from install disks and donations, there is no discernible pattern to the WP 4 format. It is simply ASCII with format codes.

Here is a link to the WP 4 File Format Specification: https://archive.org/download/wordperfectsdkperfectfit1994/WordPerfect_SDK_PerfectFit1994.iso/51PCSDK%2FWP42FF.TXT

bitsgalore commented 3 years ago

Hi Tyler,

Your response got me curious, so I located and installed a copy of WordPerfect 4.2 for DOS, and did some tests. First I created a document with some text, without applying any formatting, and saved that to a file. Here's what it looks like in WP 4.2 (with the reveal codes window at the bottom):

[Image: wp42]

I then opened it in a hex editor:

[Image: wp42-hex]

The file is indeed pure ASCII. Then I went back to WordPerfect and added a font definition (using the font dialog that opens after pressing Ctrl F8). I then saved the result as a separate file. Here again is what this looks like in WordPerfect (note the Font Change code in the reveal window):

[Image: wp42-fc]

And here's that file in a hex editor:

[Image: wp42-fc-hex]

The file is identical to the earlier one, except for the addition of these 6 bytes at the start:

CB 0A 01 0A 01 CB

In the WP 4 spec you linked to, you can see that this corresponds to a "set pitch and/or font" instruction (the number 313 is the octal representation of 0xCB):

6     313 cb   Set pitch and/or font
          <313><old pitch><old font><new pitch><new font><313>
          If pitch is negative, then it is proportional.

I imagine this may be a pretty common pattern, but I agree this isn't suitable as a signature for identifying WordPerfect 4 files. So yes, you're completely right!
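To make the layout of that instruction concrete, here is a rough sketch that decodes the six bytes according to the spec excerpt quoted above. The spec line only says that a negative pitch means proportional, so reading the pitch bytes as signed and the font bytes as unsigned is an assumption on my part, as are all the names used here.

```python
import struct
from typing import NamedTuple

SET_PITCH_FONT = 0o313  # octal 313 == 0xCB, per the spec excerpt


class PitchFontChange(NamedTuple):
    old_pitch: int
    old_font: int
    new_pitch: int
    new_font: int
    proportional: bool


def parse_pitch_font(code: bytes) -> PitchFontChange:
    """Decode <313><old pitch><old font><new pitch><new font><313>."""
    if len(code) != 6 or code[0] != SET_PITCH_FONT or code[5] != SET_PITCH_FONT:
        raise ValueError("not a 6-byte set pitch/font code")
    # Assumption: pitch is a signed byte (it can be negative), font is unsigned.
    old_pitch, old_font, new_pitch, new_font = struct.unpack("<bBbB", code[1:5])
    return PitchFontChange(old_pitch, old_font, new_pitch, new_font, new_pitch < 0)


if __name__ == "__main__":
    print(parse_pitch_font(bytes.fromhex("CB0A010A01CB")))
```

Under these assumptions, the bytes from the test file decode to pitch 10 and font 1 for both the old and new settings, with a non-proportional pitch.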

I've uploaded both WordPerfect 4.2 test files here:

wp4-test.zip

thorsted commented 3 years ago

I suppose the font instruction is a common code, especially since later GUI versions of WordPerfect probably set it automatically. But we probably won't see too many files saved down from later versions in the wild.

For a pure ASCII file, there is nothing to distinguish it from regular plain text, so a txt identification is probably appropriate.

If we know all the possible formatting codes, we should be able to identify a WP4 file with the right tool.
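As a rough illustration of what such a tool could look like, the sketch below walks through a byte string and looks for well-formed occurrences of known multi-byte codes. The lookup table is hypothetical: only the 0xCB entry (6 bytes, closing with the same byte it opens with) comes from the spec excerpt in this thread, and the rest would have to be filled in from WP42FF.TXT. Counting every other byte of 0x80 or above as "unrecognised" is also an assumption.

```python
from typing import Dict, List, Tuple

# Total length in bytes of each known multi-byte code, keyed by its opening byte.
# Hypothetical table: only 0xCB (set pitch and/or font) is taken from this thread.
KNOWN_CODE_LENGTHS: Dict[int, int] = {0xCB: 6}


def find_known_codes(data: bytes) -> Tuple[List[int], int]:
    """Return (offsets of well-formed known codes, count of other high-bit bytes)."""
    offsets: List[int] = []
    unrecognised = 0
    i = 0
    while i < len(data):
        byte = data[i]
        length = KNOWN_CODE_LENGTHS.get(byte)
        # Assumption: a code is well formed if it fits in the data and its
        # last byte repeats the opening byte, as in <313>...<313>.
        if length is not None and i + length <= len(data) and data[i + length - 1] == byte:
            offsets.append(i)
            i += length
        else:
            if byte >= 0x80:
                unrecognised += 1
            i += 1
    return offsets, unrecognised


if __name__ == "__main__":
    sample = bytes.fromhex("CB0A010A01CB") + b"Some text"
    print(find_known_codes(sample))
```

A file with at least one well-formed code and no unrecognised high bytes would be a candidate WP 4 file; a file with neither is indistinguishable from plain text by this method.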

Thanks for digging into the format. Running WP 4 for DOS is no easy feat these days; I still need a keyboard cheat sheet!

thorsted commented 3 years ago

Would it be good to add the extension .WP to fmt/949?

[Image: WordPerfect6 1-SaveAs]

emendelson commented 3 years ago

WP 4.2 keyboard cheat sheet (F3, F3):

[Image: WP42]

bitsgalore commented 3 years ago

@thorsted Yes, adding the extension might be useful, although I'm not sure how commonly any of these extensions were used at the time. For example, when you save a document in WP 4.2 it doesn't enforce or even hint at a particular extension.

emendelson commented 3 years ago

@thorsted I don't think any WP user bothered to use consistent extensions until WP for Windows came along, and users were more or less obliged to use them. I started using WPDOS in 1985 and have never used a ".wp" or ".wpd" extension for any file that it creates, unless I intend to open it easily in WP for Windows.