digital-preservation / pronom-research-week

A persistent repository for PRONOM Research Week activities

Mathematica Notebook format #1

Open gleporeNARA opened 4 years ago

gleporeNARA commented 4 years ago

Looking for additional real-world examples from earlier versions of Mathematica (say, versions 1 and 2, if they existed). Also wondering about the resource cost of performing an identification with a signature that has several long text strings to search for.

Format name: Mathematica Notebook files
Version number(s): all?
PRONOM fmt/201 - No current signatures on file - http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=926&strPageToDisplay=summary
Extensions: nb
mime-type: text/plain; charset=us-ascii
Description: "Wolfram Mathematica (usually termed Mathematica) is a modern technical computing system spanning most areas of technical computing — including neural networks, machine learning, image processing, geometry, data science, visualizations, and others. The system is used in many technical, scientific, engineering, mathematical, and computing fields. It was conceived by Stephen Wolfram and is developed by Wolfram Research of Champaign, Illinois. The Wolfram Language is the programming language used in Mathematica."
Format type: Text (Structured)
Vendor: Wolfram Research

The signature from the 'file' command is:

# .nb files
#too long 0 string  (***********************************************************************\n\n\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ Mathematica-Compatible Notebook Mathematica 3.0 notebook
0   string  (***********************    Mathematica 3.0 notebook

# other (* matches it is a comment start in these langs
# GRR: Too weak; also matches other languages e.g. ML
#0  string  (*  Mathematica, or Pascal, Modula-2 or 3 code text

Below is a list of common strings that appear in these files.

Content-type: application/vnd.wolfram.mathematica
Content-type: application/mathematica
Wolfram Notebook File
Mathematica-Compatible Notebook
CreatedBy='Mathematica x.x'
http://www.wolfram.com/nb

mathematica.zip

samalloing commented 4 years ago

I'm going to try this one if that's ok

gleporeNARA commented 4 years ago

Sam - absolutely! Have at it. I have a large body of files to test your signature on.

samalloing commented 4 years ago

I have created a signature file for a notebook file. The current problem I'm facing (also described above) is that there are multiple versions of Wolfram Notebook and there is only one fmt. Should I create a signature file for every version I can find? Or do I need to handle this differently? cc @Dclipsham

Thanks for any advice

Sam

gleporeNARA commented 4 years ago

Sam, I've asked David that very same question! He said (to paraphrase him) that he leaves it up to the user to decide whether the variations in the format are significant (i.e. no backwards compatibility in the software).

I think for Mathematica, a single signature for the variations would be acceptable. The plain text nature of the format mitigates future archiving issues.

samalloing commented 4 years ago

Thanks for your response. I think I'm going to try to make different signatures then, because there are no tools (that I'm aware of) that say which version a file is, so it may be useful to know this. But if you (or anybody else) have objections to this, I'll create one for all versions.

Sam

samalloing commented 4 years ago

I've been thinking further on this and I think I will make a signature file for every version of Notebook I have and create a fallback for Wolfram Notebooks in general so that all Notebooks will be identified. Most will have versions and some will have the general ID. So we don't need to deprecate the current Wolfram Notebook PUID and if more specific PUIDs are available these are assigned. How does this sound?
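The specific-then-fallback idea can be sketched roughly as follows. This is an illustration in Python, not the PRONOM/DROID signature itself: the function name, the labels it returns, and the 4096-byte search window are assumptions made here, while the `(*` opening and the `CreatedBy='Mathematica x.x'` marker come from the strings listed earlier in this issue.

```python
import re
from typing import Optional

# Marker taken from the list of common strings in this issue.
CREATED_BY = re.compile(rb"CreatedBy='Mathematica (\d+\.\d+)'")

# Assumption: the marker sits near the top of the file, so only a
# bounded window is searched.
SEARCH_WINDOW = 4096


def classify_notebook(data: bytes) -> Optional[str]:
    """Return a version-specific label, a generic label, or None."""
    if not data.startswith(b"(*"):
        return None  # not even a candidate
    match = CREATED_BY.search(data[:SEARCH_WINDOW])
    if match:
        # Specific identification wins when a version marker is present.
        return "Mathematica Notebook " + match.group(1).decode("ascii")
    # Fallback: candidate notebook with no recognisable version marker.
    return "Mathematica Notebook (generic)"
```

A file with a recognisable marker gets the specific label; anything else that opens with `(*` falls through to the generic one, which is why the generic test needs to be strict enough on its own.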

gleporeNARA commented 4 years ago

I like the idea of a catch-all signature (the existing one) and then individual ones. There must be some format changes from the earlier version to the later versions (after Wolfram bought Mathematica).

I can test your signatures when they are ready.

samalloing commented 4 years ago

Hi @gleporeNARA

I have a test signature for all the Mathematica notebook files I have and for those you included in this issue. The all.zip file contains the signature files for 6.0, 7.0, 8.0, 9.0, 10.0, 10.1, 10.2, 10.3, 10.4, 11.0, 11.1, 11.2, 11.3 and 12.0, plus a catch-all for unknown versions. all.zip

cc @thorsted

gleporeNARA commented 4 years ago

The version-specific signatures look good; they all match up with my test files (except for some of the 4.2 versions). The generic signature is probably too brief at 2 bytes to be specific enough. It matched hundreds of non-Mathematica files in my test collections. See attached for a small sample. It mostly looks like Pascal code, but there are other formats that come up positive as well. I would suggest the generic signature should also include the word Mathematica somewhere in the first 200 or so characters, and perhaps a few more asterisks.
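A minimal sketch of that stricter generic test, assuming the check is "starts with `(*` and contains the word Mathematica within the first 200 bytes". The function name and the exact window size are illustrative, and the suggested extra asterisks are left out.

```python
def looks_like_generic_notebook(data: bytes, window: int = 200) -> bool:
    """Heuristic: comment opener at offset 0 plus 'Mathematica' near the top."""
    if not data.startswith(b"(*"):
        return False
    # Only the leading window is inspected, which keeps the search cheap.
    return b"Mathematica" in data[:window]
```

Pascal or ML source that opens with a `(*` comment is rejected unless it happens to mention Mathematica in its first 200 bytes, so this narrows the false positives without eliminating them.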

The others that aren't matching a specific signature all have the string "Mathematica-Compatible Notebook" in addition to the '(*' string. Perhaps a separate signature for that would be useful. There's obviously some program out there that outputs its Mathematica files with that string.

Thanks for working on this!

False Positives 2.zip

samalloing commented 4 years ago

Hi @gleporeNARA,

Can you also provide the examples of Mathematica 4.2 files that fail the match?

Thanks!

gleporeNARA commented 4 years ago

The four files with the names beginning with Math42 in the original zip file I uploaded. It's weird, because it looks like they should match the hex values 4372656174656442793d274d617468656d617469636120342e3227

Can you verify?
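For reference, the hex sequence above can be decoded to check which string it represents. This is a quick Python check using only the standard library:

```python
# Hex sequence quoted in the comment above.
hex_signature = "4372656174656442793d274d617468656d617469636120342e3227"

# Decode the byte sequence as ASCII text.
decoded = bytes.fromhex(hex_signature).decode("ascii")
print(decoded)  # CreatedBy='Mathematica 4.2'
```

The decoded text has the same form as the `CreatedBy='Mathematica x.x'` marker listed among the common strings.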

samalloing commented 4 years ago

Fixed it, just a copy-paste error. all.zip

samalloing commented 4 years ago

I'll look into the false positives

samalloing commented 4 years ago

Regarding the false positives, I think I found a solution. If the end of the file also matches 290a 0d0a 0d0a, 290a 0a0a or 290a, then it is a Mathematica file. This matches 100% of the Mathematica files and none of the false positives.

Now I'm trying to create a signature for that, but it is more challenging than I thought :-)
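The begin-plus-end check described here can be sketched outside the signature syntax as follows. The three trailer byte sequences are the ones quoted in the comment; the function name is hypothetical, and holding the whole file in memory is a simplification, not how a DROID signature evaluates offsets.

```python
# End-of-file byte sequences quoted in the comment above.
TRAILERS = (
    bytes.fromhex("290a0d0a0d0a"),  # ")" LF CR LF CR LF
    bytes.fromhex("290a0a0a"),      # ")" LF LF LF
    bytes.fromhex("290a"),          # ")" LF
)


def has_notebook_envelope(data: bytes) -> bool:
    """True if data opens with '(*' and ends with one of the trailers."""
    # bytes.endswith accepts a tuple and matches if any suffix matches.
    return data.startswith(b"(*") and data.endswith(TRAILERS)
```

All three trailers must be listed separately, because a file ending in `290a 0a0a` does not end in `290a`. Source code that opens with a `(*` comment but ends in something like `end.` followed by a newline fails the trailer test.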