Open gleporeNARA opened 4 years ago
I'm going to try this one if that's ok
Sam - absolutely! Have at it. I have a large body of files to test your signature on.
I have created a signature file for a notebook file. The current problem I'm facing (also described above) is that there are multiple versions of Wolfram Notebook and there is only one fmt. Should I create a signature file for every version I can find? Or do I need to handle this differently? cc @Dclipsham
Thanks for any advice
Sam
Sam, I've asked David that very same question! He said (to paraphrase him) that he leaves it up to the user to decide whether the variations in the format are significant (i.e. no backwards compatibility in the software).
I think for Mathematica, a single signature for the variations would be acceptable. The plain text nature of the format mitigates future archiving issues.
Thanks for your response. I think I'm going to try to make different signatures then, because there are no tools (that I'm aware of) that say which version a file is, so it may be useful to know this. But if you (or anybody else) have objections to this, I'll create one for all versions.
Sam
I've been thinking further on this and I think I will make a signature file for every version of Notebook I have and create a fallback for Wolfram Notebooks in general so that all Notebooks will be identified. Most will have versions and some will have the general ID. So we don't need to deprecate the current Wolfram Notebook PUID and if more specific PUIDs are available these are assigned. How does this sound?
I like the idea of a catch-all signature (the existing one) and then individual ones. There must be some format changes from the earlier version to the later versions (after Wolfram bought Mathematica).
I can test your signatures when they are ready.
Hi @gleporeNARA
I have a test signature for all the mathematica notebook files I have and that you included in this issue. The all.zip file contains the signature file for 10.0, 10.1, 10.2, 10.3, 10.4, 11.0, 11.2, 7.0, 8.0, 9.0, 11.1, 11.3, 12.0, 6.0 and a catch all for unknown versions. all.zip
cc @thorsted
The version-specific signatures look good; they all match up with my test files (except for some of the 4.2 versions). The generic signature is probably too brief at 2 bytes to be specific enough. It matched hundreds of non-Mathematica files in my test collections. See attached for a small sample. It mostly looks like Pascal code, but there are other formats that come up positive as well. I would suggest the generic signature should also include the word Mathematica somewhere in the first 200 or so characters, and perhaps a few more asterisks.
The others that aren't matching a specific signature all have the string "Mathematica-Compatible Notebook" in addition to the '(*' string. Perhaps a separate signature for that would be useful. There's obviously some program out there that outputs its Mathematica files with that string.
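The tighter generic check suggested above can be sketched roughly as follows. This is only a Python illustration, not a PRONOM signature: the function name `looks_like_notebook`, the assumption that '(*' sits at the very start of the file, and the 200-byte window are all assumptions taken from the suggestion in this comment.

```python
def looks_like_notebook(data: bytes, window: int = 200) -> bool:
    """Heuristic: starts with '(*' and mentions 'Mathematica' early on.

    Assumes the 2-byte generic signature is anchored at offset 0 and that
    'Mathematica' appears within the first `window` bytes (both assumptions).
    """
    if not data.startswith(b"(*"):
        return False
    return b"Mathematica" in data[:window]


# Pascal also opens comments with '(*', which is why 2 bytes alone misfire.
pascal = b"(* Simple Pascal program *)\nprogram Hello;\nbegin\nend.\n"
notebook = b"(*** Mathematica-Compatible Notebook ***)\n"

print(looks_like_notebook(pascal))    # False
print(looks_like_notebook(notebook))  # True
```

A check like this only narrows the generic match; any other text format that starts with '(*' and mentions Mathematica near the top would still match.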
Thanks for working on this!
False Positives 2.zip
Hi @gleporeNARA,
Can you also provide the examples of Mathematica 4.2 files that fail the match?
Thanks!
The four files with the names beginning with Math42 in the original zip file I uploaded. It's weird, because it looks like they should match the hex values 4372656174656442793d274d617468656d617469636120342e3227
Can you verify?
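For anyone checking the hex above by hand, a short Python snippet can decode it and rebuild the same byte sequence for other version numbers. The helper name `created_by_hex` is hypothetical and exists only for illustration.

```python
hex_sig = "4372656174656442793d274d617468656d617469636120342e3227"

# Decode the hex sequence from the signature back into readable text.
decoded = bytes.fromhex(hex_sig).decode("ascii")
print(decoded)  # CreatedBy='Mathematica 4.2'


def created_by_hex(version: str) -> str:
    """Build the hex form of the CreatedBy string for a given version."""
    return f"CreatedBy='Mathematica {version}'".encode("ascii").hex()


# Round trip: the 4.2 string encodes back to the same hex sequence.
print(created_by_hex("4.2") == hex_sig)  # True
```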
Fixed it; it was just a copy-paste error. all.zip
I'll look into the false positives
Regarding the false positives, I think I found a solution. If the end of the file also matches one of 290a 0d0a 0d0a, 290a 0a0a, or 290a, then it is a Mathematica file. This matches 100% of the Mathematica files and none of the false positives.
Now I'm trying to create a signature for that, but it is more challenging than I thought :-)
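Outside of a PRONOM signature, the end-of-file test described here can be tried quickly in Python. This is a sketch under assumptions: the three byte sequences are copied from this comment, while the function name `has_notebook_tail` and the sample data are made up for illustration.

```python
# The three end-of-file sequences listed above, given as hex.
ENDINGS = tuple(
    bytes.fromhex(h) for h in ("290a 0d0a 0d0a", "290a 0a0a", "290a")
)


def has_notebook_tail(data: bytes) -> bool:
    """True if the data ends with one of the candidate closing sequences."""
    return data.endswith(ENDINGS)


def is_candidate(data: bytes) -> bool:
    """Combine the 2-byte '(*' start with the end-of-file check."""
    return data.startswith(b"(*") and has_notebook_tail(data)


pascal = b"(* Simple Pascal program *)\nprogram Hello;\nbegin\nend.\n"
notebook = b"(* CreatedBy='Mathematica 4.2' *)\nNotebook[{}]\n(* End *)\n"

print(is_candidate(pascal))    # False
print(is_candidate(notebook))  # True
```

Since the shortest ending, 290a, is just ")" followed by a line feed, a file that is only a single '(*' comment would still pass, so this narrows the generic match rather than proving the format.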
Looking for additional real-world examples from earlier versions of Mathematica (say, versions 1 and 2, if they existed). Also wondering about the resource cost of performing an identification with a signature that has several long text strings to search for.
Format name: Mathematica Notebook files
Version number(s): all?
PRONOM fmt/201 - No current signatures on file - http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=926&strPageToDisplay=summary
Extensions: nb
mime-type: text/plain; charset=us-ascii
Description: "Wolfram Mathematica (usually termed Mathematica) is a modern technical computing system spanning most areas of technical computing — including neural networks, machine learning, image processing, geometry, data science, visualizations, and others. The system is used in many technical, scientific, engineering, mathematical, and computing fields. It was conceived by Stephen Wolfram and is developed by Wolfram Research of Champaign, Illinois. The Wolfram Language is the programming language used in Mathematica."
Format type: Text (Structured)
Vendor: Wolfram Research
The signature from the 'file' command is:
Below is a list of common strings that appear in these files.
Content-type: application/vnd.wolfram.mathematica
Content-type: application/mathematica
Wolfram Notebook File
Mathematica-Compatible Notebook
CreatedBy='Mathematica x.x'
http://www.wolfram.com/nb
mathematica.zip
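As a rough illustration of how the CreatedBy string could drive version-specific identification with a generic fallback, here is a Python sketch. The labels it returns are placeholders and not PUIDs, and the 4096-byte search window, the 200-byte generic window, and the function names are all assumptions.

```python
import re

# Matches e.g. CreatedBy='Mathematica 10.0' and captures the version number.
CREATED_BY = re.compile(rb"CreatedBy='Mathematica (\d+\.\d+)'")


def notebook_version(data: bytes, window: int = 4096):
    """Return the version from the CreatedBy string, or None if absent."""
    match = CREATED_BY.search(data[:window])
    return match.group(1).decode("ascii") if match else None


def classify(data: bytes) -> str:
    """Prefer a version-specific label; fall back to a generic one."""
    version = notebook_version(data)
    if version is not None:
        return f"Mathematica Notebook {version}"
    if data.startswith(b"(*") and b"Mathematica" in data[:200]:
        return "Mathematica Notebook (generic)"
    return "unknown"


print(classify(b"(* CreatedBy='Mathematica 10.0' *)\n"))
print(classify(b"(*** Mathematica-Compatible Notebook ***)\n"))
print(classify(b"(* Simple Pascal program *)\n"))
```

This mirrors the approach discussed in the thread: most files get a version, and anything else that still looks like a notebook gets the general identification.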