digital-preservation / pronom-research-week

A persistent repository for PRONOM Research Week activities

Format Identification Methodology and Programming #23

Open gleporeNARA opened 3 years ago

gleporeNARA commented 3 years ago

I thought it might be a good idea to document my particular approach to identifying format signatures, in the hope that it will help other people in their work. I use a combination of Python programs and Bash scripts to largely automate my analysis of new formats.

Many of us are working on unknown file formats from our own repositories, but in addition to local files, there are also tons of online repositories of formats, software, and documentation to draw upon. I download large datasets from various online archives (old computers, software, etc.) as well as from the Internet Archive.

Once I've downloaded my datasets, I run Siegfried against the files to find the unknowns (I flag every file where Siegfried returns either UNKNOWN or an "extension only" match). I then use a Bash script I wrote called 'getid' to pull out all files with the same extension and copy them to my working environment. The first thing I do when I enter that directory on the command line is to run another script called 'binhead', which dumps the first 16 bytes of each file to the screen. It looks like this:

4745 4f49 4420 4558 5452 4143 5445 4420 GEOID EXTRACTED
4745 4f49 4420 4558 5452 4143 5445 4420 GEOID EXTRACTED
4745 4f49 4420 4558 5452 4143 5445 4420 GEOID EXTRACTED
4745 4f49 4420 4558 5452 4143 5445 4420 GEOID EXTRACTED
4745 4f49 4420 4558 5452 4143 5445 4420 GEOID EXTRACTED
4745 4f49 4420 4558 5452 4143 5445 4420 GEOID EXTRACTED
4745 4f49 4420 4558 5452 4143 5445 4420 GEOID EXTRACTED
4745 4f49 4420 4558 5452 4143 5445 4420 GEOID EXTRACTED
4745 4f49 4420 4558 5452 4143 5445 4420 GEOID EXTRACTED
4745 4f49 4420 4558 5452 4143 5445 4420 GEOID EXTRACTED
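('binhead' itself is a short shell function, which I share later in this thread. For readers who prefer Python, the same kind of dump can be sketched like this; the function names here are my own, not the script itself:)

```python
from pathlib import Path

def binhead_line(data, nbytes=16):
    """Format the first nbytes of a file's contents as hex pairs plus printable ASCII."""
    head = data[:nbytes]
    hexpart = " ".join(head[i:i + 2].hex() for i in range(0, len(head), 2))
    text = "".join(chr(b) if 32 <= b < 127 else "." for b in head)
    return f"{hexpart}  {text}"

def binhead(directory=".", nbytes=16):
    """Print one line per file in `directory`, like piping `head -c` through xxd."""
    for p in sorted(Path(directory).iterdir()):
        if p.is_file():
            print(binhead_line(p.read_bytes(), nbytes))
```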

'binhead' has a companion program called 'bintail' which dumps the last 16 bytes of each file to the screen.

'binhead' takes one argument, the number of bytes to dump; it defaults to 16 but can take more. The extract above suggests there is more plain text in the header, so my next step is to run 'binhead' with more bytes. 'binhead 26' yields the following:

4745 4f49 4420 4558 5452 4143 5445 4420 5245 4749 4f4e 2020 2020 GEOID EXTRACTED REGION
4745 4f49 4420 4558 5452 4143 5445 4420 5245 4749 4f4e 2020 2020 GEOID EXTRACTED REGION
4745 4f49 4420 4558 5452 4143 5445 4420 5245 4749 4f4e 2020 2020 GEOID EXTRACTED REGION
4745 4f49 4420 4558 5452 4143 5445 4420 5245 4749 4f4e 2020 2020 GEOID EXTRACTED REGION
4745 4f49 4420 4558 5452 4143 5445 4420 5245 4749 4f4e 2020 2020 GEOID EXTRACTED REGION
4745 4f49 4420 4558 5452 4143 5445 4420 5245 4749 4f4e 2020 2020 GEOID EXTRACTED REGION
4745 4f49 4420 4558 5452 4143 5445 4420 5245 4749 4f4e 2020 2020 GEOID EXTRACTED REGION
4745 4f49 4420 4558 5452 4143 5445 4420 5245 4749 4f4e 2020 2020 GEOID EXTRACTED REGION
4745 4f49 4420 4558 5452 4143 5445 4420 5245 4749 4f4e 2020 2020 GEOID EXTRACTED REGION
4745 4f49 4420 4558 5452 4143 5445 4420 5245 4749 4f4e 2020 2020 GEOID EXTRACTED REGION

The 2020 2020 at the end are just spaces. I think that's enough to create a PRONOM signature for these files, so I run another program called 'lcs' on the files. LCS stands for Longest Common Subsequence, an algorithm that finds the longest string of characters common to all of the strings analyzed. Calling 'lcs geo 22' identifies the longest common subsequence in the first 22 bytes of all files with a .geo extension. The LCS algorithm is extremely processor intensive, so it's impractical to run it on strings much larger than roughly 1024 bytes. Running 'lcs' produces the hex values "47454f49442045585452414354454420524547494f4e", which I then use to create a new PRONOM signature, using Ross Spencer's online tool.
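(The real 'lcs' program is linked later in this thread. As a rough illustration of the idea only, a brute-force longest-common-substring over the file heads can be sketched in Python like this; the function names are mine, and this naive version is only tolerable on short prefixes, which is exactly how it gets used here:)

```python
from pathlib import Path

def longest_common_substring(blobs):
    """Return the longest byte string that appears in every blob (naive search)."""
    if not blobs:
        return b""
    ref = min(blobs, key=len)          # candidates are taken from the shortest blob
    best = b""
    for i in range(len(ref)):
        for j in range(len(ref), i, -1):
            if j - i <= len(best):     # no candidate from here can beat the current best
                break
            cand = ref[i:j]
            if all(cand in blob for blob in blobs):
                best = cand
                break
    return best

def lcs(directory, ext, nbytes):
    """Longest common substring of the first nbytes of every *.ext file, as hex."""
    heads = [p.read_bytes()[:nbytes] for p in Path(directory).glob(f"*.{ext}")]
    return longest_common_substring(heads).hex()
```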

Once I've downloaded the new signature XML file, I run yet another Bash script, 'sfxx'. The script copies the XML file to my Siegfried directory, builds the new signature, and runs it against the files in the current directory.

I have quite a few other Python programs and Bash scripts that automate various parts of working with file formats. One Python script identifies which specific bytes (in hex) occur at the same location across all files in a directory, which can help identify format signatures that are not text based. I still have work to do: I'd like to adapt that program to also search for common bytes starting from the end of the file, which would help identify EOF markers.
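(That script isn't reproduced in this thread, but in outline it does something like this; a simplified sketch, not the actual code:)

```python
from pathlib import Path

def common_bytes_at_offsets(paths, length=64):
    """Return (offset, byte) pairs identical across every file's first `length` bytes."""
    heads = [Path(p).read_bytes()[:length] for p in paths]
    shortest = min(len(h) for h in heads)
    return [(i, heads[0][i])
            for i in range(shortest)
            if all(h[i] == heads[0][i] for h in heads)]
```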

I'm not a great programmer; I mostly hack and slash at the code until it roughly works, but the above process has greatly increased my ability to analyze unknown formats. I'd be interested to know whether other researchers are writing their own code to help analyze file formats.

Please reply below with any questions or comments.

gleporeNARA commented 3 years ago

For those interested in the Bash scripts, here they are, from my .bashrc, which maps each alias or function name to the full command.

alias stringme='strings --radix=d -n 4 * | sort | uniq -c | sort -gr | head'

  • prints the most frequent strings (four characters or longer) found across all files in the directory.

function sfxx () {
    sudo cp *.xml /usr/share/siegfried/custom &&
    sudo roy build -extend *.xml "$1".sig &&
    sf -csv -sig "$1".sig *
}

function binhead () {
    hexnum="${1:-16}"                           # bytes to dump; defaults to 16
    head -q -c "$hexnum" * | xxd -c "$hexnum"
}

function bintail () {
    hexnum="${1:-16}"                           # bytes to dump; defaults to 16
    tail -q -c "$hexnum" * | xxd -c "$hexnum"
}

function getid () {
    mkdir /working/"$1"
    find . -iname "*.$1" -exec cp '{}' /working/"$1"/ \;
}

  • takes one argument, the extension of the files to be copied, and copies all matching files to the working directory.

The 'lcs' program is available at:

https://github.com/gleporeNARA/pronom-research/blob/master/lcs

as is a program called 'idcom' which runs 'file', Siegfried, and TRiD against all files in a directory.

archivist-liz commented 3 years ago

Thanks for sharing this! I would suggest writing this up as a blog post because I think that not many people will read the issues in the repository. I'll tweet this out, but I'm definitely not enough of a coder to provide feedback. I'll be happy if I can follow the steps well enough to try it out at home. ;)