cdgriffith / puremagic

Pure python implementation of identifying files based off their magic numbers
MIT License

2024-06-25 sndhdr update and HD/CD/DVD Image files #87

Closed NebularNerd closed 1 month ago

NebularNerd commented 2 months ago

Should close #85

SNDHDR Parity update (and HD/CD/DVD Image files)

.aif/.aiff/.aiffc/.8svx:

These are IFF-based files that all start with 0x464f524d / FORM. AIFF files are a hodgepodge of formats and specs all thrown under the same label; different compression styles, or similar compression styles with the wrong FourCC, can render a file unplayable on certain software.

I've updated/tidied the database to recognise the additional AIFF or AIFC header at byte 8. With possible enhancements under V2 we could perform further matches to detail the compression used and possibly even bitrates etc.
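To make the byte-8 check concrete, here is a minimal sketch of the idea. It is not puremagic's code; `classify_iff` is a made-up helper name and it only looks at the first 12 bytes of a file.

```python
from typing import Optional


def classify_iff(data: bytes) -> Optional[str]:
    """Return "AIFF" or "AIFC" if the buffer looks like one, else None.

    Hypothetical helper for illustration only.
    """
    # An IFF container starts with the 4-byte tag FORM (0x464f524d).
    if len(data) < 12 or data[0:4] != b"FORM":
        return None
    # Bytes 4-7 hold the chunk size; the form type sits at byte 8.
    form_type = data[8:12]
    if form_type in (b"AIFF", b"AIFC"):
        return form_type.decode("ascii")
    return None
```

Any other form type at byte 8 (8SVX for example) falls through to None here, so it would need its own entry.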

.au:

The existing fingerprint should match all files, so no changes. We could extract more info, but looking at how sndhdr does it I'll leave that for a V2 upgrade.

.hcom:

There exists almost no information on the format, what there is, is basically the same data as linked below in differing formats. From what I can see it's some old Apple Mac format possibly used in apps and games.

The sndhdr test looks for two headers, one is in the Mac header, the other in the Mac data fork. For the time being I have added them as two separate tests, this will give a low-ish confidence score, however, in the absence of test files there is little more I can do.

If anyone ever reads this and has some sample files, I'll take a look to improve this match.

.sndt:

After a lot of digging I found this format seems to belong to a very old Win 3.1 era program called SoundTool/SNDTOOL, I managed to source a copy buried in a shareware .iso at archive.org. Downloading it and comparing a sample file included to the ones below seems to indicate this is the source of these files.

.voc/.wav:

No changes required, existing fingerprint will match any VOC/WAV file. V2 Improvements could look to decode audio data for sample rate etc...

.sb/.ub/.ulaw:

Cannot add. As far as I can guess the intentions of the sndhdr authors, .sb and .ub are intended to be signed or unsigned byte streams. This means they are simply a stream of bytes that hold audio data; only knowledge of the correct bitrate etc. lets you decode them back to audio.

.ulaw is essentially a CODEC used in various audio containers such as AIFF and AU, this again means there is no specific ulaw file format.

In these cases, there is not a lot we can do to detect these files. It would basically require creating an audio decoder similar to sndhdr or Audacity, VLC etc. to fully process and try to understand these files. This could be possible with V2 but this would take on a life of its own.

.sndr:

I have no idea on this, I've added the header match from sndhdr but again without test files or knowledge of the program they came from we can't go any better than that.

Again, if anyone reading this has any test files of the program that made it, I'll take a look and improve.

Other formats:

Honestly this PR is a bit of a so-so one, so let's add some extra stuff to make it more exciting.

.vhdx:

The updated version of the older .vhd format used by Microsoft Hyper-V and Virtual PC, nice simple header of 0x7668647866696c65 / vhdxfile.
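The hex strings quoted throughout this PR are just the ASCII tags written out byte by byte. A quick sketch of checking one against a buffer, assuming a hypothetical helper called `starts_with_hex` (not part of puremagic):

```python
def starts_with_hex(data: bytes, hex_sig: str, offset: int = 0) -> bool:
    """True if `data` contains the signature given as a hex string at `offset`.

    Hypothetical helper for illustration only.
    """
    sig = bytes.fromhex(hex_sig)
    return data[offset:offset + len(sig)] == sig


# The .vhdx signature decodes to the readable tag quoted above.
VHDX_SIG = "7668647866696c65"
print(bytes.fromhex(VHDX_SIG))
```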

.qcow/.qcow2/.qed:

QEMU's Hard drive image formats. Simple headers with version numbers

.luks

Linux Unified Key Setup is another HD Image format, there are two versions LUKS1 and LUKS2.

.vdi

Sun/Oracle HD Image for use with VirtualBox. Nice long headers to match against. There is no official document on the format it seems but a good breakdown is available, linked below.

As far as I can see there is only one version (1.1) with the same image signature starting at byte 64 for both flavours, I've included it as a multi-part for completeness.

.vmdk

There are already entries in the .json for VMWare .vmdk files, I have tidied and adjusted some to better match real world files

.dmg

The venerable archive format of Mac OS machines, the existing entry would only ever work for the file it came from. The correct way to identify a .dmg is to use a footer match at -512 for koly.

["7801730d626260", 0, ".dmg", "application/octet-stream", "MacOS X image file"] and ["", 0, ".dmg", "application/octet-stream", "MacOS X image file"] removed, new entry in footer added
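For anyone wondering what a footer match at -512 looks like in practice, here is a rough sketch. The function name `has_koly_trailer` is made up for illustration and this is not how puremagic implements footer matching internally.

```python
import io
from typing import BinaryIO


def has_koly_trailer(stream: BinaryIO) -> bool:
    """True if the 4 bytes starting 512 bytes before the end are b"koly".

    Hypothetical helper for illustration only.
    """
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    if size < 512:
        return False  # too small to hold the trailer at all
    stream.seek(size - 512)
    return stream.read(4) == b"koly"
```

Usage would be something like `with open(path, "rb") as fh: has_koly_trailer(fh)`. Seeking from the end means only 4 bytes are read no matter how large the image is.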

OK, Even more formats

I'll note here the CD/DVD images are a real pain in the backside, lots of overlapping headers and proprietary info. This is a good start for later V2 fun.

MagicISO Image Format .uif

A seemingly much hated proprietary format for storing images of CD/DVD's. Can't find any test files or documentation, however, there is UIF2ISO which converts the files to regular ISO. Digging in the source seems to show a header at byte 0 of 0x73696262 / sibb with another match at byte 8 of 0x72686c62 / rhlb if it's encrypted.

If I ever come across a real file to test against I'll confirm this but the code has been around a long time so it's pretty safe to assume it's correct.

PowerISO Direct Access Archive .daa

Another proprietary format for storing images of CD/DVD's, much like .uif it's also pretty unpopular. The author of UIF2ISO also created a tool to deal with them called DAA2ISO. Simple header of 0x444141 / DAA at byte 0

gBurner Image .gbi

Another proprietary format for storing images of CD/DVD's, it appears to be quite similar to .daa as DAA2ISO handles both. Simple header of 0x474249 / GBI at byte 0

Apple HyperCard Stack .hc

While I was looking for data on another .hc extension, HyperCards popped up, so we'll add them in while we're here. HyperCards were almost a precursor to web pages, able to store text and images in a clickable, searchable database. Header of 0x5354414b / STAK at byte 4

VeraCrypt File Container .hc

An encrypted image container, we can only add this as an extension as the VERA header at byte 64 and all data following is encrypted by the 64 byte salt.

Nero Disc images .nrg

Nero was once one of the most popular CD/DVD burning tools, the .nrg was their own custom image format. These use Footer matches for the two versions 0x4e45524f / NERO at -8 and 0x4e455235 / NER5 at -12 for v1 and v2 images.
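A small sketch of the two footer checks, working on the tail of a file already read into memory. `nrg_version` is a hypothetical name, and mapping NERO to v1 and NER5 to v2 follows the description above.

```python
from typing import Optional


def nrg_version(tail: bytes) -> Optional[int]:
    """Return 1 or 2 for a Nero image footer, else None.

    `tail` should be at least the last 12 bytes of the file.
    Hypothetical helper for illustration only.
    """
    # v2: NER5 (0x4e455235) starts 12 bytes before the end of the file.
    if len(tail) >= 12 and tail[-12:-8] == b"NER5":
        return 2
    # v1: NERO (0x4e45524f) starts 8 bytes before the end of the file.
    if len(tail) >= 8 and tail[-8:-4] == b"NERO":
        return 1
    return None
```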

Compressed ISO images .isz

Created by EZB Systems for use in their various products, this is an open specification for producing ZLIB compressed version of ISO images. Header is 0x49735a21 / IsZ! at byte 0

DiscJuggler images .cdi

Padus DiscJuggler was a professional mastering solution for CD and DVD. Due to their .cdi image format being highly flexible, it got adopted as the de-facto format for archiving Dreamcast games. There appear to be a few versions. Adding as an extension only, looking at the source for cdi2nero it's a complex format that would need a partial port of that app to understand them, looking at libMirage confirms this idea.

CloneCD Control File .ccd, Image .img and Subchannel Info .sub

CloneCD is another powerful CD/DVD image tool. The .ccd contains various metadata relating to the .img file. Official specs on the format are non-existent it seems, I've inferred the matches from samples from a range of sources. Much like .cdi above some form of decoding may be the way to go in the future, looking at libMirage confirms this idea.

BlindWrite images .b5t / .b6t and BlindRead images .bwt

BlindWrite and its predecessor BlindRead are another set of CD/DVD imaging tools. Much like CloneCD they can produce various files to preserve important information about the source disk. Most of these will be extension only for the time being as I lack sample files and cannot find much about the format.

WinOnCD images .c2d

While browsing the libMirage source for other formats, this one was in the list. This was an early entry into the CD mastering market, it changed hands a couple of times from Roxio to Adaptec. Two headers: 0x4164617074656320436551756164726174205669727475616c43442046696c65 / Adaptec CeQuadrat VirtualCD File and 0x526f78696f20496d6167652046696c6520466f726d617420332e30 / Roxio Image File Format 3.0

Adaptec Easy CD/DVD Creator image file .cif

Another CD/DVD creator software purchased by Adaptec from Corel, header info from libMirage. This uses a RIFF header, then at byte 8 0x696d6167 / imag. There are earlier versions of the format that used .cl2, .cl3 and .cl4 but there is no info on these formats beyond that, so I will add them as extension only until sample files are found.
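The RIFF-plus-byte-8 check is a two-part match, which can be sketched generically as a list of (offset, bytes) pairs that must all hit. `matches_all` and `CIF_PARTS` are names I've invented here; this is not the structure puremagic uses for its multi-part entries.

```python
from typing import List, Tuple

# (offset, signature) pairs for the .cif header described above.
CIF_PARTS: List[Tuple[int, bytes]] = [(0, b"RIFF"), (8, b"imag")]


def matches_all(data: bytes, parts: List[Tuple[int, bytes]]) -> bool:
    """True only if every signature is found at its offset.

    Hypothetical helper for illustration only.
    """
    return all(data[off:off + len(sig)] == sig for off, sig in parts)
```

Requiring both parts is what separates a .cif from every other RIFF file (.wav, .avi and so on), which all share the first four bytes.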

Alcohol 120% image file .mds and GameJack image file .xmd

Another powerful CD/DVD image creator, like BlindWrite and CloneCD it can make near perfect copies of most discs. There's not much info on GameJack, it's either a licensed or questionable clone of Alcohol.

Daemon Tools image file .mdx

Pretty much one of the most popular virtual drive tools, it's been around for a very long time.

Apple Toast File .toast

Toast is an early CD burning software package for Macs, it's changed hands many times over the years. Early toast files have a header of 0x45520200 / ER. Later toast files are simply .iso with a different name.

Links:

cdgriffith commented 1 month ago

Thank you for all these additions!