drewnoakes / metadata-extractor

Extracts Exif, IPTC, XMP, ICC and other metadata from image, video and audio files
Apache License 2.0

FileTypeDetector.detectFileType() recognizes Photoshop created TIFFs as ARW #217

Open Nadahar opened 7 years ago

Nadahar commented 7 years ago

I've noticed that a "suspicious amount" of TIFFs are detected as ARW by FileTypeDetector.detectFileType(). I took a screenshot, pasted it into PS and saved it as a TIFF, so that it was "created from scratch" in PS. It is detected as ARW.

I'm attaching the screenshot, but this should be easy to replicate. ScreenShot.zip

drewnoakes commented 7 years ago

The TIFF file you gave starts with:

[image: the opening bytes of the file]

We currently observe the first bytes are 49 49 2A 00 08 00 and decide it's ARW.

ARW files and your TIFF diverge at offset 8 (ARW has 0x12, your TIFF has 0x16). However, not all the TIFF files we have on file agree on what occurs at the point of divergence.

This will require further research, either finding some official documentation or a greater number of ARW/TIFF images.

blauwers commented 7 years ago

It seems there is a good deal of information on the TIFF format. The following may be the most useful:

The header is only 8 bytes in length:

typedef struct _TiffHeader
{
    WORD  Identifier;  /* Byte-order Identifier */
    WORD  Version;     /* TIFF version number (always 2Ah) */
    DWORD IFDOffset;   /* Offset of the first Image File Directory*/
} TIFHEAD;

The identifier is either 49 49 (II) or 4D 4D (MM), for little-endian (Intel) and big-endian (Motorola) respectively, which makes the first four bytes either 49 49 2A 00 or 4D 4D 00 2A. The offset is 32-bit, and 08 00 00 00 means that the IFD immediately follows the header.
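To make the header layout concrete, here is a minimal Java sketch of reading those 8 bytes. The class name `TiffHeaderSketch` and its fields are made up for illustration and are not part of metadata-extractor; only `java.nio` is used.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper for illustration; not part of metadata-extractor. */
final class TiffHeaderSketch {
    final ByteOrder order;   // derived from the II / MM identifier
    final long ifdOffset;    // unsigned 32-bit offset of the first IFD

    private TiffHeaderSketch(ByteOrder order, long ifdOffset) {
        this.order = order;
        this.ifdOffset = ifdOffset;
    }

    /** Returns the parsed header, or null if the bytes do not start with a TIFF header. */
    static TiffHeaderSketch parse(byte[] bytes) {
        if (bytes == null || bytes.length < 8) {
            return null;
        }
        ByteOrder order;
        if (bytes[0] == 0x49 && bytes[1] == 0x49) {
            order = ByteOrder.LITTLE_ENDIAN;   // "II" (Intel)
        } else if (bytes[0] == 0x4D && bytes[1] == 0x4D) {
            order = ByteOrder.BIG_ENDIAN;      // "MM" (Motorola)
        } else {
            return null;
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(order);
        if (buf.getShort(2) != 0x2A) {         // version word, always 2Ah
            return null;
        }
        long ifdOffset = buf.getInt(4) & 0xFFFFFFFFL;
        return new TiffHeaderSketch(order, ifdOffset);
    }
}
```

For the bytes 49 49 2A 00 08 00 00 00 this yields little-endian order and an IFD offset of 8.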

So at offset 8 the IFD starts which looks as follows:

typedef struct _TifIfd
{
    WORD    NumDirEntries;    /* Number of Tags in IFD  */
    TIFTAG  TagList[];        /* Array of Tags  */
    DWORD   NextIFDOffset;    /* Offset to next IFD  */
} TIFIFD;

So when at offset 8 there is either 0x12 or 0x16, it only gives information on how many entries are in the IFD.
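As a rough illustration of that point, the sketch below reads the 16-bit entry count of the first IFD. `IfdEntryCountSketch` and `firstIfdEntryCount` are hypothetical names, not library API, and the code assumes the first IFD lies within the bytes passed in.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper for illustration; not part of metadata-extractor. */
final class IfdEntryCountSketch {

    /**
     * Returns the number of entries in the first IFD, or -1 if the bytes
     * are not a TIFF header or the IFD starts beyond the supplied bytes.
     */
    static int firstIfdEntryCount(byte[] bytes) {
        if (bytes == null || bytes.length < 8) {
            return -1;
        }
        ByteOrder order;
        if (bytes[0] == 0x49 && bytes[1] == 0x49) {
            order = ByteOrder.LITTLE_ENDIAN;
        } else if (bytes[0] == 0x4D && bytes[1] == 0x4D) {
            order = ByteOrder.BIG_ENDIAN;
        } else {
            return -1;
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(order);
        if (buf.getShort(2) != 0x2A) {
            return -1;
        }
        long ifdOffset = buf.getInt(4) & 0xFFFFFFFFL;
        if (ifdOffset + 2 > bytes.length) {
            return -1;
        }
        // NumDirEntries is an unsigned 16-bit word at the start of the IFD.
        return buf.getShort((int) ifdOffset) & 0xFFFF;
    }
}
```

With the leading bytes discussed here, 0x12 at offset 8 reads as 18 entries and 0x16 as 22 entries. The value is a tag count, not a format signature.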

When Sony later created the ARW format they based it on the TIFF format, as have many others. The correct interpretation of the byte sequence is that it pertains to a TIFF, not an ARW.

The ARW differs from TIFF mainly in a special Maker note as part of the EXIF data and always has the little-endian byte ordering, 1 subIFD, and the full resolution image. More on this here: http://lclevy.free.fr/raw/arw.txt

drewnoakes commented 7 years ago

Thanks @blauwers, indeed, after 8 bytes there's no guarantee of stability. From opening bytes there's no way to differentiate ARW from regular TIFF. ARW will probably have to be removed from the file type enumeration.

I'm leaning towards differentiating between container format and content format. There are a bunch of TIFF-contained content formats (most of the camera raw formats). RIFF is another container that's shared between e.g. WebP and WAV.
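One way such a split could look is sketched below. This is only an assumption about a possible design, not the library's API: `FormatSketch`, `Container` and `Content` are invented names. The RIFF branch relies on the four-character form type stored at offset 8 ("WEBP" or "WAVE"), while TIFF content is left undetermined because it cannot be read from the opening bytes.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical design sketch; not part of metadata-extractor. */
final class FormatSketch {
    enum Container { TIFF, RIFF, UNKNOWN }
    enum Content { WEBP, WAV, UNDETERMINED }

    /** Identifies the container from the first four bytes only. */
    static Container container(byte[] b) {
        if (b == null || b.length < 4) {
            return Container.UNKNOWN;
        }
        boolean tiffLittle = b[0] == 0x49 && b[1] == 0x49 && b[2] == 0x2A && b[3] == 0x00;
        boolean tiffBig = b[0] == 0x4D && b[1] == 0x4D && b[2] == 0x00 && b[3] == 0x2A;
        if (tiffLittle || tiffBig) {
            return Container.TIFF;
        }
        if (b[0] == 'R' && b[1] == 'I' && b[2] == 'F' && b[3] == 'F') {
            return Container.RIFF;
        }
        return Container.UNKNOWN;
    }

    /** Identifies the content where the opening bytes allow it. */
    static Content content(byte[] b) {
        if (container(b) == Container.RIFF && b.length >= 12) {
            // RIFF stores a four-character form type after the 4-byte size field.
            String form = new String(b, 8, 4, StandardCharsets.US_ASCII);
            if (form.equals("WEBP")) {
                return Content.WEBP;
            }
            if (form.equals("WAVE")) {
                return Content.WAV;
            }
        }
        // TIFF-based content (ARW and other raw formats) would need IFD parsing.
        return Content.UNDETERMINED;
    }
}
```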

kwhopper commented 7 years ago

Exiftool has a very long, sometimes make/model-specific process for determining filetype. The library could probably do something similar, although it will take a lot of research. Maybe add it to either another properties Directory or expand FileMetadataDirectory?

blauwers commented 7 years ago

@drewnoakes I agree that, for now, calling all of these TIFF would be the correct way to do it. The last link from my previous reply has a good list of derivative formats which need more elaborate detection mechanisms to detect the sub-type.

Most of the subtype divergence is encoded in the IFDs and Makernotes, which is where the non-EXIF metadata for the TIFF format gets stored. Parsing these makes sense, and I get your point that the effort and research needed are non-trivial. Fortunately, there may be an opportunity to leverage some of the logic from ExifTool to minimize the research, as @kwhopper suggested.

Sami32 commented 7 years ago

Here is another TIFF sample recognized as ARW: ( LibFormat - 49 49 2a 00 08 00 00 00 0f ) colored3.zip

EDIT: This sample was converted to TIFF with XnView from the original file: https://d1a0n9gptf7ayu.cloudfront.net/cache/2d/ed/2ded9254c55cb67dcaa25a070c62b6be.jpg

Some more, open sourced, samples taken from there: dscf0013.zip (Fujifilm - 49 49 2a 00 08 00 00 00 15 ) pc260001.zip (Olympus - 49 49 2a 00 08 00 00 00 14 )

If you need more ARW samples, this link could help: https://www.rawsamples.ch/index.php/en/sony

drewnoakes commented 7 years ago

Thanks. This is definitely a bug and will be fixed, even if to the detriment of ARW handling, in the next release.

Sami32 commented 7 years ago

Not sure if it can help, but these guys have already done some research and provide interesting information and links: https://github.com/haraldk/TwelveMonkeys/issues/136

eximius313 commented 6 years ago

I'm also affected by this bug. @drewnoakes, when do you plan the next release?