Open Nadahar opened 7 years ago
The TIFF file you gave starts with:
We currently observe that the first bytes are 49 49 2A 00 08 00 and decide it's ARW.
ARW files and your TIFF diverge at offset 8 (ARW has 0x12, your TIFF has 0x16). However, not all the TIFF files we have on file agree on what comes at that point.
This will require further research, either finding some official documentation or a greater number of ARW/TIFF images.
It seems there is a good deal of information on the TIFF format. The following may be the most useful:
The header is only 8 bytes in length:
typedef struct _TiffHeader
{
    WORD  Identifier; /* Byte-order Identifier */
    WORD  Version;    /* TIFF version number (always 2Ah) */
    DWORD IFDOffset;  /* Offset of the first Image File Directory */
} TIFHEAD;
The identifier is either 49 49 (II) or 4D 4D (MM), for little-endian (Intel) and big-endian (Motorola) byte order respectively. That makes the first four bytes either 49 49 2A 00 or 4D 4D 00 2A. The offset is 32-bit, and the value 08 00 00 00 means that the IFD immediately follows the header.
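To make the header layout concrete, here is a minimal Python sketch that decodes those 8 bytes. The function name `parse_tiff_header` is made up for illustration and is not part of any library discussed in this thread.

```python
import struct


def parse_tiff_header(data: bytes):
    """Return (byte_order, version, ifd_offset) from an 8-byte TIFF header.

    Hypothetical helper for illustration only.
    """
    if len(data) < 8:
        raise ValueError("a TIFF header is 8 bytes long")
    identifier = data[:2]
    if identifier == b"II":
        endian = "<"  # little-endian (Intel)
    elif identifier == b"MM":
        endian = ">"  # big-endian (Motorola)
    else:
        raise ValueError("not a TIFF byte-order identifier")
    # WORD Version followed by DWORD IFDOffset, in the declared byte order
    version, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if version != 0x2A:
        raise ValueError("unexpected TIFF version")
    return identifier.decode("ascii"), version, ifd_offset


print(parse_tiff_header(bytes.fromhex("49492A0008000000")))  # ('II', 42, 8)
print(parse_tiff_header(bytes.fromhex("4D4D002A00000008")))  # ('MM', 42, 8)
```

Both byte orders decode to the same logical header, which is why nothing in these 8 bytes can identify a TIFF-derived format.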
So at offset 8 the IFD starts which looks as follows:
typedef struct _TifIfd
{
    WORD   NumDirEntries; /* Number of Tags in IFD */
    TIFTAG TagList[];     /* Array of Tags */
    DWORD  NextIFDOffset; /* Offset to next IFD */
} TIFIFD;
So when at offset 8 there is either 0x12 or 0x16, it only gives information on how many entries are in the IFD.
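A small Python sketch of that point: following the IFD offset and reading the 16-bit entry count shows that 0x12 and 0x16 are just 18 and 22 directory entries. The helper name `ifd0_entry_count` is hypothetical, and the input is a hand-built byte string, not a real image.

```python
import struct


def ifd0_entry_count(data: bytes) -> int:
    """Return NumDirEntries of the first IFD (illustrative helper)."""
    endian = {b"II": "<", b"MM": ">"}[data[:2]]
    # DWORD IFDOffset lives at byte 4 of the header
    (ifd_offset,) = struct.unpack_from(endian + "I", data, 4)
    # WORD NumDirEntries is the first field of the IFD
    (count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    return count


header = bytes.fromhex("49492A0008000000")
print(ifd0_entry_count(header + bytes.fromhex("1200")))  # 18
print(ifd0_entry_count(header + bytes.fromhex("1600")))  # 22
```

A detector that keys on this byte is effectively matching on "this file happens to have 18 tags in its first IFD", which any TIFF writer can produce.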
When Sony later created the ARW format they based it on the TIFF format, as have many others. The correct interpretation of the byte sequence is that it pertains to a TIFF, not an ARW.
The ARW differs from TIFF mainly in a special Maker note as part of the EXIF data and always has the little-endian byte ordering, 1 subIFD, and the full resolution image. More on this here: http://lclevy.free.fr/raw/arw.txt
Thanks @blauwers, indeed, after 8 bytes there's no guarantee of stability. From opening bytes there's no way to differentiate ARW from regular TIFF. ARW will probably have to be removed from the file type enumeration.
I'm leaning towards differentiating between container format and content format. There are a bunch of TIFF-contained content formats (most of the camera raw formats). RIFF is another container that's shared between e.g. WebP and WAV.
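A rough Python sketch of what that split could look like. The function name and the returned labels are invented for illustration and do not reflect the library's actual API. It relies only on the TIFF magic bytes and on the RIFF form type stored at bytes 8 to 11.

```python
def detect_container(data: bytes):
    """Return (container, content) guessed from the leading bytes.

    Hypothetical helper: content is None when the leading bytes
    cannot identify the content format.
    """
    if data[:4] in (b"II*\x00", b"MM\x00*"):
        # Plain TIFF and TIFF-based raw formats look identical here;
        # telling them apart would need IFD and makernote parsing.
        return ("TIFF", None)
    if data[:4] == b"RIFF" and len(data) >= 12:
        # RIFF: 4-byte magic, 4-byte size, then a 4-byte form type
        form_type = data[8:12]
        content = {b"WEBP": "WebP", b"WAVE": "WAV"}.get(form_type)
        return ("RIFF", content)
    return (None, None)


print(detect_container(b"RIFF\x24\x00\x00\x00WAVEfmt "))  # ('RIFF', 'WAV')
print(detect_container(bytes.fromhex("49492A0008000000")))  # ('TIFF', None)
```

The asymmetry is the interesting part: RIFF announces its content type in the header, while TIFF does not.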
Exiftool has a very long, sometimes make/model-specific process for determining filetype. The library could probably do something similar, although it will take a lot of research. Maybe add it to either another properties Directory or expand FileMetadataDirectory?
@drewnoakes I agree that, for now, calling all of these TIFF would be the correct way to do it. The last link from my previous reply has a good list of derivative formats which need more elaborate detection mechanisms to detect the sub-type.
Most of the subtype divergence is encoded in the IFDs and makernotes, which is where the non-EXIF metadata for the TIFF format gets stored. Parsing these makes sense, and I take your point that the effort and research needed are non-trivial. Fortunately, there may be an opportunity to leverage some of the logic from ExifTool to minimize the research, as @kwhopper suggested.
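As a very rough idea of what IFD-level inspection involves, here is a Python sketch that walks the first IFD and returns the standard TIFF Make tag (0x010F). Treating Make as a hint towards a vendor raw format is my assumption for illustration. It is not how ExifTool or this library decides, and on its own it would not be enough to call a file ARW. The helper name `read_make` is hypothetical.

```python
import struct

MAKE_TAG = 0x010F  # standard TIFF "Make" tag
ASCII_TYPE = 2     # TIFF field type for NUL-terminated ASCII


def read_make(data: bytes):
    """Return the Make string from the first IFD, or None (sketch only)."""
    endian = {b"II": "<", b"MM": ">"}.get(data[:2])
    if endian is None:
        return None
    (ifd_offset,) = struct.unpack_from(endian + "I", data, 4)
    (count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    for i in range(count):
        # Each IFD entry is 12 bytes: tag, type, count, value-or-offset
        entry = ifd_offset + 2 + 12 * i
        tag, field_type, n = struct.unpack_from(endian + "HHI", data, entry)
        if tag == MAKE_TAG and field_type == ASCII_TYPE:
            if n <= 4:
                start = entry + 8  # short values are stored inline
            else:
                (start,) = struct.unpack_from(endian + "I", data, entry + 8)
            raw = data[start:start + n]
            return raw.rstrip(b"\x00").decode("ascii", "replace")
    return None


# Hand-built little-endian TIFF with a single Make entry, not a real image
tiff = (
    b"II*\x00" + struct.pack("<I", 8)            # header, IFD at offset 8
    + struct.pack("<H", 1)                        # one directory entry
    + struct.pack("<HHII", MAKE_TAG, ASCII_TYPE, 5, 26)
    + struct.pack("<I", 0)                        # no next IFD
    + b"SONY\x00"                                 # value at offset 26
)
print(read_make(tiff))  # SONY
```

A real sub-type check would have to combine several such signals (makernote, subIFDs, byte order), which is where the research effort comes in.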
Here is another TIFF sample recognized as ARW: ( LibFormat - 49 49 2a 00 08 00 00 00 0f ) colored3.zip
EDIT: This sample was converted to TIFF with XnView from the original file: https://d1a0n9gptf7ayu.cloudfront.net/cache/2d/ed/2ded9254c55cb67dcaa25a070c62b6be.jpg
Some more open-source samples taken from there: dscf0013.zip (Fujifilm - 49 49 2a 00 08 00 00 00 15 ) pc260001.zip (Olympus - 49 49 2a 00 08 00 00 00 14 )
If you need more ARW samples, this link could help: https://www.rawsamples.ch/index.php/en/sony
Thanks. This is definitely a bug and will be fixed, even if to the detriment of ARW handling, in the next release.
Not sure if it helps, but these guys have already done some research and provide interesting information and links: https://github.com/haraldk/TwelveMonkeys/issues/136
I'm also affected by this bug. @drewnoakes, when do you plan the next release?
I've noticed that a "suspicious amount" of TIFFs are detected as ARW with FileTypeDetector.detectFileType(). I took a screenshot, pasted it in PS and saved it as a TIFF so that it was "created from scratch" in PS. It is detected as ARW. I'm attaching the screenshot, but this should be easy to replicate. ScreenShot.zip