KBNLresearch / omSipCreator

Create ingest-ready SIPs from batches of optical media images
Apache License 2.0

Establish MIME type using signature-based identification (e.g. Fido) + check if mimetype is as expected #23

Closed bitsgalore closed 7 years ago

bitsgalore commented 7 years ago

Perhaps FIDO could be used for this?

bitsgalore commented 7 years ago

As for FIDO integration: apparently FIDO still has no API, so integrating it into other Python applications looks like a bit of a mess:

http://wiki.opf-labs.org/display/KB/FIDO+Python+workflow+implementation+tips

Then also FIDO still has issues under Python 3:

https://github.com/openpreserve/fido/issues/79

So not going there for now.

Tika-python also doesn't work properly under Python 3:

from tika import detector
print(detector.from_file('/path/to/file'))

Results in:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 6: invalid start byte

There are some Python modules that wrap around libmagic (e.g. python-magic and filemagic), but they require the libmagic binaries to be installed separately, which is a bit of a pain on Windows.

Another possibility: write a custom detection function for WAV, FLAC and ISO only, based on the Tika signatures:

WAV:

<mime-type type="audio/x-wav">
    <acronym>WAV</acronym>
    <magic priority="20">
      <match value="RIFF....WAVE" type="string" offset="0"
             mask="0xFFFFFFFF00000000FFFFFFFF"/>
    </magic>
    <glob pattern="*.wav"/>
</mime-type>

FLAC:

<mime-type type="audio/x-flac">
    <acronym>FLAC</acronym>
    <_comment>Free Lossless Audio Codec</_comment>
    <magic priority="50">
      <match value="fLaC" type="string" offset="0"/>
    </magic>
    <glob pattern="*.flac"/>
</mime-type>

ISO 9660:

<mime-type type="application/x-iso9660-image">
    <acronym>ISO</acronym>
    <_comment>ISO 9660 CD-ROM filesystem data</_comment>
    <magic priority="50">
      <match value="CD001" type="string" offset="32769"/>
      <match value="CD001" type="string" offset="34817"/>
      <match value="CD001" type="string" offset="36865"/>
    </magic>
    <glob pattern="*.iso"/>
</mime-type>
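
A minimal sketch of what such a custom detection function could look like, using only the byte patterns and offsets from the three Tika definitions above. The function name `detect_mime_type` and the constant `ISO_OFFSETS` are made up for illustration, and the three ISO offsets are treated as alternatives (any one match is enough), which is an assumption about how the sibling match elements are meant to be read.

```python
# Offsets at which the Tika definition looks for the "CD001" identifier
ISO_OFFSETS = (32769, 34817, 36865)


def detect_mime_type(path):
    """Return a MIME type for WAV, FLAC or ISO 9660 files, else None.

    Hypothetical helper: only the signatures quoted above are checked.
    """
    with open(path, "rb") as f:
        header = f.read(12)

        # FLAC: "fLaC" at offset 0
        if header[:4] == b"fLaC":
            return "audio/x-flac"

        # WAV: "RIFF" at offset 0 and "WAVE" at offset 8; the mask in the
        # Tika definition ignores the four chunk-size bytes in between
        if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
            return "audio/x-wav"

        # ISO 9660: "CD001" at any of the listed offsets (assumed to be
        # alternatives); reading past the end of a short file yields b""
        for offset in ISO_OFFSETS:
            f.seek(offset)
            if f.read(5) == b"CD001":
                return "application/x-iso9660-image"

    return None
```

The result of `detect_mime_type('/path/to/file')` could then be compared against the MIME type that is expected for the file, with `None` meaning that none of the three signatures matched.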

bitsgalore commented 7 years ago

Changed to WONTFIX b/c adding identification would require wrapping an external tool, which seems a bit overkill. MIME type assignment is based on file extension instead.
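
For illustration, extension-based assignment could be as simple as a lookup table. This is only a sketch and not necessarily how omSipCreator implements it: the names `EXTENSION_TO_MIME` and `mime_type_from_extension` are hypothetical, and the MIME type strings are simply taken from the Tika definitions quoted earlier in this thread.

```python
import os

# Hypothetical mapping; MIME type strings follow the Tika definitions
# quoted earlier in this thread
EXTENSION_TO_MIME = {
    ".wav": "audio/x-wav",
    ".flac": "audio/x-flac",
    ".iso": "application/x-iso9660-image",
}


def mime_type_from_extension(path):
    """Return the MIME type for a known file extension, else None."""
    # Compare case-insensitively so that e.g. "TRACK01.WAV" also matches
    extension = os.path.splitext(path)[1].lower()
    return EXTENSION_TO_MIME.get(extension)
```

Unlike signature-based identification, this trusts the file name completely, so a mislabelled file will get the wrong MIME type.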