bitsgalore closed this issue 7 years ago
As for FIDO integration: apparently FIDO still has no API, so integrating it into other Python applications looks like a bit of a mess:
http://wiki.opf-labs.org/display/KB/FIDO+Python+workflow+implementation+tips
FIDO also still has issues under Python 3:
https://github.com/openpreserve/fido/issues/79
So not going there for now.
Tika-python also doesn't work properly under Python 3:
from tika import detector
print(detector.from_file('/path/to/file'))
Results in:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 6: invalid start byte
There are some Python modules that wrap around libmagic (e.g. python-magic and filemagic), but they require the libmagic binaries to be installed separately, which is a bit of a pain on Windows.
Another possibility: write a custom detection function for only WAV, FLAC and ISO, based on the Tika signatures:
WAV:
<mime-type type="audio/x-wav">
<acronym>WAV</acronym>
<magic priority="20">
<match value="RIFF....WAVE" type="string" offset="0"
mask="0xFFFFFFFF00000000FFFFFFFF"/>
</magic>
<glob pattern="*.wav"/>
</mime-type>
FLAC:
<mime-type type="audio/x-flac">
<acronym>FLAC</acronym>
<_comment>Free Lossless Audio Codec</_comment>
<magic priority="50">
<match value="fLaC" type="string" offset="0"/>
</magic>
<glob pattern="*.flac"/>
</mime-type>
ISO 9660:
<mime-type type="application/x-iso9660-image">
<acronym>ISO</acronym>
<_comment>ISO 9660 CD-ROM filesystem data</_comment>
<magic priority="50">
<match value="CD001" type="string" offset="32769"/>
<match value="CD001" type="string" offset="34817"/>
<match value="CD001" type="string" offset="36865"/>
</magic>
<glob pattern="*.iso"/>
</mime-type>
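For reference, a minimal sketch of what such a custom detection function could look like, using only the standard library. The names `detect_format` and `ISO_OFFSETS` are made up for illustration. It assumes the WAV mask means "compare bytes 0-3 and 8-11, ignore the 4 size bytes in between", and that the three ISO 9660 offsets are alternatives (any one match is enough), which is my reading of the signatures above. It has not been tested against real-world files.

```python
# Offsets at which the Tika signature looks for the ISO 9660 "CD001" marker.
# Assumption: these are alternatives, so a match at any one offset is enough.
ISO_OFFSETS = (32769, 34817, 36865)


def detect_format(path):
    """Return a MIME type for WAV, FLAC or ISO 9660, or None if nothing matches."""
    with open(path, "rb") as f:
        header = f.read(12)

        # WAV: "RIFF" at bytes 0-3 and "WAVE" at bytes 8-11.
        # Bytes 4-7 hold the chunk size and are masked out in the signature.
        if header[0:4] == b"RIFF" and header[8:12] == b"WAVE":
            return "audio/x-wav"

        # FLAC: "fLaC" at offset 0.
        if header[0:4] == b"fLaC":
            return "audio/x-flac"

        # ISO 9660: "CD001" at one of the known offsets.
        # Reading past the end of a short file just returns fewer bytes.
        for offset in ISO_OFFSETS:
            f.seek(offset)
            if f.read(5) == b"CD001":
                return "application/x-iso9660-image"

    return None
```

Calling `detect_format('/path/to/file')` would then return one of the three MIME types from the signatures, or `None` for anything else. This only checks magic bytes, so it says nothing about whether the file is actually valid.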
Changed to WONTFIX because adding identification would require wrapping an external tool, which seems like overkill. Assignment is based on file extension.
Perhaps FIDO could be used for this?