dadoonet / fscrawler

Elasticsearch File System Crawler (FS Crawler)
https://fscrawler.readthedocs.io/
Apache License 2.0

Support image custom metadata #1410

Open VRciF opened 2 years ago

VRciF commented 2 years ago

Describe the bug

As requested by @dadoonet, I am opening this issue about being unable to index custom metadata.

My goal is to add or modify file metadata tags and have fscrawler index them.
For example, I am thinking of perceptual hashes (computed by https://github.com/lunardog/imagehash-cli) or faces recognized in an image, including the coordinates and names of the people shown, and similar data.

To create a custom tag, one needs to provide exiftool with a config file (e.g. named exif.hash.conf) that describes the tag keys, like so:

%Image::ExifTool::UserDefined = (
    'Image::ExifTool::XMP::xmp' => {
        ImageHashAverage => { Writable => 'string' },
        ImageHashPerception => { Writable => 'string' },
        ImageHashDifference => { Writable => 'string' },
        ImageHashWavelet => { Writable => 'string' },
        Test => { Writable => 'string' },
    },
);

1;

To add metadata, run exiftool -config exif.hash.conf -xmp-xmp:Test=somesteststring test.jpg
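When several custom tags have to be written at once, the command above can be assembled programmatically. The following is only a minimal sketch: the helper name build_exiftool_args is hypothetical, and it uses nothing beyond the -config and -xmp-xmp:TAG=VALUE forms already shown. It keeps -config as the first argument, which is where exiftool expects it.

```python
import shlex
from typing import Dict, List


def build_exiftool_args(config_path: str, tags: Dict[str, str], image_path: str) -> List[str]:
    """Build an exiftool argument list that writes custom XMP-xmp tags.

    Hypothetical helper for illustration; it only assembles the arguments
    and does not run exiftool itself.
    """
    # -config is placed first, before any tag assignment or file name.
    args = ["exiftool", "-config", config_path]
    for name, value in tags.items():
        # Same form as the manual command: -xmp-xmp:TAG=VALUE
        args.append(f"-xmp-xmp:{name}={value}")
    args.append(image_path)
    return args


if __name__ == "__main__":
    cmd = build_exiftool_args(
        "exif.hash.conf",
        {"Test": "somesteststring"},
        "test.jpg",
    )
    # Print a shell-safe version of the command for inspection.
    print(shlex.join(cmd))
```

The returned list can be handed directly to subprocess.run(cmd, check=True), which avoids shell quoting problems when tag values contain spaces.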

Below is an example test.jpg image which contains the following additional xmp tags:

Image Hash Average              : 0000000000000000
Image Hash Difference           : 0000000000000000
Image Hash Perception           : 8000000000000000
Image Hash Wavelet              : 0000000000000000
Test                            : somesteststring

Note: The hashes look odd because this is just a single-color test image.

The tags can be listed with the command exiftool test.jpg.
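To compare what exiftool reports with what ends up in the index, the listing can be turned into a dictionary. This is a rough sketch that assumes the plain "Name : value" layout shown above; the function name parse_exiftool_output is hypothetical. (exiftool's -j JSON output would probably be a more robust input, but the sketch sticks to the format quoted in this issue.)

```python
from typing import Dict


def parse_exiftool_output(text: str) -> Dict[str, str]:
    """Parse 'Tag Name : value' lines into a {name: value} dict."""
    tags: Dict[str, str] = {}
    for line in text.splitlines():
        # Split on the first colon only, since values may contain colons.
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines without a 'name : value' separator
        tags[name.strip()] = value.strip()
    return tags


# The custom tags listed for test.jpg in this issue.
SAMPLE = """\
Image Hash Average              : 0000000000000000
Image Hash Difference           : 0000000000000000
Image Hash Perception           : 8000000000000000
Image Hash Wavelet              : 0000000000000000
Test                            : somesteststring
"""

if __name__ == "__main__":
    for name, value in parse_exiftool_output(SAMPLE).items():
        print(f"{name} = {value}")
```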

If I run fscrawler with fscrawler --loop 1 --config_dir /jobs/images/config --restart images, only the most common tags are indexed, but not my custom metadata.

Job Settings

name: "images"
fs:
  url: "/tmp/es/images"
  update_rate: "15m"
  includes:
    - "*/*.jpg"
    - "*/*.png"
    - "*/*.tif"
    - "*/*.tiff"
  #excludes:
  #  - "*/resume*"
  #filters:
  #  - ".*foo.*"
  # error from fscrawler: cannot support both json and xml
  #json_support: true
  #xml_support: true
  add_as_inner_object: true
  #index_folders: false
  attributes_support: true
  raw_metadata: true
  #add_filesize: false
  remove_deleted: true
  continue_on_error: true
  lang_detect: true
  store_source: false
  #indexed_chars: "100000"
  #ignore_above: "512mb"
   # required
  index_content: true
  #indexed_chars: 0
  checksum: "SHA-1"
  #follow_symlink: true

elasticsearch:
  nodes:
    - url: "http://elasticsearch:9200"
  index: "images"
  index_folder: "images"

Logs

No special logs available. The requested metadata is just not indexed. No error, warning or anything like that. Basically just ignored.

Expected behavior

The custom metadata described above should be indexed by fscrawler, at least in meta.raw.
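One way to verify this expectation is to fetch the indexed document from Elasticsearch and check which keys are absent from meta.raw. The sketch below works on a document's _source as a plain dict. It assumes meta.raw is a flat key/value object, and the key name "xmp:Test" is purely hypothetical, since the name Tika would actually assign to a custom XMP tag is not known from this issue.

```python
from typing import Any, Dict, Iterable, List


def missing_raw_keys(source: Dict[str, Any], expected: Iterable[str]) -> List[str]:
    """Return the expected keys that are absent from source['meta']['raw']."""
    # Tolerate documents that have no 'meta' or no 'raw' object at all.
    raw = (source.get("meta") or {}).get("raw") or {}
    return [key for key in expected if key not in raw]


if __name__ == "__main__":
    # Hypothetical _source of an indexed image; real documents will differ.
    doc = {"meta": {"raw": {"Content-Type": "image/jpeg"}}}
    # "xmp:Test" is an assumed key name used only for illustration.
    print(missing_raw_keys(doc, ["Content-Type", "xmp:Test"]))
```

An empty result would mean every expected key was indexed; with the behavior described in this issue, the custom keys would be reported as missing.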

Versions:

Attachment

test

dadoonet commented 4 months ago

I'm very sorry that I missed your issue... w00t! 2 years later...

Can this be changed somehow? E.g. provide some custom config to fscrawler (or Tika) describing the custom tags?

I think that's actually an issue on the Tika side, but maybe it has been fixed in the meantime. I'm not sure if you are still using FSCrawler, but if you are, could you test it again with the latest 2.10-SNAPSHOT build?