
Non-Textual Content Search with ML/AI as one use case #2265

Open DiegoPino opened 9 months ago

DiegoPino commented 9 months ago

Motivation

Being able to use the IIIF Content Search API (3.x or 4.x) to search based on input other than plain text.

With the conversations and implementations around AI/ML rapidly permeating IIIF, it would be great to have a Content Search API (some future version) able to deal with, e.g., (dense) vector-based search for image similarity or audio/moving media, and inclusive of natural-language textual search ("What is a medieval manuscript?" as a query). This comes from a conversation that took place during the IIIF Search TSG call on October 3, 2023 (thanks for the invite!) and was motivated by some discussions at the first IIIF AI + Machine Learning Community Group meeting the week before. This has multiple applications but also a lot of challenges, so it can also be seen as a wishful request that could be explored as an extension to the specs (at first, or forever).

Ideas and Background

(Dense) vector-based search is not new, but in recent years deploying the infrastructure for it has become easier, with stable/production-ready implementations in backend systems like Solr (9+), Elasticsearch, and Vespa.ai, and also vector-specific ones like Annoy. As you all know, the current Content Search API (2.0) assumes search happens through a textual input (query) and returns Annotations. Vector search implies an actual vector as input that is compared via a similarity approach (cosine, dot product, Euclidean, etc.) against indexed vectors on the backend, returning a list of matches with "similarity" rankings, which will also lead us to talk about sorting (see below). These (dense) vectors can have different dimensions/"meanings"/contexts and thus are very specific to the implementation decisions made during feature extraction (what happens via ML/AI before anything ends up in a search index): which model was used, and against what media (text, image, video, sound, etc.) the vectors were generated. Input vectors and indexed vectors also need to match in approach: e.g., a model specialized in furniture detection will generate very different features/vectors than one that deals well with faces, so searching one against the other might not give good results, even if the neural network is very deep. Dimensions might vary too!
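
Just to illustrate the comparison step (none of this is, or would be, IIIF-visible), here is a minimal sketch of how a backend might rank indexed vectors against a query vector; the dimensionality and the data are made up:

```python
import numpy as np

def cosine_similarity(query: np.ndarray, indexed: np.ndarray) -> np.ndarray:
    """Rank indexed vectors (rows) against a single query vector."""
    # Both sides must come from the same feature space: same dimensionality
    # AND the same model, or the comparison is meaningless.
    if query.shape[0] != indexed.shape[1]:
        raise ValueError("query and indexed vectors have different dimensions")
    query = query / np.linalg.norm(query)
    indexed = indexed / np.linalg.norm(indexed, axis=1, keepdims=True)
    return indexed @ query  # one similarity score per indexed vector

# Stand-ins: 512-dimensional embeddings, as if from some image model.
index_vectors = np.random.rand(1000, 512)  # what Solr/Elasticsearch/etc. would hold
query_vector = np.random.rand(512)         # the vectorized query image
scores = cosine_similarity(query_vector, index_vectors)
top_10 = np.argsort(scores)[::-1][:10]     # most similar first
```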

(A) How (without passing vectors around)

IIIF Manifests (or IIIF Image API responses) do not hold vector data per se (and even if that were possible, it would be base64-encoded, HUGE, and not really the right place, given the many-to-one nature of vector production and the obscured dependency on implementation details). So the "how" I want to suggest is to make this new feature as backend-agnostic as possible and, instead of touching every API to adapt to the use case, re-use what IIIF already provides in the current versions/specs: allow a "reference" to a media resource/fragment to be used as the query input for Content Search. Basically, your query input would be a complete W3C-compliant Web Annotation, as used in IIIF and defined in https://www.w3.org/TR/annotation-model/.
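
To make this concrete, a query input could look something like the sketch below (written as a Python dict purely for readability). The general structure comes from the W3C Web Annotation model; its use as a search input, and anything similarity-specific, is exactly what this issue proposes and exists in no current spec:

```python
# A hypothetical query: a Web Annotation targeting an image region
# (say, the face) on a Canvas.
query_annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "motivation": "highlighting",  # or an implementation-specific hint
    "target": {
        "source": "https://example.org/iiif/manifest1/canvas/p1",
        "selector": {
            "type": "FragmentSelector",
            "conformsTo": "http://www.w3.org/TR/media-frags/",
            "value": "xywh=120,80,240,240",  # the square around the face
        },
    },
}
```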

In other words, with an example: I want to search for "similar images to a given face". The Search API takes as its query argument a W3C Web Annotation denoting the fragment of the image/canvas to be used as input, plus any other data contained in the annotation (which could also carry implementation-specific hints not found in a typical "commenting"-motivated annotation). The response would be a list of Annotations pointing to Canvases, across multiple Manifests if a non-resource-specific query is made. In fact, a response would be very similar to the current responses for textual queries, maybe just with some extra "metadata" like a similarity score. The same could be done to search for a time-based media fragment (e.g., the first 5 seconds of a piece of audio).
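
A response could then stay very close to the annotation pages Content Search 2.0 already returns; the sketch below only adds an invented "similarity" property per hit, which no current spec defines:

```python
# Hypothetical response: roughly the shape of a Content Search 2.0
# AnnotationPage, plus a made-up per-hit similarity value.
response = {
    "@context": "http://iiif.io/api/search/2/context.json",
    "type": "AnnotationPage",
    "items": [
        {
            "type": "Annotation",
            "motivation": "highlighting",
            "target": "https://example.org/iiif/manifest2/canvas/p7#xywh=300,410,220,220",
            "similarity": 0.93,  # invented: e.g., a cosine score from the backend
        },
        {
            "type": "Annotation",
            "motivation": "highlighting",
            "target": "https://example.org/iiif/manifest9/canvas/p2#xywh=10,15,200,210",
            "similarity": 0.87,
        },
    ],
}
```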

The implementation details of how the "input" Annotation targeting a square around a face in a certain Canvas (or a specific resource, e.g., a IIIF Image directly) gets processed can be left out of the spec, which is good! Common ways of doing vector-based search in production are to let users upload a reference (query) image, or to select a segment from an existing resource and search it against the rest. The "vectorization" of the input would happen as a non-IIIF-exposed process on the backend, which also means vectors never move around via IIIF. The backend processes the Annotation sent as input, extracts one or more vectors the way "it knows it can", compares them against the existing vectors, and returns similar images "the way it can", based on its capabilities.
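
A backend might handle the annotation along these lines; everything here is a sketch, and the `embed`/`index_search` callables are stand-ins for whatever model and index (Solr, Elasticsearch, Vespa.ai, Annoy...) an implementation actually uses:

```python
import io
import urllib.request
from typing import Callable

import numpy as np
from PIL import Image

def handle_similarity_query(
    query_annotation: dict,
    embed: Callable[[Image.Image], np.ndarray],       # any feature-extraction model
    index_search: Callable[[np.ndarray, int], list],  # any vector index
    top_k: int = 10,
) -> list:
    """Hypothetical server-side handling of an annotation-as-query."""
    target = query_annotation["target"]

    # 1. Resolve the targeted region. This naively assumes the source URI
    #    dereferences to an image; a real backend would resolve the Canvas to
    #    its Image API service and could fetch only the region, e.g.
    #    .../120,80,240,240/max/0/default.jpg
    x, y, w, h = (int(v) for v in
                  target["selector"]["value"].removeprefix("xywh=").split(","))
    with urllib.request.urlopen(target["source"]) as resp:
        region = Image.open(io.BytesIO(resp.read())).crop((x, y, x + w, y + h))

    # 2. Vectorize "the way the backend knows it can"; the model is its own
    #    choice and is never exposed through IIIF.
    vector = embed(region)

    # 3. Compare against the indexed vectors and return the top matches
    #    (as Annotations); the vectors themselves never travel over the wire.
    return index_search(vector, top_k)
```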

Other positive consequences of this approach: you could even have "non-ML" processing/responses for the same input, comparing, e.g., similar metadata, colors, or hues found in an Annotation at any level (who said geographic query? you pass a polygon!), or just the (textual) Annotation body, keeping compatibility with the current "loose" text used in Content Search API 2.0. Basically, we would be re-using existing IIIF structures as query/response instead of extending our definition of Content. A purely non-ML backend could, for instance, answer the same annotation query with a color-histogram comparison, as sketched below.
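
A rough sketch of such a non-ML fallback, assuming plain RGB histograms as the "features":

```python
import numpy as np
from PIL import Image

def color_histogram(image: Image.Image, bins: int = 8) -> np.ndarray:
    """A simple RGB histogram as a 'feature vector' -- no ML involved."""
    pixels = np.asarray(image.convert("RGB")).reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=((0, 256),) * 3)
    hist = hist.flatten()
    return hist / hist.sum()  # normalize so images of any size compare fairly

def histogram_intersection(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity in [0, 1]; 1.0 means identical color distributions."""
    return float(np.minimum(a, b).sum())
```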

Sorting?

Sorting is absent from the 2.0 specs, and for this use case, as well as others (see https://github.com/IIIF/api/issues/507), it could be a really good plus. Many AI/ML vector-based search applications can basically return everything, so sorting matters (e.g., what if I want the least similar to this?). But it also adds many burdens to the current implementations, so I will leave this as an extra to be considered; it could also be an "invisible to the user, same as the actual vectors" decision the backend makes.
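
If it were exposed at all, it could be as small as an extra query parameter; the parameter names below are invented and exist in no current spec:

```python
import urllib.parse

# Hypothetical only: neither "sortBy" nor "order" exist in Content Search 2.0.
params = urllib.parse.urlencode({"sortBy": "similarity", "order": "asc"})
url = f"https://example.org/iiif/search?{params}"
# -> https://example.org/iiif/search?sortBy=similarity&order=asc
# "asc" would surface the *least* similar matches first.
```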

Note: @glenrobson @kirschbombe ping! I cannot add labels in this repo (sorry), so would love your help with that. I hope I explained the use cases/needs here clearly enough. Happy to edit/add more info if needed. Thanks!