dbmdz / solr-ocrhighlighting

Highlighting various OCR formats directly in Solr
https://dbmdz.github.io/solr-ocrhighlighting
MIT License
OCR and spatial search #70

Open legsak1mbo opened 4 years ago

legsak1mbo commented 4 years ago

This is more of a feature request than an issue, but it would be incredibly useful if the hOCR data could be used for querying as well as highlighting, for example to search for a word within a specific region of the document by its page and/or coordinates.

jbaiter commented 4 years ago

Doing this "properly" for arbitrary regions is out of scope for this specific plugin, I'm afraid, since it does not store any information about the actual coordinates in the index and thus can't query for it (the way Solr's Spatial Search does, for example).

One hacky way to go about this would be to add a filterBbox parameter that is checked at highlighting time against the bounding boxes in the OCR file. All snippets falling outside of the queried bounding box would be filtered out. This shouldn't be too hard to implement, since we have access to the bounding box information at highlighting time and can filter on it directly. This could be a good issue for a pull request from a new developer :-)
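
To make the bounding-box filter idea concrete, here is a minimal sketch in Python (not the plugin's own code, and not an existing feature). The `Box` type, the snippet layout with a `boxes` list, and the rule "keep a snippet only if all of its boxes lie inside the queried box" are assumptions made for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixel coordinates (hypothetical type)."""
    x0: int
    y0: int
    x1: int
    y1: int


def is_inside(inner: Box, outer: Box) -> bool:
    """True if `inner` lies completely within `outer`."""
    return (inner.x0 >= outer.x0 and inner.y0 >= outer.y0
            and inner.x1 <= outer.x1 and inner.y1 <= outer.y1)


def filter_snippets(snippets, query_box):
    """Drop every snippet that has no boxes or has a box outside `query_box`.

    Each snippet is assumed to be a dict with a "boxes" list of Box values.
    """
    return [
        s for s in snippets
        if s["boxes"] and all(is_inside(b, query_box) for b in s["boxes"])
    ]
```

Whether a snippet that only partially overlaps the queried box should be kept is a design decision; this sketch takes the strict reading and drops it.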

There is, however, already support for filtering by a specific page in a document: check out the hl.ocr.pageId parameter in the documentation. Combined with an fq on the document id, it allows you to limit the snippet generation to a single page in a single document. We use this to implement the IIIF Content Search API, which requires searching in a single page of a document.
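
As a rough illustration of the page-restricted query, the request parameters could be assembled like this in Python. The field names `id` and `ocr_text`, the example document and page identifiers, and the exact combination of parameters are assumptions; check the plugin documentation for what your setup actually needs.

```python
from urllib.parse import urlencode


def build_page_query(term: str, doc_id: str, page_id: str) -> str:
    """Build a query string that limits OCR snippets to one page of one document.

    Assumes the OCR text lives in a field called `ocr_text` and the
    document identifier in a field called `id` (both hypothetical).
    """
    params = [
        ("q", f"ocr_text:{term}"),
        ("fq", f"id:{doc_id}"),        # restrict to a single document
        ("hl", "true"),
        ("hl.ocr.fl", "ocr_text"),     # OCR field to highlight (assumed name)
        ("hl.ocr.pageId", page_id),    # restrict snippets to a single page
    ]
    return urlencode(params)


# Hypothetical identifiers, for illustration only.
query_string = build_page_query("Name", "census-1901", "page_5")
```

The resulting string would be appended to the select handler URL of your Solr core.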

legsak1mbo commented 4 years ago

Sorry I've been so long coming back to this. The ideal would be if we could search for "the first instance of a term after the previous" and/or "a term X & Y away from an anchor term" where the anchor would be something like a chapter title. Java isn't really my forte but I'll certainly look into it.

jbaiter commented 4 years ago

If you want to implement search inside of chapters, you could just index your documents at the chapter level by creating source pointers that point to the markup for each chapter. This is described in the documentation here: https://dbmdz.github.io/solr-ocrhighlighting/indexing/#one-or-more-partial-files-per-solr-document

Otherwise, this is hard to implement with Lucene/Solr and the plugin in its current form. You could try sloppy phrase queries like "<chapter_word> <term>"~20, which would yield all spans where <term> appears within 20 token positions of <chapter_word>, but this will also include cases where the term appears before the chapter word.
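
One way around the ordering problem would be to post-filter on the client. The following Python sketch assumes you already have the page or snippet text split into tokens; the function name and the simple whitespace tokenization are illustrative and do not mirror Solr's analysis chain.

```python
def term_follows_anchor(tokens, anchor, term, max_distance=20):
    """True if `term` occurs after `anchor`, at most `max_distance` tokens later.

    Unlike the proximity query, this rejects matches where the term
    appears before the anchor.
    """
    anchor_positions = [i for i, t in enumerate(tokens) if t == anchor]
    term_positions = [i for i, t in enumerate(tokens) if t == term]
    return any(
        0 < tp - ap <= max_distance
        for ap in anchor_positions
        for tp in term_positions
    )


after = term_follows_anchor("Chapter One the whale appears".split(), "Chapter", "whale")
before = term_follows_anchor("the whale before Chapter One".split(), "Chapter", "whale")
```

Here `after` is true and `before` is false, so only the first text would survive the filter.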

I'm not sure if the approach proposed in my first response is going to work for you, since you'd need to know the specific region on a given page where a match is allowed to occur. This could be useful for a feature like "search only in headers/footers" (if those headers/footers appear in the same positions every time), but that is not your use case if I understood you correctly?

legsak1mbo commented 4 years ago

What I'm thinking of is something like an old census form where the scans are all slightly wonky. The idea would be that you could use something like "Name" as an anchor, search for the first instance of that and then, knowing that the subject's name would be X & Y pixels from the anchor term, return the actual name as the result.

So like a position-aware query but based on the actual OCR coordinates rather than the position of the term in the text.

jbaiter commented 4 years ago

I see! A hacky and probably inefficient way to do this without changes to the plugin could be:

  1. Perform a query for Name to get candidate locations for the anchor
  2. Apply some heuristics to determine which of those candidates are actually anchors
  3. Based on the anchor location, determine one or more regions where the subject name is likely to be located
  4. Query for the terms on the anchor location's page (with the hl.ocr.pageId parameter) and throw away all snippets that don't overlap with the subject name regions
  5. You're hopefully left with matches for your subject name in close proximity to a Name field title.
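
Steps 3 and 4 could be sketched roughly like this in Python. The `Box` type, the offsets, and the idea of representing each snippet by a single bounding box are assumptions for illustration; the real snippet structure returned by the plugin is richer.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixel coordinates (hypothetical type)."""
    x0: int
    y0: int
    x1: int
    y1: int


def region_from_anchor(anchor: Box, dx: int, dy: int, width: int, height: int) -> Box:
    """Step 3: derive the expected region from the anchor's top-left corner."""
    x0 = anchor.x0 + dx
    y0 = anchor.y0 + dy
    return Box(x0, y0, x0 + width, y0 + height)


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes share any area."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def keep_overlapping(snippet_boxes, region: Box):
    """Step 4: throw away all snippet boxes that don't overlap the region."""
    return [box for box in snippet_boxes if overlaps(box, region)]
```

Since the scans are slightly skewed, the region would probably need some padding; the width and height here are the place to add that tolerance.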

legsak1mbo commented 4 years ago

Right, I see. But presumably that wouldn't work with a wildcard search (search for any name in the region) because you wouldn't get the highlighting for that?

jbaiter commented 4 years ago

Yes, correct. If you're just interested in the general content of a region, you can replace steps 4 and 5 with parsing the OCR for the page and extracting the text in the subject name regions yourself.
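
For hOCR input, extracting the text in a region could look roughly like the Python sketch below. It assumes words are marked up as `ocrx_word` spans with a `bbox x0 y0 x1 y1` entry in the `title` attribute and that word spans contain no nested spans; other OCR formats would need a different parser.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordCollector(HTMLParser):
    """Collects (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Simplification: assumes no nested <span> inside a word span.
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((text, self._bbox))
            self._bbox = None


def text_in_region(hocr: str, region) -> str:
    """Join all words whose boxes lie fully inside region (x0, y0, x1, y1)."""
    parser = WordCollector()
    parser.feed(hocr)
    parser.close()
    rx0, ry0, rx1, ry1 = region
    return " ".join(
        text for text, (x0, y0, x1, y1) in parser.words
        if x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1
    )
```

Calling `text_in_region(page_hocr, region)` with the region derived from the anchor would then return whatever was recognized there, with no query term required.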

legsak1mbo commented 4 years ago

Thanks. I'll come back with any progress I make.