Open legsak1mbo opened 4 years ago
Doing this "properly" for arbitrary regions is out of scope for this specific plugin, I'm afraid, since it does not store any information about the actual coordinates in the index and thus can't query for it (e.g. like Solr's Spatial Search).
One hacky way to go about this would be to add a `filterBbbox` parameter that is then checked at highlighting time against the bounding boxes in the OCR file. All snippets falling outside of the queried bounding box would be filtered out. This shouldn't be too hard to implement, since we have access to the bounding box information at highlighting time and can thus filter very easily based on it. This could be a good issue for a pull request for a new developer :-)
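To make the filtering idea concrete, here is a minimal client-side sketch in Python. It is not the plugin's code (the plugin is written in Java), and the snippet structure, a dict with a `highlights` list of `(ulx, uly, lrx, lry)` tuples, is an assumption made purely for illustration.

```python
from typing import Dict, List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned box in pixels: upper-left (ulx, uly), lower-right (lrx, lry)."""
    ulx: int
    uly: int
    lrx: int
    lry: int


def overlaps(a: Box, b: Box) -> bool:
    """Return True if the two boxes share any area."""
    return a.ulx < b.lrx and b.ulx < a.lrx and a.uly < b.lry and b.uly < a.lry


def filter_snippets(snippets: List[Dict], query_box: Box) -> List[Dict]:
    """Keep snippets with at least one highlight box overlapping query_box.

    The snippet layout used here is hypothetical, not the plugin's response format.
    """
    return [
        s for s in snippets
        if any(overlaps(Box(*h), query_box) for h in s["highlights"])
    ]


snippets = [
    {"text": "inside", "highlights": [(10, 10, 20, 20)]},
    {"text": "outside", "highlights": [(200, 200, 210, 210)]},
]
kept = filter_snippets(snippets, Box(0, 0, 100, 100))
```

Whether a snippet should count as "inside" when it only partially overlaps the queried box is a design choice; this sketch treats any overlap as a match.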
There is, however, currently support for filtering by a specific page in a document; check out the `hl.ocr.pageId` parameter in the documentation. Combined with an `fq` on the document id, it allows you to limit the snippet generation to a single page in a single document. We use this to implement the IIIF Content Search API, which requires searching in a single page of a document.
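As a rough illustration, the request parameters for such a single-page query could be assembled like this. The field names `ocr_text` and `id`, the document id and the page id are placeholders for whatever your schema and OCR files use; please check the plugin documentation for the exact set of highlighting parameters your version expects.

```python
from urllib.parse import urlencode


def page_query_params(term, doc_id, page_id, ocr_field="ocr_text", id_field="id"):
    """Build Solr parameters restricting OCR highlighting to one page of one document.

    ocr_field and id_field are assumptions about the schema, not fixed names.
    """
    return {
        "q": f"{ocr_field}:{term}",
        "fq": f'{id_field}:"{doc_id}"',   # limit the result to a single document
        "hl": "true",
        "hl.ocr.fl": ocr_field,           # field to run OCR highlighting on
        "hl.ocr.pageId": page_id,         # limit snippets to a single page
    }


params = page_query_params("census", "doc-0001", "page_12")
query_string = urlencode(params)
```

The resulting `query_string` can be appended to the select handler URL of your core.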
Sorry I've been so long coming back to this. The ideal would be if we could search for "the first instance of a term after the previous" and/or "a term X & Y away from an anchor term" where the anchor would be something like a chapter title. Java isn't really my forte but I'll certainly look into it.
If you want to implement search inside of chapters, you could just index your documents at the chapter level by creating source pointers that point to the markup for that chapter. This is described in the documentation here: https://dbmdz.github.io/solr-ocrhighlighting/indexing/#one-or-more-partial-files-per-solr-document.
Otherwise this is hard to implement with Lucene/Solr and the plugin in its current form. You could try sloppy phrase queries like `"<chapter_word> <term>"~20`, which would yield all spans where `<term>` appears within 20 token positions of `<chapter_word>`, but this will also include cases where the term appears before the chapter.
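A small sketch of how the sloppy phrase query could be built, plus a client-side check that discards hits where the term comes before the anchor. The second helper works on a plain token list (for example the snippet text split on whitespace) and is only an approximation for post-filtering; it does not reproduce Lucene's slop semantics. Both function names are made up for this example.

```python
def sloppy_phrase(anchor: str, term: str, slop: int = 20) -> str:
    """Build a sloppy phrase query of the form "<anchor> <term>"~slop."""
    for word in (anchor, term):
        if not word.strip() or '"' in word:
            raise ValueError(f"unsupported query word: {word!r}")
    return f'"{anchor} {term}"~{slop}'


def follows_anchor(tokens, anchor, term, window=20):
    """Return True if `term` occurs after `anchor` within `window` token positions."""
    anchor_positions = [i for i, tok in enumerate(tokens) if tok == anchor]
    return any(
        tokens[j] == term
        for i in anchor_positions
        for j in range(i + 1, min(i + window + 1, len(tokens)))
    )


query = sloppy_phrase("Chapter", "whale")
after = follows_anchor("Chapter one the whale".split(), "Chapter", "whale")
before = follows_anchor("the whale Chapter one".split(), "Chapter", "whale")
```

Here `after` is true and `before` is false, so only the first token sequence would survive the post-filter.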
I'm not sure if the approach proposed in my first response is going to work for you, since you'd need to know the specific region on a given page where a match is allowed to occur. This could be useful for a feature like "search only in headers/footers" (if those headers/footers appear in the same positions every time), but that is not your use case if I understood you correctly?
What I'm thinking of is something like an old census form where the scans are all slightly wonky. The idea would be that you could use something like "Name" as an anchor and search for the first instance of it. Then, knowing that the subject's name would be X & Y pixels from the anchor term, you could provide the actual name as the result.
So like a position-aware query but based on the actual OCR coordinates rather than the position of the term in the text.
I see! A hacky and probably inefficient way to do this without changes to the plugin could be:
- [...] `Name` to get candidate locations for the anchor
- [...] `hl.ocr.pageId` parameter) and throw away all snippets that don't overlap with the subject name regions
- [...] `Name` field title.
Right, I see. But presumably that wouldn't work with a wildcard search (search for any name in the region) because you wouldn't get the highlighting for that?
Yes, correct. If you're just interested in the general content of a region, you can replace steps 4 and 5 with just parsing the OCR for the page and extracting the text in the subject name regions yourself.
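For the "parse the OCR yourself" route, a minimal sketch with the Python standard library might look like the following. It assumes hOCR input where each word is a `span` with class `ocrx_word` and a `bbox x0 y0 x1 y1` entry in its `title` attribute; the offsets and region size are values you would have to measure on your own forms, and all function names are invented for this example.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            self._bbox = tuple(map(int, match.groups())) if match else None

    def handle_data(self, data):
        if self._bbox and data.strip():
            self.words.append((data.strip(), self._bbox))

    def handle_endtag(self, tag):
        if tag == "span":
            self._bbox = None


def text_in_region(hocr, anchor_word, dx, dy, width, height):
    """Return the words overlapping a region placed relative to the first anchor hit."""
    parser = HocrWords()
    parser.feed(hocr)
    anchor = next((box for text, box in parser.words if text == anchor_word), None)
    if anchor is None:
        return None
    # Region is offset (dx, dy) from the anchor's upper-left corner.
    rx0, ry0 = anchor[0] + dx, anchor[1] + dy
    rx1, ry1 = rx0 + width, ry0 + height
    inside = [
        text for text, (x0, y0, x1, y1) in parser.words
        if x0 < rx1 and rx0 < x1 and y0 < ry1 and ry0 < y1
    ]
    return " ".join(inside)


SAMPLE = """<div class='ocr_page' title='bbox 0 0 1000 1000'>
<span class='ocrx_word' title='bbox 100 200 180 230; x_wconf 95'>Name</span>
<span class='ocrx_word' title='bbox 300 198 380 232; x_wconf 90'>Ada</span>
<span class='ocrx_word' title='bbox 390 199 500 231; x_wconf 90'>Lovelace</span>
<span class='ocrx_word' title='bbox 100 400 180 430; x_wconf 90'>Age</span>
</div>"""

name = text_in_region(SAMPLE, "Name", dx=150, dy=-10, width=400, height=50)
```

Using a generous region and an overlap test (rather than strict containment) is meant to tolerate slightly skewed scans; for heavily rotated pages you would probably need to deskew first.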
Thanks. I'll come back with any progress I make.
More of a feature request than an issue but it would be incredibly useful if the HOCR data could be used for querying as well as highlighting. For example searching for a word within a specific region of the document by its page and/or coordinates.