Princeton-CDH / geniza

version 4.x of the Princeton Geniza Project
https://geniza.princeton.edu
Apache License 2.0

RegEx search (#1631) #1588

Closed blms closed 1 month ago

blms commented 4 months ago

In this PR

Per #1631:

Questions

Additional notes

- `regex_search` is where the Lucene regex query is built
- `transcription_regex` is the only field searched in this mode
- `get_regex_highlight` is where the results are manually highlighted
- used a bit of fancy regex to get ~150 characters of context before and after the highlight, terminating at word boundaries
- team wanted to be able to search across multiple lines of transcription like PGPv3, so regex results in context do not display line numbers (also like PGPv3), and this was deemed an acceptable tradeoff
- updated the `clean_html` method to prevent extra whitespace getting added inside certain tags, as it otherwise breaks formatting for highlights
- fwiw: performance using Django ORM was about the same as Solr for me locally, and the team confirmed it performs well in testing, so no need to reimplement

Unrelated to regex search:

- we're now sometimes getting matches across multiple transcriptions on the same document, so I added a little ellipsis to the template in case that happens
- also added a feature flag and template logic/CSS for displaying relevance score
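For readers who want a feel for the highlighting approach in the notes above (roughly 150 characters of context on each side of a match, cut at word boundaries), here is a minimal sketch. It is not the code in this PR: the function name `highlight_with_context`, the `<em>` wrapper, and the use of Python's `re` module (whose syntax differs from Lucene's regex dialect) are all assumptions made for illustration.

```python
import re
from html import escape

CONTEXT_CHARS = 150  # approximate context per side, as described in the notes


def highlight_with_context(text, pattern, context=CONTEXT_CHARS):
    """Hypothetical helper: return one HTML snippet per regex match.

    Each snippet holds up to `context` characters before and after the
    match, trimmed so it does not begin or end in the middle of a word.
    """
    snippets = []
    # DOTALL lets "." span newlines, so a match can cross transcription lines
    for match in re.finditer(pattern, text, flags=re.DOTALL):
        start, end = match.span()
        if start == end:
            continue  # ignore empty matches

        before = text[max(0, start - context):start]
        if start - context > 0:
            # window began mid-text: drop everything up to the first whitespace
            first_space = re.search(r"\s", before)
            before = before[first_space.end():] if first_space else ""

        after = text[end:end + context]
        if end + context < len(text):
            # window ended mid-text: drop the trailing (possibly partial) word
            last_space = re.search(r"\s\S*\Z", after)
            after = after[:last_space.start()] if last_space else ""

        snippets.append(
            f"{escape(before)}<em>{escape(match.group(0))}</em>{escape(after)}"
        )
    return snippets
```

With a small context window, `highlight_with_context("alpha beta gamma delta epsilon zeta", "gamma", context=8)` returns `["beta <em>gamma</em> delta"]`. Trimming at whitespace can drop a whole word when the window happens to land exactly on a boundary, which seems acceptable for an approximate context size.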
codecov[bot] commented 4 months ago

    Codecov Report

    All modified and coverable lines are covered by tests :white_check_mark:

    Project coverage is 98.90%. Comparing base (3b566f9) to head (d8c1da7). Report is 33 commits behind head on develop.

    Additional details and impacted files

    ```diff
    @@            Coverage Diff            @@
    ##           develop    #1588    +/-   ##
    =========================================
      Coverage    98.89%   98.90%
    =========================================
      Files          241      241
      Lines        14724    14851   +127
    =========================================
    + Hits         14561    14688   +127
      Misses         163      163
    ```