Use the GLOB operator instead of LIKE for more capable character matching. This change allows us to identify 'words' within a gloss as runs of characters with a space or comma on either side.
This means we can support searching within multiple glosses, e.g. hello, salute.
GLOB is case sensitive, so we must transform the gloss and the term to lowercase. We also perform some simple escaping to avoid glob patterns being used within the search term itself.
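As a rough illustration of the lowercasing and escaping described above (a minimal sketch, not necessarily the code in this change): SQLite's GLOB has no escape character, so one common approach is to wrap each metacharacter in a single-character bracket set. The helper names `escape_glob` and `normalize_term` are hypothetical, and the sketch assumes the gloss column is lowercased on the SQL side, for example with `lower(gloss)`.

```ruby
# Hypothetical helper: neutralise GLOB metacharacters in a user-supplied term.
# "*", "?" and "[" are each wrapped in a bracket set so they match literally.
# A lone "]" outside a set has no special meaning, so it is left untouched.
def escape_glob(term)
  term.gsub(/[*?\[]/) { |char| "[#{char}]" }
end

# Hypothetical helper: lowercase first (GLOB is case sensitive), then escape.
def normalize_term(term)
  escape_glob(term.downcase)
end

puts normalize_term("Hello")   # plain words are only lowercased
puts normalize_term("What?")   # the "?" is wrapped so it matches literally
```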
In terms of capability, for this sort of matching, the ordering is pretty much LIKE -> GLOB -> MATCH. I'm trying to move incrementally before switching to regexp matching, for a couple of reasons:
Regexps are a bit harder to parse by readers, and tend to be overextended instead of switching to a different type of query as they grow more complex.
SQLite doesn't have a built-in regexp function by default. We can add one, but I'm unsure of the performance overhead of calling out of SQLite back into Ruby. I think it's negligible, but almost certainly slower than GLOB, which is native.
The glob being applied (*[ ,]:term[ ,]*) breaks down as follows:
Any characters (including nothing) - *
Exactly one of comma or space - [ ,]
The term being searched for - :term
Exactly one of comma or space - [ ,]
Any characters (including nothing) - *
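The breakdown above can be tried out in plain Ruby. This is only a sketch: it uses `File.fnmatch` from the standard library as a stand-in for SQLite's GLOB (the two agree on `*` and `[ ,]` for these inputs, but they are not the same implementation), and the helper names `gloss_pattern` and `gloss_matches?` are hypothetical. Because `[ ,]` consumes exactly one character, a word at the very start or end of a gloss only matches if there is a separator next to it, so the sketch pads the gloss with a space on each side. Whether the actual query pads the gloss this way is an assumption.

```ruby
# Stand-in only: File.fnmatch is Ruby's glob matcher, not SQLite's GLOB.
SEPARATOR_SET = "[ ,]" # exactly one space or comma

# Hypothetical helper: build the glob for a lowercased search term.
def gloss_pattern(term)
  "*#{SEPARATOR_SET}#{term.downcase}#{SEPARATOR_SET}*"
end

# Hypothetical helper: pad the lowercased gloss so that the first and last
# words also have a separator on both sides (an assumption of this sketch).
def gloss_matches?(gloss, term)
  padded = " #{gloss.downcase} "
  File.fnmatch(gloss_pattern(term), padded)
end

puts gloss_matches?("Hello, salute", "hello")   # whole word inside the gloss
puts gloss_matches?("Othello", "hello")         # term is only part of a word
```

Without the padding, `File.fnmatch(gloss_pattern("hello"), "hello, salute")` is false, because nothing precedes `hello` for the leading `[ ,]` to consume.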
In the future, we have new matching rules planned for partial matches. This is likely to change the glob to add a wildcard either trailing the term, or surrounding the term. This will still examine each 'word' in a gloss, but will allow the word to partially match, rather than exactly matching.
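To make the two planned variants concrete, here is a speculative sketch of how the glob might change. None of this is implemented in this change: the helper name `partial_gloss_pattern` and its `mode:` values are hypothetical, `File.fnmatch` is again used only as a stand-in for SQLite's GLOB, and the gloss is assumed to be lowercased and padded with a space on each side.

```ruby
# Hypothetical helper: build a glob for exact, prefix, or surrounding matches.
def partial_gloss_pattern(term, mode: :prefix)
  word =
    case mode
    when :exact    then term          # current behaviour: whole word only
    when :prefix   then "#{term}*"    # wildcard trailing the term
    when :contains then "*#{term}*"   # wildcard surrounding the term
    else raise ArgumentError, "unknown mode: #{mode}"
    end
  "*[ ,]#{word}[ ,]*"
end

gloss = " hello, salute " # lowercased and padded, as assumed above

puts File.fnmatch(partial_gloss_pattern("hel"), gloss)                   # prefix of a word
puts File.fnmatch(partial_gloss_pattern("ell"), gloss)                   # not a prefix of any word
puts File.fnmatch(partial_gloss_pattern("ell", mode: :contains), gloss)  # inside a word
```

One thing to keep in mind with the surrounding form: `*` can also consume separators, so on a padded gloss it behaves much like a plain substring search.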
This change was specifically introduced to resolve searching for multiple gloss words. The test case for this is "hello, salute", but there are plenty of others where the gloss being searched for is the first word and other glosses are also included (typically this happens when the same sign can mean different things depending on context, AFAIK).
Before this change, "Hello, salute" was not included in a search for "hello", because the word Hello was not included in the gloss.
After this change, "Hello, salute" is included in the search for "hello", because the string hello, salute matches the glob *[ ,]hello[ ,]*.