Closed kunwp1 closed 1 day ago
Thanks for the PR! The reason and the fix make sense. I see you added test cases, that's really helpful!
Two concerns:
- Are there any benefits of using SimpleAnalyzer? If so, do we want users to be able to choose from different searching modes?
- Is StandardAnalyzer efficient enough? Can you do a comparison of the time when searching with different analyzers against 10K tuples?
- The main difference between SimpleAnalyzer and StandardAnalyzer lies in how they handle numeric and special character searches. SimpleAnalyzer ignores these, whereas StandardAnalyzer includes them. I believe offering SimpleAnalyzer may not be necessary since, for most users, it’s intuitive to include numbers in keyword searches, as seen in common search engines like Google.
- I conducted a performance test using 4.5M tuples (1.09GB) of data with the keyword "happy well."
  - StandardAnalyzer: 1 minute 19 seconds
  - SimpleAnalyzer: 1 minute 21 seconds

  There’s no significant difference in performance between the two.
Great! Please include some of that information in the PR description.
This PR addresses an issue with the keyword search operator when the search term contains a digit. The design of the keyword search operator can be found here.
Background
Previously, the keyword search operator used the SimpleAnalyzer provided by Lucene. However, this analyzer's tokenization strategy caused problems with numeric tokens. Specifically, SimpleAnalyzer splits input strings at every non-letter character and lowercases the resulting tokens, so digits and special characters are always discarded.
For example:
- "3 stars" is tokenized as ["stars"]
The number 3 is dropped because it is not a letter. As a result, standalone numeric keywords such as "3" could not be searched, because they were never indexed.
Fix
To resolve this issue, the analyzer was changed from SimpleAnalyzer to StandardAnalyzer. StandardAnalyzer retains numeric tokens during analysis, ensuring that standalone numbers like 3 are properly indexed and searchable.
I conducted a performance test using 4.5M tuples (1.09GB) of data with the keyword "happy well."
- StandardAnalyzer: 1 minute 19 seconds
- SimpleAnalyzer: 1 minute 21 seconds
There’s no significant difference in performance between the two.
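To make the tokenization difference concrete, here is a minimal sketch in plain Python. It does not call Lucene: the two functions (`simple_like_tokens` and `standard_like_tokens`, both hypothetical names) only approximate the analyzers with regular expressions, assuming that SimpleAnalyzer keeps runs of letters and StandardAnalyzer keeps runs of letters and digits. The real StandardAnalyzer follows Unicode word-break rules and handles many more cases.

```python
import re


def simple_like_tokens(text: str) -> list[str]:
    """Approximation of SimpleAnalyzer: keep runs of letters only, lowercased."""
    # [^\W\d_]+ matches one or more Unicode letters (no digits, no underscore).
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


def standard_like_tokens(text: str) -> list[str]:
    """Rough stand-in for StandardAnalyzer: keep runs of letters and digits, lowercased."""
    # [^\W_]+ matches one or more Unicode letters or digits.
    return [t.lower() for t in re.findall(r"[^\W_]+", text)]


if __name__ == "__main__":
    print(simple_like_tokens("3 stars"))    # ['stars']
    print(standard_like_tokens("3 stars"))  # ['3', 'stars']
```

With the letter-only tokenizer, the token 3 never reaches the index, which is why a search for "3" found nothing before the fix.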
Addressing Partial Matches
An additional concern involved cases where a search for 3 stars also returns tuples containing 4 stars.
This occurs because Lucene's StandardAnalyzer tokenizes and lowercases both the field values and the queries (it does not stem). For instance:
- "3 stars" is tokenized as ["3", "stars"]
- "4 stars" is tokenized as ["4", "stars"]
When searching for 3 stars without quotes, Lucene generates a query over both tokens, "3" and "stars", and a tuple matches if it contains either one. Since "4 stars" contains the token "stars", it partially matches the query and is included in the results.
To ensure exact matches, users should wrap the query in double quotes ("3 stars"). Lucene then treats the input as a phrase, so the tokens must appear together and in order instead of matching individually.
https://github.com/user-attachments/assets/7be9ca50-1d87-4ff3-89d8-1ac28ef42e0c
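The partial-match behavior can be illustrated with a small stand-alone sketch. It is plain Python, not Lucene, and it rests on two assumptions: an unquoted query matches a value that shares at least one token with it, and a quoted query behaves like a phrase match in which the tokens must be adjacent and in order. The helper names (`tokens`, `matches_any_term`, `matches_phrase`) and the sample rows are hypothetical.

```python
import re


def tokens(text: str) -> list[str]:
    """Rough stand-in for StandardAnalyzer: runs of letters and digits, lowercased."""
    return [t.lower() for t in re.findall(r"[^\W_]+", text)]


def matches_any_term(query: str, value: str) -> bool:
    """Unquoted query: the value matches if it shares at least one token with the query."""
    return bool(set(tokens(query)) & set(tokens(value)))


def matches_phrase(query: str, value: str) -> bool:
    """Quoted query: the query tokens must appear in the value adjacent and in order."""
    q, v = tokens(query), tokens(value)
    if not q:
        return False
    # Slide a window of len(q) over the value's tokens and compare.
    return any(v[i:i + len(q)] == q for i in range(len(v) - len(q) + 1))


if __name__ == "__main__":
    rows = ["3 stars", "4 stars", "rated 3 stars overall"]
    print([r for r in rows if matches_any_term("3 stars", r)])
    # ['3 stars', '4 stars', 'rated 3 stars overall']
    print([r for r in rows if matches_phrase("3 stars", r)])
    # ['3 stars', 'rated 3 stars overall']
```

The unquoted search returns "4 stars" because of the shared token "stars", while the phrase-style search excludes it.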