Fix Keyword Search Operator for Numeric Tokens and Improve Exact Matching Behavior

kunwp1 commented 1 day ago

This PR addresses an issue with the keyword search operator when the search term contains a digit. The design of the keyword search operator can be found here.

Background

Previously, the keyword search operator used Lucene's SimpleAnalyzer. Its tokenization strategy caused problems with numeric tokens: SimpleAnalyzer splits the input at every non-letter character and lowercases the result, so digits and special characters are never emitted as tokens, even when they sit inside a mixed token such as "42B".

For example:

Input: "3 stars"
Tokenized: ["stars"]

The number 3 is dropped because it is not a letter. As a result, standalone numeric keywords such as "3" could not be searched, because they were never indexed.
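
The letter-only splitting can be modeled in a few lines of plain Python. This is a simplified stand-in for SimpleAnalyzer's behavior, not Lucene's implementation, and the function name `simple_analyzer_tokens` is hypothetical:

```python
import re

def simple_analyzer_tokens(text):
    """Approximate SimpleAnalyzer: keep maximal runs of letters, lowercased."""
    # [^\W\d_] matches a word character that is neither a digit nor "_",
    # i.e. a letter. Digits and punctuation therefore act as separators.
    return re.findall(r"[^\W\d_]+", text.lower())

print(simple_analyzer_tokens("3 stars"))   # ['stars']
print(simple_analyzer_tokens("Room 42B"))  # ['room', 'b']
```

Under this model the token "3" never reaches the index, so a search for it cannot match anything.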

Fix

To resolve this issue, the operator now uses StandardAnalyzer instead of SimpleAnalyzer. StandardAnalyzer retains numeric tokens during analysis, so standalone numbers like 3 are indexed and searchable.
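
As a rough illustration of why retaining digits makes "3" searchable, the sketch below builds a tiny inverted index in plain Python. The tokenizer is only an approximation of StandardAnalyzer (which follows Unicode word-break rules and, for example, keeps "3.5" as one token), and `standard_like_tokens` and `build_index` are hypothetical helper names:

```python
import re
from collections import defaultdict

def standard_like_tokens(text):
    """Rough stand-in for StandardAnalyzer: lowercase runs of letters/digits."""
    return re.findall(r"\w+", text.lower())

def build_index(docs, tokenize):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

docs = ["3 stars", "4 stars", "great value"]
index = build_index(docs, standard_like_tokens)

print(sorted(index["3"]))      # [0]
print(sorted(index["stars"]))  # [0, 1]
```

Because "3" is now a token in its own right, it gets its own posting list and can be looked up directly.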

I conducted a performance test using 4.5M tuples (1.09 GB) of data with the keyword "happy well":

StandardAnalyzer: 1 minute 19 seconds
SimpleAnalyzer: 1 minute 21 seconds

There is no significant difference in performance between the two.

Addressing Partial Matches

An additional concern involved cases like:

Why does "3 stars" also match "4 stars"?

This occurs because Lucene's StandardAnalyzer tokenizes both the field values and the query (it lowercases tokens but does not stem them), and an unquoted multi-term query matches any field that contains at least one of its terms. For instance:

"3 stars" is tokenized as ["3", "stars"]
"4 stars" is tokenized as ["4", "stars"]

When searching for 3 stars without quotes, Lucene generates a query that matches either token, "3" or "stars". Since "4 stars" contains the token "stars", it partially matches the query and is included in the results.

To ensure exact matches, users should wrap the query in double quotes ("3 stars"). In Lucene's query syntax the quoted text is still analyzed, but it is treated as a phrase: the terms must appear next to each other and in the given order, so "4 stars" no longer matches.
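
The difference between the unquoted and quoted forms can be sketched in plain Python. The sketch models the quoted form as a phrase match, which is how Lucene's classic query syntax treats double quotes; that the operator behaves exactly this way is an assumption, and `tokens`, `matches_any`, and `matches_phrase` are hypothetical helpers rather than Lucene APIs:

```python
import re

def tokens(text):
    """Rough stand-in for StandardAnalyzer: lowercase runs of letters/digits."""
    return re.findall(r"\w+", text.lower())

def matches_any(query, field_value):
    """Unquoted query: match if the field shares at least one token (OR)."""
    return bool(set(tokens(query)) & set(tokens(field_value)))

def matches_phrase(phrase, field_value):
    """Quoted query: the tokens must occur consecutively and in order."""
    wanted, have = tokens(phrase), tokens(field_value)
    n = len(wanted)
    if n == 0:
        return False
    return any(have[i:i + n] == wanted for i in range(len(have) - n + 1))

for value in ["3 stars", "4 stars", "rated 3 stars overall"]:
    print(value, matches_any("3 stars", value), matches_phrase("3 stars", value))
# 3 stars True True
# 4 stars True False
# rated 3 stars overall True True
```

Only the phrase check rejects "4 stars", which mirrors the behavior users see when they add the quotes.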

https://github.com/user-attachments/assets/7be9ca50-1d87-4ff3-89d8-1ac28ef42e0c

Yicong-Huang commented 1 day ago

Thanks for the PR! The reason and the fix make sense. I see you added test cases, that's really helpful!

Two concerns:

1. Are there any benefits to using SimpleAnalyzer? If so, do we want users to be able to choose between different search modes?
2. Is StandardAnalyzer efficient enough? Can you compare the search time of the two analyzers on 10K tuples?

kunwp1 commented 1 day ago

The main difference between SimpleAnalyzer and StandardAnalyzer lies in how they handle numeric tokens: SimpleAnalyzer discards them, whereas StandardAnalyzer keeps them. I believe offering SimpleAnalyzer as an option is unnecessary, since most users expect numbers to be included in keyword searches, as in common search engines like Google.

I conducted a performance test using 4.5M tuples (1.09 GB) of data with the keyword "happy well":
- StandardAnalyzer: 1 minute 19 seconds
- SimpleAnalyzer: 1 minute 21 seconds

There is no significant difference in performance between the two.

Yicong-Huang commented 1 day ago

Great! Please include some of that information in the PR description.

Texera / texera

Fix Keyword Search Operator for Numeric Tokens and Improve Exact Matching Behavior #3106
