Texera / texera

Collaborative Machine-Learning-Centric Data Analytics Using Workflows
https://texera.github.io
Apache License 2.0

Fix Keyword Search Operator for Numeric Tokens and Improve Exact Matching Behavior #3106

Closed kunwp1 closed 1 day ago

kunwp1 commented 1 day ago

This PR addresses an issue with the keyword search operator when the search term contains a digit. The design of the keyword search operator can be found here.

Background

Previously, the keyword search operator used the SimpleAnalyzer provided by Lucene. However, this analyzer's tokenization strategy caused problems with numeric tokens. Specifically, SimpleAnalyzer splits the input at every non-letter character and lowercases the resulting tokens, so digits and special characters never reach the index, and a standalone number such as 3 is discarded entirely.

For example:

Fix

To resolve this issue, the analyzer was updated to use the StandardAnalyzer instead of the SimpleAnalyzer. The StandardAnalyzer retains numeric tokens during analysis, ensuring that standalone numbers like 3 are properly indexed and searchable.
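
The difference between the two tokenization strategies can be sketched without Lucene. The class below is a simplified stand-in, not Lucene's implementation: `TokenizerSketch`, `letterTokens`, and `letterOrDigitTokens` are hypothetical names, and the letter-or-digit rule only approximates StandardAnalyzer on plain ASCII text (the real analyzer follows Unicode word-break rules).

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for the two tokenization rules discussed above.
// These are NOT Lucene classes; they only mimic the behavior that matters here.
class TokenizerSketch {

    // Split at every non-letter and lowercase: roughly what SimpleAnalyzer does.
    static List<String> letterTokens(String text) {
        return split(text, false);
    }

    // Keep letters and digits: an approximation of StandardAnalyzer on ASCII input.
    static List<String> letterOrDigitTokens(String text) {
        return split(text, true);
    }

    private static List<String> split(String text, boolean keepDigits) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            boolean keep = Character.isLetter(c) || (keepDigits && Character.isDigit(c));
            if (keep) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A separator ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(letterTokens("3 stars"));        // prints [stars]
        System.out.println(letterOrDigitTokens("3 stars")); // prints [3, stars]
    }
}
```

With the letter-only rule, the query "3 stars" loses its digit before it is ever compared against the index, which is the behavior this PR removes.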

I conducted a performance test using 4.5M tuples (1.09GB) of data with the keyword "happy well":

  • StandardAnalyzer: 1 minute 19 seconds
  • SimpleAnalyzer: 1 minute 21 seconds

There’s no significant difference in performance between the two.

Addressing Partial Matches

An additional concern involved cases like:

This occurs because Lucene's StandardAnalyzer tokenizes and lowercases both the field values and the queries (it does not apply stemming), so an unquoted query is matched token by token. For instance:

To ensure exact matches, users should wrap the query in double quotes ("3 stars"). Lucene then treats the input as a phrase, which matches only when its tokens appear next to each other and in the same order in the field value.
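
The effect of quoting can be illustrated with a small model of term matching versus phrase matching. This is a sketch under stated assumptions, not the operator's code: `PhraseMatchSketch` and its methods are hypothetical, the tokenizer is a rough approximation of StandardAnalyzer, and the unquoted case is assumed to match when any query token appears in the field.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical model of unquoted (term) matching versus quoted (phrase) matching.
class PhraseMatchSketch {

    // Lowercase and split on anything that is not a letter or a digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Unquoted query (assumed OR semantics): one shared token is enough.
    static boolean matchesAnyToken(String field, String query) {
        List<String> fieldTokens = tokenize(field);
        for (String q : tokenize(query)) {
            if (fieldTokens.contains(q)) {
                return true;
            }
        }
        return false;
    }

    // Quoted query: the query tokens must occur consecutively and in order.
    static boolean matchesPhrase(String field, String phrase) {
        return Collections.indexOfSubList(tokenize(field), tokenize(phrase)) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(matchesAnyToken("4 stars hotel", "3 stars")); // prints true
        System.out.println(matchesPhrase("4 stars hotel", "3 stars"));   // prints false
        System.out.println(matchesPhrase("Rated 3 stars", "3 stars"));   // prints true
    }
}
```

Under these assumptions, the unquoted query matches "4 stars hotel" through the shared token "stars", while the phrase form matches only a value that contains "3" immediately followed by "stars".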

https://github.com/user-attachments/assets/7be9ca50-1d87-4ff3-89d8-1ac28ef42e0c

Yicong-Huang commented 1 day ago

Thanks for the PR! The reason and the fix make sense. I see you added test cases, that's really helpful!

Two concerns:

  1. Are there any benefits of using SimpleAnalyzer? If so, do we want users to be able to choose between different search modes?
  2. Is StandardAnalyzer efficient enough? Can you compare the search time of the two analyzers against 10K tuples?

kunwp1 commented 1 day ago

  1. The main difference between SimpleAnalyzer and StandardAnalyzer lies in how they handle numeric and special character searches. SimpleAnalyzer ignores these, whereas StandardAnalyzer includes them. I believe offering SimpleAnalyzer may not be necessary since, for most users, it’s intuitive to include numbers in keyword searches, as in common search engines like Google.

  2. I conducted a performance test using 4.5M tuples (1.09GB) of data with the keyword "happy well":

    • StandardAnalyzer: 1 minute 19 seconds
    • SimpleAnalyzer: 1 minute 21 seconds

     There’s no significant difference in performance between the two.

Yicong-Huang commented 1 day ago

Great! Please include some of that information in the PR description.