apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Text-index does not support multi-token substring search where first and last tokens are partial #10863

Open jackluo923 opened 1 year ago

jackluo923 commented 1 year ago

We've seen many cases where a user wants to search for a substring in a field with a text index. If all of the tokens in the query are complete words, we can directly use a phrase search:

    SELECT * FROM table WHERE text_match("col", '"substring match query"')

However, if the first or last token is a partial word (e.g., "string match que"), the query will not return any results. Treating the query as a regex text-match query does not work either, as Pinot only supports regex matching on a single token. To work around the limitation, we can use a query like this:

    SELECT * FROM table WHERE text_match("col", '/.*string/ AND match AND /que.*/') AND "col" LIKE '%string match que%'

But this is very slow and computationally expensive because of the LIKE clause, which is necessary to validate the order of the tokens.
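To illustrate why the LIKE clause is needed in the workaround, here is a minimal Python sketch. It is not Pinot or Lucene code: the `tokens` helper is a crude stand-in for an analyzer, and `index_filter` and `like_filter` are hypothetical names that only mimic the two halves of the WHERE clause. The point is that the AND of per-token patterns ignores token order, so a second, order-aware check is needed to drop false positives.

```python
import re

def tokens(text):
    # Crude stand-in for a text analyzer (assumption): lowercase and
    # split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())

def index_filter(doc):
    # Mimics '/.*string/ AND match AND /que.*/': every clause may be
    # satisfied by any token, at any position in the document.
    toks = tokens(doc)
    return (any(t.endswith("string") for t in toks)
            and "match" in toks
            and any(t.startswith("que") for t in toks))

def like_filter(doc):
    # Mimics LIKE '%string match que%': a plain substring scan that
    # enforces the order and adjacency of the tokens.
    return "string match que" in doc

docs = [
    "a substring match query example",
    "query the match of a substring",
]

# The token-level filter accepts both documents, including the one
# where the tokens appear in the wrong order.
candidates = [d for d in docs if index_filter(d)]

# Only the substring scan removes the out-of-order document.
results = [d for d in candidates if like_filter(d)]
```

In this sketch the substring scan has to run over every candidate the token filter lets through, which matches the cost concern described above.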

chenboat commented 1 year ago

This Stack Overflow question suggests that Elasticsearch (built on top of Lucene) supports a similar substring search feature. https://stackoverflow.com/questions/44791075/in-elasticsearch-how-do-i-search-for-an-arbitrary-substring

cc @atris @siddharthteotia

hpvd commented 3 weeks ago

@jackluo923 does this merged feature solve this issue for you? If so, we could close this issue...

jackluo923 commented 3 weeks ago

@hpvd We'll likely back out the merged feature and enable a similar feature via a run-time configurable analyzer & query parser, as proposed in this PR. There are a couple of reasons for this:

  1. There are some minor bugs and limitations with PR #12680 that cannot be fixed.
  2. In production, there are many cases where the analyzer needs to be adjusted, and a custom query parser needs to accompany it.

We have been using PR #13003 in production over petabytes of data without issues and will work towards closing the PR and this issue soon.

hpvd commented 3 weeks ago

Awesome! Many thanks for the detailed description of the current state and experience!

chenboat commented 3 weeks ago

PR #12680 enabled support for wildcard (including prefix and suffix) matching for the terms in a phrase search. @hpvd you can follow https://docs.pinot.apache.org/basics/indexing/text-search-support#phrase-search-with-wildcard-term-matching to test the feature. We tested it with Pinot's default analyzer on common phrase searches with wildcard terms. If you find any issues, please let us know. As this feature is new and we leverage the Lucene index (which was originally designed for keyword search) for regex search, we cannot rule out corner cases given the complexity of regex.
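As a rough illustration of what prefix and suffix matching inside a phrase search means, here is a small Python sketch. It is not the implementation from PR #12680 and does not use Lucene; `tokenize` and `partial_phrase_match` are hypothetical helpers, and the tokenizer is an assumed stand-in for an analyzer. The idea is to slide over the document's token positions, letting the first query token match as a suffix and the last as a prefix, while the middle tokens must match exactly.

```python
import re

def tokenize(text):
    # Assumed stand-in for an analyzer: lowercase, split on
    # non-alphanumeric characters.
    return re.findall(r"[a-z0-9]+", text.lower())

def partial_phrase_match(doc, query):
    # True if the query tokens occur at consecutive positions in doc,
    # where the first query token may be the tail of a word and the
    # last query token may be the head of a word.
    d, q = tokenize(doc), tokenize(query)
    if not q:
        return False
    if len(q) == 1:
        # A single partial token may sit anywhere inside a word.
        return any(q[0] in t for t in d)
    n = len(q)
    for i in range(len(d) - n + 1):
        window = d[i:i + n]
        if (window[0].endswith(q[0])            # suffix match on first term
                and window[1:-1] == q[1:-1]     # exact match on middle terms
                and window[-1].startswith(q[-1])):  # prefix match on last term
            return True
    return False

in_order = partial_phrase_match("a substring match query example",
                                "string match que")
out_of_order = partial_phrase_match("query the match of a substring",
                                    "string match que")
```

Because the check walks token positions, it enforces order without a separate substring scan over the raw text. A real index would use stored term positions rather than re-tokenizing each document.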