apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

text_match operator fails to execute query containing stop words #10865

Open jackluo923 opened 1 year ago

jackluo923 commented 1 year ago

Lucene strips stop words and symbols before indexing, but Pinot does not seem to do the same when running queries against a text index. As a result, a query like SELECT * FROM table WHERE text_match("col", '"function not in list"') will not return any result if the words "not" and "in" are stop words that were stripped out during ingestion. A temporary workaround is to exclude all stop words in the index config.

Jackie-Jiang commented 1 year ago

cc Lucene experts: @atris @siddharthteotia

atris commented 1 year ago

I am trying to understand this -- is the ask that stop words be not removed during analysis (i.e. allow pluggable analysers)?

jackluo923 commented 1 year ago

The ask is that

  1. If the default list of stop words is used during ingestion, remove the default list of stop words from the query at query time
  2. If all stop words are excluded, we should not remove any stop words from the query
  3. If a customized list of stop words is in effect, remove only that customized list of stop words from the query
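The three cases above boil down to "strip from the query exactly what the index stripped". A minimal Python sketch of that rule follows. It is not Pinot code: the function name, the `include`/`exclude` parameters, and the contents of the default list are assumptions made for illustration and should be checked against the Pinot version in use.

```python
# Assumed default stop word list; verify against the Pinot version in use.
DEFAULT_STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "than", "there", "these", "they", "this", "to",
    "was", "will", "with", "those",
})


def effective_stop_words(include=(), exclude=()):
    """Return the stop words the index stripped at ingestion time.

    Hypothetical helper: `include` adds words to the default list and
    `exclude` removes words from it. The query should drop this set only.
    """
    words = set(DEFAULT_STOP_WORDS)
    words |= {w.lower() for w in include}
    words -= {w.lower() for w in exclude}
    return words


# Case 1: default config, so the query should drop the default list.
case1 = effective_stop_words()
# Case 2: every default stop word excluded, so the query drops nothing.
case2 = effective_stop_words(exclude=DEFAULT_STOP_WORDS)
# Case 3: customized list, so the query drops only what is still in effect.
case3 = effective_stop_words(exclude=["and"])
```

With this shape, query-time handling needs no knowledge of which case applies; it only needs the set that the index config resolves to.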

Currently, stop words are not removed from the query at all, which means that a query containing stop words will not match any results. To give you a concrete example, let's use the input example provided in Pinot's documentation with the default text-index ingestion configs:

Distributed systems, Java, C++, Go, distributed query engines for analytics and data warehouses, Machine learning, spark, Kubernetes, transaction processing, Java, Python, C++, Machine learning, building and deploying large scale... CUDA, GPU processing, Tensor flow ...

With the above input, the following query from Pinot's documentation would return a match:

SELECT SKILLS_COL 
FROM MyTable 
WHERE TEXT_MATCH(SKILLS_COL, '"Machine learning" AND "gpu processing"')

However, the following query would not return any match for the same input because the query contains the stop words "for" and "and":

SELECT SKILLS_COL 
FROM MyTable 
WHERE TEXT_MATCH(SKILLS_COL, '"query engines for analytics" AND "building and deploying"')

Removing the stop words from the query should allow it to return results:

SELECT SKILLS_COL 
FROM MyTable 
WHERE TEXT_MATCH(SKILLS_COL, '"query engines analytics" AND "building deploying"')

The current workaround for this bug is either to remove the stop words from the query externally or to disable stop words entirely. If there's no plan to fix this bug, we should modify this claim in the Pinot documentation:

Any occurrence of these words in will be ignored by the tokenizer during index creation and search.

to

Any occurrence of these words in will be ignored by the tokenizer during index creation but not during search.
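For the first workaround (removing stop words from the query externally), a client-side helper along these lines might be enough. This is only a sketch: `strip_stop_words` is a hypothetical name, the stop word set passed in is a small illustrative subset, and the helper only looks at double-quoted phrases, leaving operators such as AND outside the quotes untouched.

```python
import re


def strip_stop_words(expr, stop_words):
    """Drop stop words inside each double-quoted phrase of a text_match expression.

    Text outside the quotes (for example the AND / OR operators) is left as is.
    A phrase made up only of stop words becomes an empty "" and needs
    separate handling by the caller.
    """
    lowered = {w.lower() for w in stop_words}

    def clean(match):
        kept = [t for t in match.group(1).split() if t.lower() not in lowered]
        return '"' + " ".join(kept) + '"'

    return re.sub(r'"([^"]*)"', clean, expr)


# Illustrative subset only; use the list that matches the index config.
stop_words = {"for", "and", "not", "in"}
rewritten = strip_stop_words(
    '"query engines for analytics" AND "building and deploying"', stop_words
)
print(rewritten)
```

The rewritten expression can then be passed as the second argument of TEXT_MATCH in place of the original one.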