Open jackluo923 opened 1 year ago
cc Lucene experts: @atris @siddharthteotia
I am trying to understand this -- is the ask that stop words not be removed during analysis (i.e. allow pluggable analysers)?
The ask is that stop words be removed from the query at search time, the same way they are removed during index creation.
Currently, stop words are not removed from the query at all, which means that if a query contains stop words, it will not match any results. To give you a concrete example, let's use the input example provided in Pinot's documentation with the default text-index ingestion configs:
Distributed systems, Java, C++, Go, distributed query engines for analytics and data warehouses, Machine learning, spark, Kubernetes, transaction processing, Java, Python, C++, Machine learning, building and deploying large scale... CUDA, GPU processing, Tensor flow ...
With the above input, the following query from Pinot's documentation would return a match:
SELECT SKILLS_COL
FROM MyTable
WHERE TEXT_MATCH(SKILLS_COL, '"Machine learning" AND "gpu processing"')
However, the following query would not return any match for the same input because the query contains the stop words "for" and "and":
SELECT SKILLS_COL
FROM MyTable
WHERE TEXT_MATCH(SKILLS_COL, '"query engines for analytics" AND "building and deploying"')
Removing the stop words from the query should allow it to return results:
SELECT SKILLS_COL
FROM MyTable
WHERE TEXT_MATCH(SKILLS_COL, '"query engines analytics" AND "building deploying"')
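For anyone who needs to do this on the client side in the meantime, here is a minimal sketch of stripping stop words from the quoted phrases before the query string is sent to Pinot. The helper name `strip_stop_words` and the small `STOP_WORDS` set are hypothetical and only for illustration; the set should be replaced with whatever stop-word list the index was actually built with.

```python
import re

# Illustrative stop-word set (an assumption, not Pinot's actual default list).
# Replace it with the list your text index was built with.
STOP_WORDS = {"a", "an", "and", "for", "in", "not", "of", "the", "to"}


def strip_stop_words(text_query, stop_words=STOP_WORDS):
    """Drop stop words inside double-quoted phrases of a TEXT_MATCH expression.

    Anything outside the quotes (for example the AND / OR operators) is left
    untouched, so a boolean AND is never mistaken for the stop word "and".
    """

    def clean(match):
        # Keep only the words of the phrase that are not stop words.
        kept = [w for w in match.group(1).split() if w.lower() not in stop_words]
        return '"' + " ".join(kept) + '"'

    # Rewrite each "..." phrase independently.
    return re.sub(r'"([^"]*)"', clean, text_query)


expr = '"query engines for analytics" AND "building and deploying"'
print(strip_stop_words(expr))
```

With the illustrative set above, this rewrites the failing expression into the one that matches. Note that a phrase made up entirely of stop words would be reduced to an empty phrase, which the caller would have to handle separately.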
The current workaround for this bug is either to remove the stop words from the query externally or disable stop words entirely. If there's no plan to fix this bug, we should modify this claim in the Pinot documentation:
Any occurrence of these words in will be ignored by the tokenizer during index creation and search.
to
Any occurrence of these words in will be ignored by the tokenizer during index creation but not during search.
Lucene strips away stop words and symbols prior to indexing but it seems like Pinot doesn't do the same when running queries on a text index. As a result, a query like:
SELECT * FROM table WHERE text_match("col", '"function not in list"')
will not return any result if the words "not" and "in" are stop words that were stripped out during ingestion. A temporary workaround is to exclude all stop words in the index config.