apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Support for Native Text Indexing in Pinot #7395

Open atris opened 3 years ago

atris commented 3 years ago

Build a fully functional text search engine on top of native Pinot indices, allowing exact matches, prefix and suffix matches, substring matches and regular expressions.

Performance of the text search component (automaton, matcher and FST) should be comparable or better than Lucene's FST, matcher and automaton.

Build the engine using core Pinot capabilities and have it deeply integrated with Pinot's core components.

Allow the library to be reusable across Pinot.

Allow the library to be extensible without requiring application changes.

Please see:

https://docs.google.com/document/d/1PMhoRy6WF46C4d4mw0LVe9b8Vjqes6vsXZkmxXzMYzw/edit#heading=h.krgi6ulfrbxj

kkrugler commented 3 years ago

As part of the design, it would be great to see details on how different analysis chains can be specified (e.g. based on target language, for a column).

It would also be great to plan for supporting dynamic analysis chains, typically based on language classification (either manual or dynamic), per row, as that solves a major problem for many text search use cases.

siddharthteotia commented 3 years ago

Thanks @atris. I would like to review. Please give me a couple of days to go through the doc.

siddharthteotia commented 3 years ago

High level question

After we added the Lucene-based text index, an FST text index was added that uses the Lucene FST libraries purely for regex searches, because the former took more storage when the user only wanted regex searches and not term, phrase, etc.

For the current proposal of native text indexes, are we planning to re-implement all of the Lucene search libraries within Pinot?

kishoreg commented 3 years ago

Looking at the doc and PR, it looks like only the FST is being implemented. Pinot already has the posting list and the ability to evaluate boolean expressions over posting lists. Is there anything else?

atris commented 3 years ago

No, the FST and the regexp automaton are the only two things. The rest of the operations will be performed by Pinot indices.
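To make the division of labour described here concrete, the following is a minimal sketch, not Pinot's implementation. It assumes a sorted term dictionary and one posting list per dictionary ID, and it uses a linear scan with Python's `re` module as a stand-in for the FST and regexp automaton. The names `build_index` and `regex_docs` are hypothetical.

```python
import re


def build_index(docs):
    """Build a sorted term dictionary and one posting list per dictionary ID.

    Each posting list is the set of document IDs containing that term.
    Tokenization is plain whitespace splitting (an assumption for this sketch).
    """
    terms = sorted({tok for doc in docs for tok in doc.split()})
    dict_id = {term: i for i, term in enumerate(terms)}
    postings = [set() for _ in terms]
    for doc_id, doc in enumerate(docs):
        for tok in doc.split():
            postings[dict_id[tok]].add(doc_id)
    return terms, postings


def regex_docs(terms, postings, pattern):
    """Return the doc IDs containing at least one term that fully matches `pattern`.

    The scan over `terms` is the step an FST/automaton would accelerate;
    the OR over posting lists is the part existing indices already handle.
    """
    compiled = re.compile(pattern)
    result = set()
    for i, term in enumerate(terms):
        if compiled.fullmatch(term):
            result |= postings[i]
    return result
```

Boolean combinations then reduce to set operations on the results, for example `regex_docs(terms, postings, "err.*") & regex_docs(terms, postings, "serv.*")` for an AND of two patterns.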

On Fri, 10 Sep 2021, 01:10, Kishore Gopalakrishna wrote:

Looking at the doc and PR, it looks like only the FST is being implemented. Pinot already has the posting list and the ability to evaluate boolean expressions over posting lists. Is there anything else?


siddharthteotia commented 3 years ago

Wanted to understand a few things better.

IIUC, this is our current state

In both of the above cases, we get the FST and regexp automaton as part of using Lucene. We also advise users not to use the Lucene text index if they want exact matches, since Pinot's native inverted index is much faster for those. When we say we are implementing a native FST index, what exact functionality are we adding and/or improving? This is not clear in the design doc. The doc talks about control/flexibility and potential future improvements, but these are a bit vague IMHO, and a few more details could be added in those sections.

My guess is that this is about improving phrase, regex and fuzzy search by building a native FST index that works on top of Pinot's existing native structures: the inverted index and the dictionary. So it seems a bridge is missing between Pinot's native inverted index and dictionary structure and the Lucene FST. Is this correct? If so, can this not be achieved by continuing to use the Lucene FST library, as we already do for the Lucene FST index, rather than putting it into Pinot?

Also, how will this new work differ from what the Lucene FST index currently offers in terms of functionality and performance? There are some performance charts, but if I am reading them right, the improvement seems marginal.

Also, thanks for clarifying in the doc that this work won't regress the TEXT_MATCH functionality (query syntax etc.) or performance. If we go ahead with this new work, I think the end state should not include a mandatory step of removing the current Lucene text index and TEXT_MATCH. If someone wants to migrate, there should be a migration path. The rest of the users can continue to use what we have today.

atris commented 3 years ago

Thanks for reviewing the document, @siddharthteotia !

Here are my thoughts:

Current text search infrastructure: Today, we simply build sidecar Lucene indices and expose a UDF that lets users specify Lucene queries. IMO, this component should ideally live outside Pinot, since it has no correlation with Pinot itself.

So, an eventual goal is to move text search to native Pinot indices and dictionary, and follow the SQL Standard (LIKE operator) syntax as much as possible.
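As an illustration of what following the LIKE syntax could involve, here is a minimal sketch of translating a SQL LIKE pattern into a regular expression. It is an assumption-laden example rather than Pinot's actual translation: the function names `like_to_regex` and `like` are hypothetical, and LIKE escape characters are not handled.

```python
import re


def like_to_regex(pattern: str) -> str:
    """Translate a SQL LIKE pattern into a regex meant for full-string matching.

    '%' matches any run of characters (including none), '_' matches exactly
    one character, and every other character is matched literally.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def like(value: str, pattern: str) -> bool:
    # fullmatch anchors the regex at both ends; DOTALL lets wildcards span newlines.
    return re.fullmatch(like_to_regex(pattern), value, re.DOTALL) is not None
```

Under this translation, a prefix query such as `pin%` becomes a literal prefix followed by match-all, and a suffix query such as `%not` becomes match-all followed by a literal suffix.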

Now, coming to the FST itself. There are three reasons as to why a native FST makes sense:

  1. Flexibility and Control -- Lucene is a full-fledged search library. It is built for generic text search use cases and includes capabilities such as ranked retrieval, norm storage and impact filtering, to name a few. None of these are relevant to us, since we do not perform ranking. As I mentioned before, if we are building our text search capabilities on top of Pinot data structures, then pulling in Lucene just for the FST is overkill, and it also stops us from making any changes that we may wish to make. Lucene's FST is a generic engine, not optimized for our use cases (only dictionary IDs as output symbols, primary query load being prefix and suffix matches from the LIKE operator). Other improvements may or may not come in later, but if we do not move to our native implementation, we remove the possibility of any such improvements.

  2. Ability to perform Pinot-specific optimizations -- As stated in the above point, it is not possible for us to make specific changes/enhancements to Lucene's FST. E.g., it should be possible to short-circuit the evaluation of regular expressions that consist of a short prefix followed by match-all, thus accelerating a common use case of the LIKE operator.

  3. Realtime Capabilities -- Lucene builds the FST during segment flush, thus forcing us to flush frequently. This also prevents us from doing real-time text search, which is a limitation. With a native FST implementation, we should be able to explore this path.
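The short-circuit idea in point 2 can be sketched as follows. This is one possible illustration and not the proposed implementation: it assumes a sorted term dictionary whose positions serve as dictionary IDs, and the names `prefix_dict_id_range` and `match_dict_ids` are hypothetical. A pattern of the form "literal prefix followed by `.*`" is answered with two binary searches, and anything else falls back to evaluating the regex against every term.

```python
import re
from bisect import bisect_left


def prefix_dict_id_range(sorted_dictionary, prefix):
    """Return (start, end) such that IDs in [start, end) are the terms starting with `prefix`."""
    start = bisect_left(sorted_dictionary, prefix)
    # From `start` onward, terms that begin with `prefix` form a contiguous
    # run, so a second binary search finds where that run ends.
    lo, hi = start, len(sorted_dictionary)
    while lo < hi:
        mid = (lo + hi) // 2
        if sorted_dictionary[mid].startswith(prefix):
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def match_dict_ids(sorted_dictionary, regex):
    """Return the dictionary IDs of the terms that fully match `regex`."""
    if regex.endswith(".*"):
        head = regex[:-2]
        # Conservative check: the head is a plain literal only if escaping
        # leaves it unchanged (no regex metacharacters).
        if re.escape(head) == head:
            start, end = prefix_dict_id_range(sorted_dictionary, head)
            return list(range(start, end))
    # General case: evaluate the regex on every term (the slow path).
    compiled = re.compile(regex)
    return [i for i, term in enumerate(sorted_dictionary) if compiled.fullmatch(term)]
```

The fast path costs two binary searches over the dictionary regardless of how many terms match, whereas the fallback touches every term.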

Regarding TEXT_MATCH, while it is my dearest wish to deprecate the module, I understand that some users may wish to keep using it. As highlighted, both indices can coexist, with no mandate to migrate from one to the other.

siddharthteotia commented 3 years ago

I followed up with @atris in the Slack channel to clarify a few additional things. Copying here for reference and visibility.


Can we all confirm the following? I am sorry to have asked this a couple of times in different threads in the doc, but since the doc still indicates some sort of migration ("Note that till completion of phase 4, we will be maintaining the existing text indices within Pinot."), I just want to make sure.

@atris 's response


Based on the above clarifications, I am OK with proceeding.

@amrishlal, @jackjlli please feel free to add any additional discussion notes.

siddharthteotia commented 2 years ago

Did a brief sync-up with @atris yesterday. We would like to try this out for functional and perf testing on our prod use case as soon as the phrase and term search part is complete. We will have to discuss/handle index conversion and query migration (if possible). We will collaborate during testing/rollout on any feature or perf gaps.

cc @vvivekiyer @jasperjiaguo