Support for Native Text Indexing in Pinot

atris commented 3 years ago

Build a fully functional text search engine on top of native Pinot indices, allowing exact matches, prefix and suffix matches, substring matches and regular expressions.

Performance of the text search component (automaton, matcher and FST) should be comparable or better than Lucene's FST, matcher and automaton.

Build the engine using core Pinot capabilities and have it deeply integrated with Pinot's core components

Allow the library to be reusable across Pinot.

Allow the library to be extensible without requiring application changes

Please see:

https://docs.google.com/document/d/1PMhoRy6WF46C4d4mw0LVe9b8Vjqes6vsXZkmxXzMYzw/edit#heading=h.krgi6ulfrbxj

kkrugler commented 3 years ago

As part of the design, it would be great to see details on how different analysis chains can be specified (e.g. based on target language, for a column).

It would also be great to plan for supporting dynamic analysis chains, typically based on language classification (either manual or dynamic), per row, as that solves a major problem for many text search use cases.

siddharthteotia commented 3 years ago

Thanks @atris . I would like to review. Please give me couple of days to go through the doc.

siddharthteotia commented 3 years ago

High level question

After we added lucene based text index, there was FST text index added which uses Lucene FST libraries for purely regex searches as the former one took more storage if the user only wanted regex searches and not term, phrase etc.

For the current proposal of native text indexes, are we planning to re-implement all of Lucene libraries for search within Pinot ?

kishoreg commented 3 years ago

Looking at the doc and PR, it looks like only the FST is being implemented. Pinot already has the posting list and the ability to evaluate boolean expressions over posting lists. Is there anything else?

atris commented 3 years ago

No, FST and regexp automaton are the only two things - - rest of the operations will be performed by Pinot indices

On Fri, 10 Sep 2021, 01:10 Kishore Gopalakrishna, @.***> wrote:

Looking at the doc and PR, it looks like only the FST is being implemented. Pinot already has the posting list and the ability to evaluate boolean expressions over posting lists. Is there anything else?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/apache/pinot/issues/7395#issuecomment-916383166, or unsubscribe https://github.com/notifications/unsubscribe-auth/AANE5Y42X34HBC3VHHKYSCLUBEETDANCNFSM5DMMPHBQ .

siddharthteotia commented 3 years ago

Wanted to understand a few things better.

IIUC, this is our current state

Lucene text index - wrapper over Lucene to get phrase, term, regex etc search query functionality
Lucene FST index - this was added later (by Confluera I guess) to use FST libraries of Lucene to purely get regex search functionality from Lucene.

In both the above cases, we get FST and regexp automaton as part of using Lucene. We also advise users to not use Lucene text index if they want to do exact matches since Pinot's native inverted index is way faster for exact matches. When we say we are implementing native FST index, what exact functionality are we adding and/or improving ? This is not clear in the design doc. The doc talks about control/flexibility and potential future improvements but they are a bit vague IMHO and few more details can be added in those sections.

My guess is that this is about improving phrase, regex and fuzzy search by building a native FST index which can work on top of existing Pinot's native structures -- inverted index and dictionary. So it seems like a bridge is missing between Pinot's native inv index and dictionary structure and Lucene FST. Is this correct ? If so, can this not be achieved by continuing to use Lucene FST library as opposed to putting it into Pinot. Something we already do as part of Lucene FST index.

Also, how will this new work be different from what is currently offered by Lucene FST index in terms of functionality and performance. There are some performance charts but if I am reading them right, the improvement seems marginal.

Also, thanks for clarifying in the doc that this work won't regress the TEXT_MATCH functionality (query syntax etc) and performance. In case, we go ahead with this new work, I think from the end state, we should not have the mandatory step of removing current Lucene text index and TEXT_MATCH. If someone wants to migrate, there should be a migration path. Rest of the users can continue to use what we have today

atris commented 3 years ago

Thanks for reviewing the document, @siddharthteotia !

Here are my thoughts:

Current text search infrastructure: Status quo, we simply build side car Lucene indices and expose a UDF which allows users to specify Lucene queries. IMO, this is a component that should ideally be outside of Pinot since it has no correlation with Pinot itself.

So, an eventual goal is to move text search to native Pinot indices and dictionary, and follow the SQL Standard (LIKE operator) syntax as much as possible.

Now, coming to the FST itself. There are three reasons as to why a native FST makes sense:

Flexibility and Control -- Lucene is a full fledged search library. It is built for generic text search use cases and consists of capabilities which allow ranked retrieval, norm storage and impact filtering, to name a few capabilities. None of these are of relevance to us since we do not perform ranking. As I mentioned before, if we are building our text search capabilities on top of Pinot data structures, then pulling in Lucene just for the FST is an overkill, and also stops us from any potential changes that we may wish to do. Lucene's FST is a generic engine, not optimized for our use cases (only dictionary IDs as output symbols, primary query load being prefix and suffix matches from LIKE operator). Other improvements may or may not come in later, but if we do not move to our native implementation, we remove the possibility of any such improvements.
Ability to perform Pinot specific optimizations -- As stated in the above point, it is not possible for us to do specific changes/enhancements. For e.g., it should be possible to short circuit the evaluation of regular expressions ending with match-all and having a short prefix before the same, thus accelerating a common use case of LIKE operator.
Realtime Capabilities -- Lucene builds FST during segment flush, thus forcing us to flush frequently. Also, this inhibits us from doing real time text search, which is a limitation. With a native FST implementation, we should be able to explore this path.

Regarding TEXT_MATCH, while it is my dearest wish to deprecate the module, I understand that some users may wish to use it. As highlighted, both indices can co exist, with no mandate to migrate to one over the other.

siddharthteotia commented 3 years ago

I had followed up for clarifying few additional things with @atris in slack channel. Copying here for reference and visibility

Can we all confirm the following ? I am sorry to have asked this couple of times as part of different threads in the doc but since doc still indicates some sort of migration Note that till completion of phase 4, we will be maintaining the existing text indices within Pinot. I just want to make sure

Existing Lucene text index functionality offered via TEXT_MATCH will continue to work as is and is essentially untouched by this work
Both indexes can co-exist and we are not removing Lucene dependency ?
Upon segment reload, existing Lucene index can potentially be converted to new format (if need be). However, if someone wishes to do this, how will the query syntax used in TEXT_MATCH from lucene based remain compliant for native FST index (which I believe will follow SQL LIKE semantics). I am guessing the users will have to change queries if they wish to migrate ?
For the native FST index, the plan is to eventually support all kinds of searches -- phrase, term, regex, fuzzy etc. So for example, phrase search needs position info which I am not sure if it comes for free as part of FST. Regardless, all of that is the end state and comprehensive text search functionality will be available through this native index ?
- This is important for us because eventually (and this is a big eventual for us :slightly_smiling_face: ) we might want to migrate our production Li users from Lucene text index to native FST index if performance is better. I can't promise if that will happen as it will certainly be a lot of work (hence seeking confirmation that we are not removing anything). Our production users use a lot of phrase queries.
General question - are you planning to make this functionality available both via LIKE and TEXT_MATCH or want to keep it separate and just use LIKE ? Latter can also be overloaded as long as user docs clearly indicate that TEXT_MATCH can be used for both native and lucene text index
Request on code - since FST is like a black box (for me except for whatever I learned from paper and online presentations), can you please make sure that code is sufficiently documented and explains algorithm as and when needed. Initially, we were just relying on Lucene committers but now we will have to maintain. This will also help with easy review

@atris 's response

Yes, Lucene Indices and TEXT_MATCH will be completely untouched and unaffected by this effort.
No, we are not removing the dependency and both indices can coexist, oblivious of each other.
Here is the interesting one. Native FST can support all queries that Lucene does. However, since our indices do not store some metadata (such as positional index) that Lucene Indices do, we will have to implement custom operators on top of native FST. However, syntactically, native FST shall pose no challenges in that implementation. If there are specific operators outside of the four planned currently (regexp, like, phrase and fuzzy) that will be needed for users to migrate, I will be more than happy to support.
Yes, in the end state, comprehensive text search will be natively available.
I was actually not planning to overload TEXT_MATCH since it basically supports Lucene syntax, but rather have custom functions for phrase, fuzzy and regexp, and let the LIKE operator deal with the rest. However, there is no reason why we can't go down that route.
I completely agree. I have tried to document the code as elaborately as possible and also written supporting documents (e.g. On the Regexp compilation process). If there is more needed on specific areas, I will gladly write more :)

Based on above clarifications, I am ok with proceeding

@amrishlal , @jackjlli please feel free to add any additional discussion notes

siddharthteotia commented 2 years ago

Did a brief sync up with @atris yesterday. We would like to try this out for functional and perf testing on our prod use case as soon as the phrase and term search part is complete. Will have to discuss/ handle index conversion and query migration (if possible). We will collaborate during testing / rollout for any feature, perf gaps etc.

cc @vvivekiyer @jasperjiaguo

apache / pinot

Support for Native Text Indexing in Pinot #7395