apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0
5.2k stars 1.21k forks source link

Introduce Native Text Indices (Core Functionality) #7405

Closed atris closed 2 years ago

atris commented 2 years ago

https://docs.google.com/document/d/1PMhoRy6WF46C4d4mw0LVe9b8Vjqes6vsXZkmxXzMYzw/edit?usp=sharing

This PR implements Phase 1 (core functionality) of the given implementation plan.

Part of #7395

NOTE: This PR is raised to help move the design review forward. The acceptance of this PR is conditional to approval of the design document

siddharthteotia commented 2 years ago

Please give some time for design doc review. I had commented on the issue a couple of days ago to request time for review. Not sure if this has already been reviewed and approved

I would encourage to follow these guidelines since these were recently discussed and approved by PMC - https://docs.pinot.apache.org/developers/developers-and-contributors/contribution-guidelines#pinot-enhancement-proposal-workflow

kishoreg commented 2 years ago

I think he published the PR to help with the review of the design doc. As long as the PR does not get merged without the approval of the design, it should be ok.

@atris can you please update the PR description to indicate the same.

atris commented 2 years ago

Done, thanks

amrishlal commented 2 years ago

I would suggest creating a new pinot-fst module for the new FST implementation.

kishoreg commented 2 years ago

I would suggest creating a new pinot-fst module for the new FST implementation.

Not sure about this.. we don't create modules at root level for other indexing. Need to think carefully before creating modules at the root.. at some point, ideally, we should create plugin mechanisms for indexes and then create a module for each index.. we are not there yet

atris commented 2 years ago

I would suggest creating a new pinot-fst module for the new FST implementation.

Not sure about this.. we don't create modules at root level for other indexing. Need to think carefully before creating modules at the root.. at some point, ideally, we should create plugin mechanisms for indexes and then create a module for each index.. we are not there yet

Agreed. To add to Kishore's point, introducing new modules at root level increases the scope of jar conflicts. For e.g., in this PR, ImmutableFST uses the off heap bytes store in pinot-segment-local, thus creating a cyclic dependency. Until we do not have the plugin model, I would prefer this model.

amrishlal commented 2 years ago

I would suggest creating a new pinot-fst module for the new FST implementation.

Not sure about this.. we don't create modules at root level for other indexing. Need to think carefully before creating modules at the root.. at some point, ideally, we should create plugin mechanisms for indexes and then create a module for each index.. we are not there yet

It doesn't have to be a top-level module, but basically, I was looking for some way to sufficiently encapsulate this FST implementation using interfaces (in a separate jar if possible) and then use it within Pinot (?).

For e.g., in this PR, ImmutableFST uses the off heap bytes store in pinot-segment-local, thus creating a cyclic dependency.

Sounds like we need interfaces?

atris commented 2 years ago

It is already encapsulated within a package in Pinot-segment and has interfaces for interaction at each layer - please refer to the code

On Fri, 10 Sep 2021, 23:35 Amrish Lal, @.***> wrote:

I would suggest creating a new pinot-fst module for the new FST implementation.

Not sure about this.. we don't create modules at root level for other indexing. Need to think carefully before creating modules at the root.. at some point, ideally, we should create plugin mechanisms for indexes and then create a module for each index.. we are not there yet

It doesn't have to be a top-level module, but basically, I was looking for some way to sufficiently encapsulate this FST implementation using interfaces (in a separate jar if possible) and then use it within Pinot (?).

For e.g., in this PR, ImmutableFST uses the off heap bytes store in pinot-segment-local, thus creating a cyclic dependency. Sounds like we need interfaces?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/apache/pinot/pull/7405#issuecomment-917104588, or unsubscribe https://github.com/notifications/unsubscribe-auth/AANE5Y2H7WNUCTP3VX5OJIDUBJCE3ANCNFSM5DTCLYZQ .

mayankshriv commented 2 years ago

I would suggest creating a new pinot-fst module for the new FST implementation.

Not sure about this.. we don't create modules at root level for other indexing. Need to think carefully before creating modules at the root.. at some point, ideally, we should create plugin mechanisms for indexes and then create a module for each index.. we are not there yet

+1 to Kishore, we don't have (or create) new modules at root level for other indexing either.

mayankshriv commented 2 years ago

I think he published the PR to help with the review of the design doc. As long as the PR does not get merged without the approval of the design, it should be ok.

@atris can you please update the PR description to indicate the same.

Yes, I believe, the intention of the PR was primarily to help with design doc review. Because some aspects do require review beyond what a design doc is good at capturing. Let's get consensus on the design doc, followed by PR review.

codecov-commenter commented 2 years ago

Codecov Report

Merging #7405 (beb817e) into master (ac49978) will decrease coverage by 44.70%. The diff coverage is 0.00%.

:exclamation: Current head beb817e differs from pull request most recent head 0a2ba3f. Consider uploading reports for the commit 0a2ba3f to get more accurate results Impacted file tree graph

@@              Coverage Diff              @@
##             master    #7405       +/-   ##
=============================================
- Coverage     72.56%   27.86%   -44.71%     
=============================================
  Files          1523     1543       +20     
  Lines         75973    78625     +2652     
  Branches      11101    11671      +570     
=============================================
- Hits          55132    21911    -33221     
- Misses        17155    54738    +37583     
+ Partials       3686     1976     -1710     
Flag Coverage Δ
integration1 ?
integration2 27.86% <0.00%> (-1.05%) :arrow_down:
unittests1 ?
unittests2 ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...nt/local/utils/nativefst/ByteSequenceIterator.java 0.00% <0.00%> (ø)
...ment/local/utils/nativefst/ConstantArcSizeFST.java 0.00% <0.00%> (ø)
...pache/pinot/segment/local/utils/nativefst/FST.java 0.00% <0.00%> (ø)
.../pinot/segment/local/utils/nativefst/FSTFlags.java 0.00% <0.00%> (ø)
...pinot/segment/local/utils/nativefst/FSTHeader.java 0.00% <0.00%> (ø)
...ot/segment/local/utils/nativefst/FSTTraversal.java 0.00% <0.00%> (ø)
...ot/segment/local/utils/nativefst/ImmutableFST.java 0.00% <0.00%> (ø)
...not/segment/local/utils/nativefst/MatchResult.java 0.00% <0.00%> (ø)
...t/local/utils/nativefst/NativeFSTIndexCreator.java 0.00% <0.00%> (ø)
...nt/local/utils/nativefst/NativeFSTIndexReader.java 0.00% <0.00%> (ø)
... and 1103 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update ac49978...0a2ba3f. Read the comment docs.

richardstartin commented 2 years ago

This is looking quite promising to me 👍🏻

atris commented 2 years ago

This is looking quite promising to me 👍🏻

Thank you for reviewing! I will raise an iteration today which fixes your comments

siddharthteotia commented 2 years ago

Shall we reduce the testing/sample text file size? IMO keeping ~1000 words should be good enough. We don't want to increase the repo size too much because of these sample files

+1

Test files seem extremely huge and we should try to get coverage with considerably smaller datasets.

atris commented 2 years ago

Shall we reduce the testing/sample text file size? IMO keeping ~1000 words should be good enough. We don't want to increase the repo size too much because of these sample files

+1

Test files seem extremely huge and we should try to get coverage with considerably smaller datasets.

Fixed, thanks

mayankshriv commented 2 years ago

Are there any remaining items to be taken care of in this PR @atris @siddharthteotia @richardstartin @Jackie-Jiang? IMHO, since this does not impact existing functionality, perhaps we can document the TODOs here and followup?

atris commented 2 years ago

Are there any remaining items to be taken care of in this PR @atris @siddharthteotia @richardstartin @Jackie-Jiang? IMHO, since this does not impact existing functionality, perhaps we can document the TODOs here and follow up?

@richardstartin kindly reviewed the PR and his suggestions have been addressed. At this point, I have no open action items on the PR.