apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0
5.4k stars 1.26k forks source link

Pagination Support #5246

Open harshinielath opened 4 years ago

harshinielath commented 4 years ago

Right now, pagination only works with selection queries with order by

SELECT foo, bar FROM myTable
  WHERE baz > 20
  ORDER BY bar DESC
  LIMIT 50, 100

Paginate the selection results from the ‘a’ th results and return at most ‘b’ results.

The below query doesn't paginate the results

SELECT count(*), foo, bar FROM myTable
  WHERE baz > 20
 GROUP BY foo
  ORDER BY bar DESC
  LIMIT 50, 100
asockalingam commented 2 years ago

will this be supported in the future?

kishoreg commented 2 years ago

Yes, it's a bit challenging but we have received this request multiple times. Please consider upvoting the feature

lfernandez93 commented 2 years ago

@kishoreg how do we up vote this issue?

lfernandez93 commented 2 years ago

+1 is there any update on this?

asockalingam commented 2 years ago

Yes, it's a bit challenging but we have received this request multiple times. Please consider upvoting the feature

where do we upvote?

atris commented 2 years ago

@kishoreg as discussed offline, I will take this up

gracewalkuski commented 2 years ago

+1 Any update on this? We need it!

siddharthteotia commented 2 years ago

At LinkedIn, we have started to work on pagination on priority considering multiple requests we have received internally.

At a high level, our customer requirements are around the fact that they want to run query in Pinot that can potentially return a large response and users want the ability to paginate the response as multiple result sets (size per result set dictated by the user app).

The current pagination implementation in Pinot (even if it is just for selection query) is sub-optimal in the sense that it takes each query as a fresh query and executes the query again and again for every pagination window, discard the results outside the window and provides the result within the M, N window that user has asked for.

The main thing to note about pagination is that it has to be treated as a single query. In case of our customers, they don't want to run a one-off pagination query OFFSET M, FETCH N where M and N are completely random in which case it is not possible to reason about the results and it's even hard for the user to decide M as a one-off starting point. Result of a random pagination query doesn’t add any value to the user since they want to look at the entire result as a continuous stream of results / pages / batches with the will to stop anytime.

So, the semantics that we want to provide is that "I want to fetch 10 million records from Pinot for a query and want to fetch 100K at a time". The customer will typically start with M as 0 and might just keep N fixed (say at 10K or 100K etc) and just keep paging the results through multiple calls from their app which simply changes M during every call (and they potentially refresh the results in UI etc returned by Pinot in every call).

I think we should look at the pagination problem from this perspective as opposed to a random one-off pagination query. We are trying to tackle the problem from this angle. Detail design discussion is in progress.

Some more thoughts slightly related to this --

Now, one problem is that users who run such queries may have the tendency to think that support for pagination means they can run "any" query in Pinot that can be very long running and Pinot is guaranteed to finish it and provide results. This can easily cause OOM (out of memory) and bring down the cluster.

Pinot is unlikely to enter the territory of running very long running queries and getting the entire 100% accurate result by spilling to disk and avoiding OOM at all times. Presto should be used for those cases.

However, for some of our users (who are ok with multi-second latency and prefer slightly more accurate response for GROUP BY queries), as a follow-up / next phase, we want to consider enhancing support in Pinot for queries that return large responses and/or process / aggregate more than usual amounts of data. We want to do this by doing some of the memory intensive query execution operations in off-heap (direct) memory. This along with the ability to paginate a large response back sort of fulfills the requirements we are seeing in production.

siddharthteotia commented 2 years ago

@gracewalkuski - I am not fully aware of your requirements but please take a look at my comment above and let us know your feedback.

gracewalkuski commented 2 years ago

Hi @siddharthteotia Your comment looks good to me as our requirements are very similar. I will add - currently, we use the JDBC client to access our pinot databases via a spring application, but if we could get results paged directly from pinot to postgres (what we use for interim storage), that would meet our needs as well.

jackjlli commented 1 year ago

Hi all,

This is the proposed design doc on pagination support: https://docs.google.com/document/d/1sNch7Ck-H06De2mb0Bj67ajEBMawrLJn_MxNzDEIxPU/edit?usp=sharing

Please take a look at it and feel free to put any comments. Thanks!

Best, Jack

siddharthteotia commented 1 year ago

cc @walterddr ^^ @Jackie-Jiang

mayankshriv commented 1 year ago

Hi @jackjlli, wondering if you are still working on this and if so, any timeline for the same?

siddharthteotia commented 1 year ago

hey @mayankshriv - yes we are working on this.

mayankshriv commented 1 year ago

Hey @siddharthteotia any update on this feature?

hardikbhut333 commented 12 months ago

@mayankshriv - Any update on this feature? Your help will be really appreciated.

raghavmalpanidd commented 1 month ago

any update here?

monicaluodialpad commented 1 day ago

Any update here?

Jackie-Jiang commented 1 day ago

cc @vrajat