elastic / elasticsearch-dsl-py

High level Python client for Elasticsearch
http://elasticsearch-dsl.readthedocs.org
Apache License 2.0

Implement efficient pagination helpers using search_after #802

Open honzakral opened 6 years ago

honzakral commented 6 years ago

Currently, any pagination needs to be done manually, either via slicing (which can be inefficient for deep pagination) or using search_after [0], which can be complex. What I propose is to introduce several new methods on Search objects:

def get_page(self, page_no):
    """
    use slicing to get page `page_no` and return a response (it will execute your search)
    """

def get_next_page(self, last_hit, step=1):
    """
    use `search_after` and return a response representing the page `step` pages after the page ending with `last_hit`
    """

def get_previous_page(self, first_hit, step=1):
    """
    similar to get_next_page, but it has to reverse the sort order first to be able to use search_after
    """

and helper methods on Response to retrieve last_hit and first_hit (self.hits[0/-1].meta.sort), and also to use those directly to call get_next/previous_page.

[0] - https://www.elastic.co/guide/en/elasticsearch/reference/6.1/search-request-search-after.html

Or do people think this should be a separate object/module altogether? Is there anything I am missing? (Number of pages? A direct jump to the last/first page?)
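For illustration, the proposed helpers might be used roughly like this (a hypothetical sketch: the index name, sort fields, page size and exact call signatures are made up here, only the method names follow the proposal above):

from elasticsearch_dsl import Search

# a search sorted on a field plus a unique tiebreaker, 10 hits per page
s = Search(index="posts").sort("published_at", "id")[:10]

page_1 = s.get_page(1)                       # slicing under the hood
page_2 = s.get_next_page(page_1.hits[-1])    # search_after under the hood
page_1_again = s.get_previous_page(page_2.hits[0])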

honzakral commented 6 years ago

All the Search methods have been implemented in #806, with one small exception: there is no step parameter for get_(next|previous)_page, since I didn't realize that _search in elasticsearch doesn't support from when using search_after (I opened a ticket to address that - https://github.com/elastic/elasticsearch/issues/28068).

My question is whether it is OK like this or if we should implement the skip functionality in Python: when jumping with skip=3, request 3*SIZE documents and discard the first 2*SIZE, roughly as in the sketch below. It is not the most efficient way, of course, but it avoids deep pagination, which is even worse.
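A rough illustration of that idea (the function name and size handling are just for this sketch, not the API added in #806):

def get_page_after(search, last_hit_sort, step=1, size=10):
    # request `step` pages worth of hits after the last sort key; with
    # search_after, `from` has to stay 0, so we over-fetch instead
    s = search.extra(search_after=list(last_hit_sort))[:step * size]
    response = s.execute()
    # keep only the final page and throw the rest away
    return response.hits[(step - 1) * size:]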

What do people think?

drpump commented 6 years ago

While discarding unused docs is OK when skip=3, it wouldn't be OK with skip=500 or skip=last-1, for example. Limiting the number of available pages (@jimczi's suggestion on https://github.com/elastic/elasticsearch/issues/28068) is not necessarily an option when access to the last page is a legitimate requirement.

Is there a way to resolve this efficiently? I thought that search_after combined with from would get us at least partway there. I'd be interested in an explanation of why this isn't a good idea.

honzakral commented 6 years ago

@drpump thank you for the response!

There is currently no other way to do it efficiently, according to Jim; feel free to ask any questions on the elasticsearch ticket. From what I understand, the elasticsearch team is not super happy with search_after performance and usability and might want to update its internals. That is why they don't want to commit to additional functionality on top of it right now.

The last page is not hard, because we can invert the sort order (as the PR is doing) to jump directly to it. We could also enforce a maximum number of pages to be skipped and cap it at 10 or some other small arbitrary number...
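A rough sketch of the inverted-sort idea, assuming the search sorts ascending on published_at with a unique id tiebreaker (all of the names here are illustrative):

def get_last_page(search, size=10):
    # sort in the opposite direction, take the first page of that search,
    # then flip the hits back into the original ascending order
    reversed_search = search.sort("-published_at", "-id")[:size]
    hits = list(reversed_search.execute())
    hits.reverse()
    return hits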

drpump commented 6 years ago

Thanks, so I have 3 solutions I could implement:

  1. Paginate the first N (10,000 or other max) and the last N records (reverse search) and throw an exception for those in between (see the Python sketch after this comment).
  2. Use forward/reverse search with search_after and do a lazy fetch of the records in between (i.e. get the first N, get the last N, get the next N if required, get the next-to-last N, etc.). Some accuracy issues, but not significant for a large number of records.
  3. Retrieve metadata only for all matches using the scroll API and paginate over the resulting array. In Rails, due to integration with ActiveRecord, I can retrieve each match from my DB rather than going back to ES. This has memory and latency implications for my app, although a background fetch would probably make it perform OK. Again, some accuracy issues due to the currency of the scroll results, but not significant.

All are client-side solutions. I'd need to implement a new searcher class in Rails or monkey-patch the elasticsearch gems. Doable, if perhaps a bit messy.
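In Python terms (this thread's library rather than the Rails gems), option 1 could look roughly like this; MAX_PAGE, total_pages and the sort fields are assumptions made for the sketch:

MAX_PAGE = 1000  # e.g. 10,000 hits at 10 per page

def get_page_bounded(search, page_no, total_pages, size=10):
    if page_no < MAX_PAGE:
        # shallow enough to slice forward normally
        return list(search[page_no * size:(page_no + 1) * size].execute())
    if page_no >= total_pages - MAX_PAGE:
        # close to the end: run the reversed search and flip the page back;
        # note the pages won't line up exactly with forward page boundaries
        # unless the total hit count is a multiple of `size`
        pages_from_end = total_pages - 1 - page_no
        reversed_search = search.sort("-published_at", "-id")
        start = pages_from_end * size
        hits = list(reversed_search[start:start + size].execute())
        hits.reverse()
        return hits
    raise ValueError("page %d is too deep to paginate efficiently" % page_no)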

shotecorps commented 5 years ago

When using search_after, we need to choose a unique sort key, and there is some difficulty in choosing one. The _id field is not recommended, since it is not a doc_values field, and when a shard is large (for example close to 50G), sorting on _id leads to poor performance (compared to the default _doc sort). _doc is not suitable for sorting either, as it is not unique for each doc.
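One common workaround (my assumption, not something stated in this thread) is to index a doc_values copy of the document id, for example a keyword field named id, and use that as the tiebreaker:

from elasticsearch_dsl import Search

# sort on a regular field plus the doc_values `id` copy as the unique tiebreaker
s = Search(index="posts").sort("published_at", "id")[:10]
resp = s.execute()

# page forward with search_after using the sort values of the last hit
next_page = s.extra(search_after=list(resp.hits[-1].meta.sort)).execute()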