cedadev / archive-opensearch

Prototype Opensearch Application for the CEDA Archive

Current paging solution > 10k is not RESTful #109

Open rsmith013 opened 3 years ago

rsmith013 commented 3 years ago

How can we make random-access pagination work on top of Elasticsearch?

rsmith013 commented 3 years ago

Perhaps making use of:

https://www.elastic.co/guide/en/elasticsearch/reference/5.5/search-request-scroll.html#sliced-scroll

rsmith013 commented 3 years ago

I have spent this week looking at possible solutions for random, deep pagination (past the 10,000th result) and have not found one that provides timely responses.

I had thought that I could build a cache. Random pagination from a cached response is indeed very quick; the challenge comes when the query response is not cached and the cache has to be built.

Some simple analysis for building these cache objects:

Number of items in the dataset: Min: 10275 Median: 29805.0 Max: 847907

Processing time (based on building a cache for a dataset with 59,700 items, with page sizes of 1,000, giving processing time in seconds = 6.23 * number_of_pages + 5.37): Min: 0:01:09 Median: 0:03:11 Max: 1:28:07

The lower times might be acceptable as a one-off cost, which would then allow parallelised workflows to interact with the whole result set and reduce subsequent response times, but the upper end clearly is not.
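As a check on the figures above, here is a minimal sketch that applies the linear fit to the three dataset sizes. It assumes the result is in seconds and that number_of_pages is the fractional value items / 1,000 rather than a rounded-up page count; the function and constant names are only illustrative.

```python
from datetime import timedelta

PAGE_SIZE = 1_000          # items per page, as in the analysis
SECONDS_PER_PAGE = 6.23    # slope of the fit
OVERHEAD_SECONDS = 5.37    # intercept of the fit


def estimated_build_time(n_items: int) -> timedelta:
    """Estimate the cache build time for a result set of n_items."""
    # Assumption: fractional pages, not ceil(n_items / PAGE_SIZE)
    number_of_pages = n_items / PAGE_SIZE
    seconds = SECONDS_PER_PAGE * number_of_pages + OVERHEAD_SECONDS
    # Truncate to whole seconds for display
    return timedelta(seconds=int(seconds))


for label, n_items in [("Min", 10_275), ("Median", 29_805), ("Max", 847_907)]:
    print(label, estimated_build_time(n_items))
```

Under those assumptions the sketch gives the same Min, Median and Max times as quoted.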

A cache might be useful more generally if your workflows often repeat the same query to the same endpoints. This would require minimal engineering but might still provide a useful improvement.

In my research around the subject, it seems the answer to deep pagination is that you don’t. Instead you:

To help me figure out what is the next step, please may you answer the following questions:

rsmith013 commented 3 years ago

Spacebel use an extra parameter in the GET request to pass state. This parameter is included in the next and prev links in the response, which means that the client doesn't need a session.
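A minimal sketch of how such a link could be built on our side: the current request URL is copied and the state parameter is set to an opaque token. The parameter name search_after, the helper with_state and the example URL are assumptions for illustration, not Spacebel's actual implementation.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_state(url: str, token: str, param: str = "search_after") -> str:
    """Return url with the paging-state parameter set to token."""
    parts = urlsplit(url)
    # Drop any existing state parameter, keep all other query parameters
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    query.append((param, token))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical request URL and token
next_link = with_state("https://example.org/opensearch/request?q=temp", "abc")
print(next_link)
```

Because all of the state travels in the link itself, any client that follows next or prev can resume paging without the server holding a session.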

rsmith013 commented 3 years ago

We can use base64 encoding and decoding to convert the Elasticsearch sort key into a string which can be sent in the URL.

import json
import base64

# send with response: turn the sort key into a URL-safe string
sort_key = response['sort']
sort_b = json.dumps(sort_key).encode('utf-8')
b64 = base64.urlsafe_b64encode(sort_b).decode('ascii')

# process with request: turn the string back into the sort key
search_after = request.GET['search_after']
sort_b = base64.urlsafe_b64decode(search_after)
sort_key = json.loads(sort_b)
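The same idea as a self-contained round trip, as a sketch only. The helper names and the example sort values are hypothetical. The decoded key is placed in the search_after field of the Elasticsearch request body, which has to be used together with the same sort as the previous page.

```python
import base64
import json
from typing import Any, Dict, List, Optional


def encode_sort_key(sort_key: List[Any]) -> str:
    """Serialise a sort key to a URL-safe base64 token."""
    raw = json.dumps(sort_key, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_sort_key(token: str) -> List[Any]:
    """Recover the sort key from a token produced by encode_sort_key."""
    return json.loads(base64.urlsafe_b64decode(token.encode("ascii")))


def build_search_body(
    query: Dict[str, Any],
    sort: List[Dict[str, Any]],
    size: int,
    token: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a search body, resuming after the token's sort key if given."""
    body: Dict[str, Any] = {"query": query, "sort": sort, "size": size}
    if token:
        body["search_after"] = decode_sort_key(token)
    return body


# Hypothetical sort values taken from the last hit of a page
token = encode_sort_key([1577836800000, "abc123"])
body = build_search_body({"match_all": {}}, [{"date": "asc"}, {"id": "asc"}], 10, token)
print(token, body["search_after"])
```

The token may still end in "=" padding, so it should be percent-encoded like any other query value when it is written into a link. The sort should also include a unique tie-breaker field so that paging does not skip or repeat results.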