Materials-Consortia / optimade-gateway

A gateway for querying OPTIMADE APIs.
https://www.optimade.org/optimade-gateway
MIT License

Pagination & Sorting #66

Open CasperWA opened 3 years ago

CasperWA commented 3 years ago

As described in the README here, pagination and sorting are issues that need attention.

Pagination

Priority: MUST be solved

This should be solved either via response caching or via a query algorithm that ensures entries from the gateway's databases are not overlooked, even when the databases have varying numbers of available entries.

Sorting

Priority: MAY be solved

I consider this secondary and not a MUST, since sorting is not a required feature of OPTIMADE implementations. However, since it is very desirable when using this package as a backend for a client, it still has a high priority. It cannot be implemented until the pagination issue has been solved, though. Furthermore, I think sorting on an attributes field should be disabled if even a single gateway database does not support sorting on it.

A possible implementation pathway could be to query the relevant /info/<entry> endpoint of each database to collect all sortable fields and only offer sorting on the sortable fields common to all the gateway's databases. This should then be reflected in the gateway's own /info/<entry> endpoint.
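
A minimal sketch of that intersection, assuming the standard OPTIMADE entry-info response shape (each property may carry an optional boolean sortable flag) and hypothetical database base URLs; this is not the gateway's actual code:

```python
# Hypothetical helper (not the gateway's actual code): collect the sortable
# fields from each database's /info/<entry> endpoint and keep only those
# common to all of them.
import httpx

def common_sortable_fields(base_urls: list[str], entry_type: str = "structures") -> set[str]:
    common: set[str] | None = None
    for base_url in base_urls:
        response = httpx.get(f"{base_url.rstrip('/')}/info/{entry_type}", timeout=10)
        response.raise_for_status()
        properties = response.json()["data"]["properties"]
        # "sortable" is an optional boolean in each property description.
        sortable = {name for name, prop in properties.items() if prop.get("sortable")}
        common = sortable if common is None else common & sortable
    return common or set()
```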

CasperWA commented 3 years ago

Concerning pagination, I have found the following StackExchange post, which sheds some light on possible solutions and pitfalls for this problem. Since my original idea is similar to the very hacky solution mentioned in the accepted answer, I am hesitant to do it this way, and am now considering whether it should be dropped; instead we could think of a different way of delivering the query results, split by database.

E.g., right now the results are updated by having the database's id prepended to each result's id. So a result id of test_123 becomes DB-id/test_123, i.e., the database's id and the result's id joined with a forward slash (/). This could instead be changed in one of two ways (both sketched after the list below):

  1. Make the response's top-level data field a JSON object (Python dictionary), where the keys are the database ids and the values are the corresponding results from each database (possibly an empty list); or
  2. Add a meta value to each result that states its corresponding database id.
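
To make the two options concrete, here is roughly what each response shape could look like (illustrative payloads only; the database ids, entry ids, and the database_id meta key are made up):

```python
# Option 1 (illustrative only): the top-level "data" field becomes an object
# keyed by database id.
option_1 = {
    "data": {
        "db_1": [{"id": "test_123", "type": "structures", "attributes": {}}],
        "db_2": [],  # a database with no matching entries still appears, with an empty list
    },
    "meta": {},
}

# Option 2 (illustrative only): "data" stays a list, and each result carries
# the originating database id in its per-resource "meta" field, leaving "id"
# untouched.
option_2 = {
    "data": [
        {
            "id": "test_123",
            "type": "structures",
            "attributes": {},
            "meta": {"database_id": "db_1"},  # hypothetical key name
        }
    ],
    "meta": {},
}
```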

Pros:

  1. JSON object value for data field.
    • No need to change or in any other way reason about the resulting data from each database - just copy-paste.
    • Quick and easy way to get all results from a specific database.
    • Can result in a mock-dynamic response, since each database's response can be added in parallel to the response without interference.
  2. Database id in results' meta field.
    • The API response is still fully compliant with the OPTIMADE specification, which expects a list as the value for data.
    • Does not interfere with the actual data, only adds additional, disposable metadata.

Cons:

  1. JSON object value for data field.
    • Breaks the OPTIMADE specification for how a multiple entry-listing endpoint should look, since the data field value is expected to only be a list.
  2. Database id in results' meta field.
    • Need to load in and handle each and every returned result from all databases for each response (same as now).
    • Cannot do a mock-dynamic response, as each packet of results needs to be added to the same container value (same as now).
    • Slightly more difficult to divide the results into each database after the fact, since one would have to go through all of the individual results and sort them (same as now).

What do you think @csadorf, @giovannipizzi or @ml-evs?

CasperWA commented 3 years ago

One of my original ideas was for each new gateway to initialize by doing a single request to each database, storing meta.data_available if present (it's not a required or even SHOULD-level meta field), and then calculating which range of results should be requested from each database whenever a page_* query parameter is used. The main downside is that it's slightly hacky and needs to be well tested for all edge cases, such as requesting a page number that is too large. It would also mean a slight decrease in responsiveness when using pagination, which is bad, as pagination will always be used in clients.
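
A rough sketch of that offset calculation, assuming offset-based pagination, a fixed (agreed-upon) database order, and cached data_available counts; the function name and signature are hypothetical:

```python
# Hypothetical sketch of the idea above: given cached meta.data_available
# counts (gathered once per gateway) and a fixed, agreed-upon database order,
# map a global page_offset/page_limit onto the offset and limit to request
# from each individual database.
def per_database_ranges(
    data_available: dict[str, int], page_offset: int, page_limit: int
) -> dict[str, tuple[int, int]]:
    ranges: dict[str, tuple[int, int]] = {}
    skipped = 0  # number of entries that precede the current database
    remaining = page_limit
    for db_id, available in data_available.items():
        if remaining <= 0:
            break
        local_offset = max(0, page_offset - skipped)
        if local_offset < available:
            take = min(available - local_offset, remaining)
            ranges[db_id] = (local_offset, take)
            remaining -= take
        skipped += available
    return ranges
```

Requesting a page past the total number of available entries would simply yield an empty mapping, which is exactly the kind of edge case that would need explicit tests.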

CasperWA commented 3 years ago

Another solution found while doing a bit of digging: https://github.com/chrisvxd/combine-pagination#framed-range-intersecting

CasperWA commented 3 years ago

After some more perusing of various sites and solutions, I think perhaps there are two options:

I guess I'll try to implement the latter, to provide the gateway with some sort of pagination.

csadorf commented 3 years ago

I think when we originally discussed this I envisioned an approach where we just merge the results from multiple providers into the same page and then return the page once all sources are dry or the page is full. Is that approach not workable? It means that results are mixed between data sources, but I believe that is 100% appropriate for the gateway until a specific sorting key is specified. In this approach the pagination of each individual data source is an implementation detail.

giovannipizzi commented 3 years ago

I will give just some general comments as I don't know the current implementation well enough (happy to discuss the design in person, it's probably faster). If at all possible, I would avoid waiting for all results before returning them: even just one provider being down or slow would make everything slow. Users instead might expect to start getting the first results as soon as the fastest provider reports them; a GUI would then reorder results dynamically as they arrive (and assign them to the correct provider).

Then it becomes a question for the GUI (very important for the design, though, so it would be good to have a clear idea of how this is supposed to work) whether results are grouped by provider (asking each of them for the same page length and showing results as they arrive), or whether you mix them as they arrive, in which case you need to indicate to the user that more results might still be coming (e.g. 1) a progress indicator showing the % of providers that have replied, and 2) some indication that there might be "intermediate" results missing).

It would be good to check how providers like Kayak or Skyscanner present this in their web GUIs, as they have thought hard about how to make this efficient (even if I think they cache results) and easy to understand.

I guess a good approach (for sorting) would be, say, to ask for the first 10 results from each DB and show them, with some graphical indicator that there might be more in between, and a button to fetch the next page of 10 more from the DB(s) that might have additional intermediate results.

I don't think it's critical to have exact pagination of the very final results (though of course you can, e.g. show the first 50 of the 10*(num providers) you fetched). But if the order can change, it's better not to "promise" a final GUI page length and instead just show it dynamically.

Of course, all of this depends on the technology used to transfer results to the GUI. To do what I suggest, there should be a way to ask in "real time" whether there are more results (e.g. create a "session" or "token" per request, have the gateway keep retrieving data in the backend, and let the client check frequently, via AJAX or similar, whether there are new results in that session/token).
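
A minimal client-side sketch of such a session/token flow, assuming hypothetical /queries endpoints and hypothetical state and response attribute names on the gateway; illustrative only, not an existing API:

```python
# Illustrative client-side flow only; the /queries endpoints and the "state"
# and "response" attribute names are assumptions, not an existing API.
import time

import httpx

def poll_gateway_query(gateway_url: str, filter_: str, page_limit: int = 20) -> list[dict]:
    # Create a query "session"; the gateway keeps fetching from the databases
    # in the background.
    created = httpx.post(
        f"{gateway_url.rstrip('/')}/queries",
        json={"filter": filter_, "page_limit": page_limit},
    )
    created.raise_for_status()
    query_id = created.json()["data"]["id"]

    # Poll until the query is finished; a GUI would instead update its view
    # incrementally on every poll.
    while True:
        attributes = httpx.get(
            f"{gateway_url.rstrip('/')}/queries/{query_id}"
        ).json()["data"]["attributes"]
        if attributes["state"] == "finished":
            return attributes["response"]["data"]
        time.sleep(1)
```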

On the other hand, if the reply can be compliant with OPTIMADE, that would be great (and I wouldn't worry if this requires minor changes to the API; we can discuss and extend it if important). However, I would prefer a slightly different API (ideally without having to "modify" the data forwarded by the gateway, except, as mentioned, for minor additional metadata) if this gives an improved user experience (speed and clarity) when searching.

CasperWA commented 3 years ago

> I think when we originally discussed this I envisioned an approach where we just merge the results from multiple providers into the same page and then return the page once all sources are dry or the page is full. Is that approach not workable? It means that results are mixed between data sources, but I believe that is 100% appropriate for the gateway until a specific sorting key is specified. In this approach the pagination of each individual data source is an implementation detail.

That would still result in queries where, e.g., the query parameter page_limit does not align with the results returned by the gateway.

Edit: But if we don't promise that (as also seems to be suggested by @giovannipizzi), then it should be fine, of course.

CasperWA commented 3 years ago

> I will give just some general comments as I don't know the current implementation well enough (happy to discuss the design in person, it's probably faster). If at all possible, I would avoid waiting for all results before returning them: even just one provider being down or slow would make everything slow. Users instead might expect to start getting the first results as soon as the fastest provider reports them; a GUI would then reorder results dynamically as they arrive (and assign them to the correct provider).

> Then it becomes a question for the GUI (very important for the design, though, so it would be good to have a clear idea of how this is supposed to work) whether results are grouped by provider (asking each of them for the same page length and showing results as they arrive), or whether you mix them as they arrive, in which case you need to indicate to the user that more results might still be coming (e.g. 1) a progress indicator showing the % of providers that have replied, and 2) some indication that there might be "intermediate" results missing).

> It would be good to check how providers like Kayak or Skyscanner present this in their web GUIs, as they have thought hard about how to make this efficient (even if I think they cache results) and easy to understand.

I think this gateway cannot handle this in any way, as this is usually implemented in JS or with CORS, i.e., directly in the browser. It might be able to mock it, but the design for the query-result delivery would have to change (drastically), I think.

> I guess a good approach (for sorting) would be, say, to ask for the first 10 results from each DB and show them, with some graphical indicator that there might be more in between, and a button to fetch the next page of 10 more from the DB(s) that might have additional intermediate results.

> I don't think it's critical to have exact pagination of the very final results (though of course you can, e.g. show the first 50 of the 10*(num providers) you fetched). But if the order can change, it's better not to "promise" a final GUI page length and instead just show it dynamically.

Right. So this is the idea of implementing something similar to https://github.com/chrisvxd/combine-pagination#framed-range-intersecting or DynamoDB.
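
One loose reading of the framed-range idea (an interpretation, not necessarily the linked library's actual implementation): when merging already-sorted batches from several databases, only entries whose sort key does not exceed the smallest per-database maximum can safely be emitted; anything beyond that frame is held back, since a slower database might still deliver entries that sort in between. A hypothetical sketch:

```python
# Loose sketch of a "framed range" style merge (an interpretation, not the
# linked library's actual implementation), for ascending sorts.
def merge_within_frame(
    batches: dict[str, list[dict]], sort_key: str
) -> tuple[list[dict], dict[str, list[dict]]]:
    non_empty = [batch for batch in batches.values() if batch]
    if not non_empty:
        return [], batches

    # Upper bound of the "safe" frame: the smallest of the per-database maxima.
    frame_end = min(max(entry[sort_key] for entry in batch) for batch in non_empty)

    emitted = sorted(
        (
            entry
            for batch in batches.values()
            for entry in batch
            if entry[sort_key] <= frame_end
        ),
        key=lambda entry: entry[sort_key],
    )
    # Entries beyond the frame are held back for the next merged page.
    leftovers = {
        db_id: [entry for entry in batch if entry[sort_key] > frame_end]
        for db_id, batch in batches.items()
    }
    return emitted, leftovers
```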

> Of course, all of this depends on the technology used to transfer results to the GUI. To do what I suggest, there should be a way to ask in "real time" whether there are more results (e.g. create a "session" or "token" per request, have the gateway keep retrieving data in the backend, and let the client check frequently, via AJAX or similar, whether there are new results in that session/token).

> On the other hand, if the reply can be compliant with OPTIMADE, that would be great (and I wouldn't worry if this requires minor changes to the API; we can discuss and extend it if important). However, I would prefer a slightly different API (ideally without having to "modify" the data forwarded by the gateway, except, as mentioned, for minor additional metadata) if this gives an improved user experience (speed and clarity) when searching.

csadorf commented 3 years ago

> I think when we originally discussed this I envisioned an approach where we just merge the results from multiple providers into the same page and then return the page once all sources are dry or the page is full. Is that approach not workable? It means that results are mixed between data sources, but I believe that is 100% appropriate for the gateway until a specific sorting key is specified. In this approach the pagination of each individual data source is an implementation detail.

> That would still result in queries where, e.g., the query parameter page_limit does not align with the results returned by the gateway.

I don't understand this answer. The gateway acts as an OPTIMADE provider. I tell the gateway I want to get a maximum of 50 results per page. As a user, I at first don't care where those results come from; I just want the first 50 results. What is not aligning here?

CasperWA commented 3 years ago

> I think when we originally discussed this I envisioned an approach where we just merge the results from multiple providers into the same page and then return the page once all sources are dry or the page is full. Is that approach not workable? It means that results are mixed between data sources, but I believe that is 100% appropriate for the gateway until a specific sorting key is specified. In this approach the pagination of each individual data source is an implementation detail.

> That would still result in queries where, e.g., the query parameter page_limit does not align with the results returned by the gateway.

> I don't understand this answer. The gateway acts as an OPTIMADE provider. I tell the gateway I want to get a maximum of 50 results per page. As a user, I at first don't care where those results come from; I just want the first 50 results. What is not aligning here?

Because the results from each database come in bulk, not one-by-one. So one would still have to wait for all results to come in, then apply a mixed sorting, and finally cut off any unwanted entries.

csadorf commented 3 years ago

> Because the results from each database come in bulk, not one-by-one

Why are they coming in bulk? We can request paginated results internally, can we not?

CasperWA commented 3 years ago

> Because the results from each database come in bulk, not one-by-one

> Why are they coming in bulk? We can request paginated results internally, can we not?

Sure, but this would result in many small external calls to retrieve single entries or small batches, which is quite inefficient. And how would you divide the results up between the various databases? Do you only return the 50 fastest results, while the user expects a mix of all databases where possible? If the number of entries doesn't divide evenly when splitting the requested limit among the databases, how do you determine which to include or exclude?

I think the consequence of your first comment and @giovannipizzi's comment is rather that the gateway shouldn't "promise" a returned number of entries.

CasperWA commented 3 years ago

In any case @csadorf, I think what you're describing is similar to the answer here. If I understand correctly? And this might be the solution that makes the most sense.

csadorf commented 3 years ago

Like @giovannipizzi pointed out, it is important that we return results as they arrive (first-come-first-served) so as not to slow our response down to the slowest data source in the mix, or potentially even stall completely. The user expects the results to come from all sources, but unless they specify a specific ordering, there can be no expectation that they are in some sense well-mixed. Eventually the complete result set should contain all valid results from all sources, but within a single page there can be no such expectation.

This means that we should indeed make smaller requests to the individual sources and then start filling our pages immediately. I am not sure why that would be inefficient? It just means that we make multiple smaller requests at once. We can even specify a minimum page limit that we divide our page into, meaning that with a page size of N and M data sources, the individual page limit for each data source is max(PAGE_SIZE_MIN, N//M). The PAGE_SIZE_MIN could be 50 or so, to ensure that we don't make overly small requests. Ideally all sources are roughly equally fast and deliver their pages at similar speeds, but if they don't, that's fine; we just fill up our own pages as we go.
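
A rough async sketch of that page-filling approach, assuming httpx and the standard OPTIMADE /structures endpoint on each database; names and the exact splitting are illustrative only:

```python
# Rough async sketch of the approach described above (illustrative only):
# split the gateway's page limit across the databases, fire the sub-requests
# concurrently, and fill the gateway page first-come-first-served.
import asyncio

import httpx

PAGE_SIZE_MIN = 50  # lower bound to avoid overly small sub-requests

async def fill_page(database_urls: list[str], filter_: str, page_limit: int) -> list[dict]:
    per_db_limit = max(PAGE_SIZE_MIN, page_limit // len(database_urls))
    page: list[dict] = []

    async with httpx.AsyncClient(timeout=30) as client:

        async def fetch(base_url: str) -> list[dict]:
            response = await client.get(
                f"{base_url.rstrip('/')}/structures",
                params={"filter": filter_, "page_limit": per_db_limit},
            )
            response.raise_for_status()
            return response.json()["data"]

        tasks = [asyncio.ensure_future(fetch(url)) for url in database_urls]
        try:
            for finished in asyncio.as_completed(tasks):
                page.extend(await finished)
                if len(page) >= page_limit:
                    break  # the page is full; no need to wait for slower databases
        finally:
            for task in tasks:
                task.cancel()  # drop any still-pending sub-requests
            await asyncio.gather(*tasks, return_exceptions=True)

    return page[:page_limit]
```

A real implementation would also need to handle a failing or slow database gracefully instead of letting a single error abort the whole page.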

csadorf commented 3 years ago

> In any case @csadorf, I think what you're describing is similar to the answer here. If I understand correctly? And this might be the solution that makes the most sense.

I guess it's a similar problem, yes.