CasperWA opened this issue 3 years ago
Concerning pagination, I have found the following StackExchange post, which sheds some light on possible solutions and issues/pitfalls for this problem. Since my original idea is similar to the very hacky solution mentioned in the accepted answer, I am hesitant to do it this way, and am now considering whether it should be dropped, so that we could instead think of a different way of delivering the query results, split into the different databases.
E.g., now the results are updated by having the database's `id` prepended to all of the results' `id`s. So a resulting `id` of `test_123` becomes `DB-id/test_123`, i.e., the database's `id` and the result's `id` joined together with a forward slash (`/`) in between.
This could instead be changed in two ways (both sketched below, after the pros and cons):

1. Make the `data` field a JSON object (Python dictionary), where the keys are the database `id`s and the values are the corresponding result from each of the databases, including an empty list; or
2. add a `meta` value to each result that states its corresponding database `id`.

Pros:

- … the `data` field.
- … the database `id` in results' `meta` field.
- … `data`.

Cons:

- … the `data` field.
- … the `data` field value is expected to only be a list.
- … the database `id` in results' `meta` field.
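To make the two alternatives concrete, here is a purely hypothetical illustration of the resulting response shapes (the database `id`s, the entry, and the per-result `meta` key name are all made up for the example):

```python
# Option 1: the "data" field becomes a JSON object (Python dictionary) keyed by
# database id; databases without hits are still listed, with an empty list.
option_1_response = {
    "data": {
        "example_db_1": [{"id": "test_123", "type": "structures"}],
        "example_db_2": [],
    }
}

# Option 2: "data" stays a list, and each result carries its originating
# database id in a per-result "meta" value (the key name here is hypothetical).
option_2_response = {
    "data": [
        {"id": "test_123", "type": "structures", "meta": {"database": "example_db_1"}},
    ]
}
```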
What do you think @csadorf, @giovannipizzi or @ml-evs?
One of my original ideas was for each new gateway to initialize by always doing a single request to each database, storing the `meta.data_available`, if present (it's not a required or even SHOULD-level `meta` field), and calculating which range of results should be requested from each database whenever a `page_*` query parameter is used. The main downside is that it's slightly hacky and needs to be well tested for all edge cases, such as requesting a too-large page number or similar. It would also mean a slight decrease in responsiveness when using pagination - which is bad, as pagination will always be used by clients.
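As a rough sketch of that idea (not the actual implementation), the gateway could treat the databases as one concatenated list and use the cached `meta.data_available` values to translate its own `page_offset`/`page_limit` into a per-database offset and limit:

```python
def per_database_ranges(data_available, page_offset, page_limit):
    """Map a gateway (offset, limit) onto each database.

    ``data_available`` is a dict of database id -> cached ``meta.data_available``.
    Returns a dict of database id -> (offset, limit) to request from it.
    """
    ranges = {}
    start, stop = page_offset, page_offset + page_limit
    position = 0  # running offset into the imaginary concatenated list
    for database_id, available in data_available.items():
        db_start = max(start - position, 0)
        db_stop = min(stop - position, available)
        if db_start < db_stop:
            ranges[database_id] = (db_start, db_stop - db_start)
        position += available
    return ranges


# E.g., two databases with 60 and 100 entries, page_offset=50, page_limit=25:
# -> {"db_1": (50, 10), "db_2": (0, 15)}
per_database_ranges({"db_1": 60, "db_2": 100}, page_offset=50, page_limit=25)
```

The edge cases mentioned above (e.g. a page offset beyond the last available entry) simply yield an empty dict here, but stale `data_available` values would still need separate handling.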
Another solution found while doing a bit of digging: https://github.com/chrisvxd/combine-pagination#framed-range-intersecting
After some more perusing of various sites and solutions, I think perhaps there are two options:

1. … `page_limit` … and M is the number of databases in the gateway; or
2. …

I guess I'll try to implement the latter, to provide the gateway with some sort of pagination.
I think when we originally discussed this I envisioned an approach where we just merge the results from multiple providers into the same page and then return the page once all sources are dry or the page is full. Is that approach not workable? It means that results are mixed between data sources, but I believe that is 100% appropriate for the gateway until a specific sorting key is specified. In this approach the pagination of each individual data source is an implementation detail.
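A minimal sketch of that page-filling idea, under the assumption that each provider is exposed as a plain iterator of results (which could internally be backed by paginated requests):

```python
def fill_page(providers, page_limit):
    """Fill one gateway page from several provider result iterators.

    Pulls results round-robin from ``providers`` until the page is full or
    every provider has run dry; pagination of each source stays internal.
    """
    page = []
    exhausted = set()
    while len(page) < page_limit and len(exhausted) < len(providers):
        for index, provider in enumerate(providers):
            if index in exhausted or len(page) >= page_limit:
                continue
            try:
                page.append(next(provider))
            except StopIteration:
                exhausted.add(index)
    return page
```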
I will give just some general comments as I don't know the current implementation well enough (happy to discuss the design in person, it's probably faster). If at all possible, I would avoid waiting for all results before returning them: even just one provider being down or slow will make all of them slow. Users instead might expect to start getting the first results as soon as the fastest provider reports them; a GUI would then dynamically reorder results while they arrive (and assign them to the correct provider).
Then it becomes a question for the GUI (very important for the design, though, so it would be good to have a clear idea of how this is supposed to work) whether results are grouped by provider (and you ask the same page length from each of them, and show results while they arrive), or whether you mix them while they arrive, in which case you need to indicate to the user that more results might still be coming (e.g. 1) a progress indicator showing the % of providers that have replied, and 2) some indication that there might be "intermediate" results missing).
It would be good to check how e.g. services like Kayak or Skyscanner present this in their web GUIs, as they have thought hard about how to make this efficient (even if I think they cache results) and easy to understand.
I guess a good approach (for sorting) would be, say, to ask for the first 10 results from each DB and show them, with some graphical indicator that there might be more in between, and allow the user to click a button to fetch the next page of 10 more from the DB(s) that might have additional intermediate results.
I don't think it's critical to have an exact pagination of the very final results (though of course you can, e.g., show the first 50 of all the 10*(num providers) you fetched). But if the order can change, it's better not to "promise" a final GUI page length, and just show it dynamically.
Of course, all of this depends on the technology used to transfer results to the GUI - to do what I suggest, there should be a way to ask in "real time" whether there are more results (e.g. create a "session" or "token" per request, the gateway keeps retrieving data in the backend, and the client can check often whether there are new results in that session/token via AJAX or similar).
On the other hand, if the reply can be compliant with OPTIMADE that would be great (and I wouldn't worry if this requires minor changes to the API, we can discuss and extend it if important). However, I would prefer a slightly different API (ideally without having to "modify" the data that has to be forwarded by the gateway, except, as mentioned, for minor additional metadata), if this gives an improved user experience (speed and clarity) when searching.
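A minimal sketch of the session/token polling idea mentioned above, assuming a FastAPI-style gateway; the endpoint paths, `DATABASE_URLS` list and `fetch_all_results()` helper are hypothetical, not the actual gateway API:

```python
import uuid

from fastapi import BackgroundTasks, FastAPI

app = FastAPI()
QUERIES: dict = {}  # token -> {"state": ..., "results": [...]}
DATABASE_URLS: list = []  # assumed list of OPTIMADE base URLs (hypothetical)


async def fetch_all_results(base_url: str, filter_: str) -> list:
    """Hypothetical helper that pages through one database's results."""
    raise NotImplementedError


async def run_query(token: str, filter_: str) -> None:
    """Keep appending results in the background as each database replies."""
    for base_url in DATABASE_URLS:
        QUERIES[token]["results"].extend(await fetch_all_results(base_url, filter_))
    QUERIES[token]["state"] = "finished"


@app.post("/queries")
async def create_query(filter: str, background_tasks: BackgroundTasks):
    """Client submits a query and gets back a token to poll."""
    token = str(uuid.uuid4())
    QUERIES[token] = {"state": "started", "results": []}
    background_tasks.add_task(run_query, token, filter)
    return {"data": {"id": token, "type": "queries", "attributes": {"state": "started"}}}


@app.get("/queries/{token}")
async def get_query(token: str):
    """Client polls for the results gathered so far (via AJAX or similar)."""
    return {"data": {"id": token, "type": "queries", "attributes": QUERIES[token]}}
```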
> I think when we originally discussed this I envisioned an approach where we just merge the results from multiple providers into the same page and then return the page once all sources are dry or the page is full. Is that approach not workable? It means that results are mixed between data sources, but I believe that is 100% appropriate for the gateway until a specific sorting key is specified. In this approach the pagination of each individual data source is an implementation detail.

That would still result in queries where, e.g., the query parameter `page_limit` does not align with the results of a gateway.

Edit: But if we don't promise that (as also seems to be suggested by @giovannipizzi), then it should be fine, of course.
> I will give just some general comments as I don't know the current implementation well enough (happy to discuss the design in person, it's probably faster). If at all possible, I would avoid waiting for all results before returning them: even just one provider being down or slow will make all of them slow. Users instead might expect to start getting the first results as soon as the fastest provider reports them; a GUI would then dynamically reorder results while they arrive (and assign them to the correct provider).
>
> Then it becomes a question for the GUI (very important for the design, though, so it would be good to have a clear idea of how this is supposed to work) whether results are grouped by provider (and you ask the same page length from each of them, and show results while they arrive), or whether you mix them while they arrive, in which case you need to indicate to the user that more results might still be coming (e.g. 1) a progress indicator showing the % of providers that have replied, and 2) some indication that there might be "intermediate" results missing).
>
> It would be good to check how e.g. services like Kayak or Skyscanner present this in their web GUIs, as they have thought hard about how to make this efficient (even if I think they cache results) and easy to understand.

I think this gateway cannot handle this in any way, as this is usually an implementation done in JS or with CORS, i.e., directly in the browser. It might be able to mock it, but the design for the query-result delivery would have to change (drastically), I think.
> I guess a good approach (for sorting) would be, say, to ask for the first 10 results from each DB and show them, with some graphical indicator that there might be more in between, and allow the user to click a button to fetch the next page of 10 more from the DB(s) that might have additional intermediate results.
>
> I don't think it's critical to have an exact pagination of the very final results (though of course you can, e.g., show the first 50 of all the 10*(num providers) you fetched). But if the order can change, it's better not to "promise" a final GUI page length, and just show it dynamically.

Right. So this is the idea of implementing something similar to https://github.com/chrisvxd/combine-pagination#framed-range-intersecting or DynamoDB.
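For reference, roughly the idea behind combine-pagination's "framed range intersecting", as I understand it: with one sorted page from each database, only the results whose sort key lies within the range already covered by every database can safely be released; everything beyond that is held back until the next pages arrive. A hedged sketch of the concept only (not that library's actual JavaScript implementation, nor DynamoDB's):

```python
def framed_intersection(pages, key):
    """Release only the results inside the frame covered by all sorted pages.

    ``pages`` is one ascending-sorted page of results per database; ``key``
    extracts the sort key. Anything beyond the smallest "last key" of all
    pages might still interleave with a database's next page, so it is held
    back for a later round.
    """
    non_empty = [page for page in pages if page]
    if not non_empty:
        return [], []
    frame_end = min(key(page[-1]) for page in non_empty)
    released, held_back = [], []
    for page in non_empty:
        for entry in page:
            (released if key(entry) <= frame_end else held_back).append(entry)
    return sorted(released, key=key), held_back
```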
> Of course, all of this depends on the technology used to transfer results to the GUI - to do what I suggest, there should be a way to ask in "real time" whether there are more results (e.g. create a "session" or "token" per request, the gateway keeps retrieving data in the backend, and the client can check often whether there are new results in that session/token via AJAX or similar).
>
> On the other hand, if the reply can be compliant with OPTIMADE that would be great (and I wouldn't worry if this requires minor changes to the API, we can discuss and extend it if important). However, I would prefer a slightly different API (ideally without having to "modify" the data that has to be forwarded by the gateway, except, as mentioned, for minor additional metadata), if this gives an improved user experience (speed and clarity) when searching.
> > I think when we originally discussed this I envisioned an approach where we just merge the results from multiple providers into the same page and then return the page once all sources are dry or the page is full. Is that approach not workable? It means that results are mixed between data sources, but I believe that is 100% appropriate for the gateway until a specific sorting key is specified. In this approach the pagination of each individual data source is an implementation detail.
>
> That would still result in queries where, e.g., the query parameter `page_limit` does not align with the results of a gateway.

I don't understand this answer. The gateway acts as an OPTIMADE provider. I tell the gateway I want to get a maximum of 50 results per page. As a user I at first don't care where those results are coming from, I just want the first 50 results. What is not aligning here?
> > > I think when we originally discussed this I envisioned an approach where we just merge the results from multiple providers into the same page and then return the page once all sources are dry or the page is full. Is that approach not workable? It means that results are mixed between data sources, but I believe that is 100% appropriate for the gateway until a specific sorting key is specified. In this approach the pagination of each individual data source is an implementation detail.
> >
> > That would still result in queries where, e.g., the query parameter `page_limit` does not align with the results of a gateway.
>
> I don't understand this answer. The gateway acts as an OPTIMADE provider. I tell the gateway I want to get a maximum of 50 results per page. As a user I at first don't care where those results are coming from, I just want the first 50 results. What is not aligning here?

Because the results from each database come in bulk, not one-by-one. So one would still have to wait for all results to come in, then afterwards apply a mixed sorting, and then finally cut off any unwanted entries.
> Because the results from each database come in bulk, not one-by-one

Why are they coming in bulk? We can request paginated results internally, can we not?
> > Because the results from each database come in bulk, not one-by-one
>
> Why are they coming in bulk? We can request paginated results internally, can we not?

Sure. But this would result in many small external calls that each retrieve only one or a couple of entries, which is quite inefficient. And how would you divide up the results among the various databases? Do you only return the 50 fastest results, while the user is expecting a mix of all databases where possible? And if the number of entries doesn't divide evenly when splitting the requested limit by the number of databases, how do you determine which to include or exclude?

I think the consequence of your first comment and @giovannipizzi's comment is rather that the gateway shouldn't "promise" a returned number of entries.

In any case @csadorf, I think what you're describing is similar to the answer here, if I understand correctly? And this might be the solution that makes the most sense.
Like @giovannipizzi pointed out, it is important that we return results as they arrive (first-come-first-serve), so that our response is not slowed down to the speed of the slowest data source in the mix, or potentially even stalled completely. The user expects that the results come from all sources, but unless they specify a specific ordering, there can be no expectation that they are in some sense well-mixed. Eventually the complete result set should contain all valid results from all sources, but if paginated, there can be no such expectation.
This means that we should indeed make smaller requests to the individual sources and then start filling our pages immediately. I am not sure why that would be inefficient? It just means that we make multiple smaller requests at once. We can even specify a minimum page limit that we divide our page into, meaning with a page size of N and M data sources, the individual page limit for each data source is `max(PAGE_SIZE_MIN, N // M)`. The `PAGE_SIZE_MIN` could be 50 or so to ensure that we don't make requests that are too small. Ideally all sources are roughly equally fast and deliver their pages at similar speeds, but if they don't, that's fine, we just fill up our own pages as we go.
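A minimal sketch of just that per-source page limit (nothing beyond what is described above is implied):

```python
PAGE_SIZE_MIN = 50  # lower bound so internal requests don't become too small


def per_source_page_limit(page_size: int, number_of_sources: int) -> int:
    """Page limit to request from each data source for one gateway page."""
    return max(PAGE_SIZE_MIN, page_size // number_of_sources)


per_source_page_limit(200, 3)  # -> 66
per_source_page_limit(50, 4)   # -> 50, never below PAGE_SIZE_MIN
```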
As described in the README here, pagination and sorting are issues that need attention.
Pagination
Priority: MUST be solved
This should be solved, either via response caching or a query algorithm that ensures entries from any gateway databases are not overlooked, even for varying lengths of available entries in the databases.
Sorting
Priority: MAY be solved
This I consider secondary and not a MUST, since sorting is not a required property of OPTIMADE implementations. However, since it is a very desirable property when using this package as a backend for a client, it still has a high priority. That said, it cannot be implemented until the pagination issue has been solved. Furthermore, I think sorting should be disabled for `attributes` fields if just a single gateway database does not support sorting.

A possible implementation pathway could be to query the relevant `/info/<entry>` endpoint to collect all sortable fields and only offer sorting on the sortable fields common to all the gateway's databases. This should then be reflected in the gateway's own `/info/<entry>` endpoint.
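A hedged sketch of that pathway, assuming the standard OPTIMADE entry info response shape (i.e. `data.properties.<field>.sortable`); the helper name is hypothetical:

```python
def common_sortable_fields(info_responses):
    """Intersect the sortable fields advertised by each database.

    ``info_responses`` maps database id -> parsed ``/info/<entry>`` response.
    Only fields marked ``"sortable": true`` by *every* database are returned,
    i.e. the fields the gateway's own ``/info/<entry>`` endpoint could offer.
    """
    sortable_sets = []
    for info in info_responses.values():
        properties = info.get("data", {}).get("properties", {})
        sortable_sets.append(
            {name for name, prop in properties.items() if prop.get("sortable")}
        )
    return set.intersection(*sortable_sets) if sortable_sets else set()
```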