Add special response/error for case where deep pagination is not accessible

ml-evs commented 1 year ago

Some database backends have limitations on the depth of pagination into a filter, for example Elasticsearch by default does not allow random access to pages of results beyond the first 10000 results.

For example, querying for the 10001th result in NOMAD for a given filter "nsites > 0":

"errors": [
    {
         "status":"400",
         "title":"RequestError",
         "detail":"RequestError(400, 'search_phase_execution_exception', 'Result window is too large, from + size must be less than or equal to: [10000] but was [10020]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.')"
    }
]

One typical workaround here is to specify a sort field (say, ID) and direction, then begin a new query that filters out the results from the first page by querying for id > <last_id_on_prev_page>.

Is this problem ubiquitous enough that we could add a specific error type to the specification that clients could employ in order to apply client-side workarounds to this problem?

One issue to mention is the potential for data inconsistency between queries, but we don't concern ourselves with that in the spec atm...

rartino commented 1 year ago

Since there seems to exist a general solution for this problem, is there a reason to put it on clients to deal with it?

Couldn't those with backends that work like just rewrite the next link so it works? It would require them to always impose an outermost sort order for results, e.g. on 'id' even if the client doesn't ask for it, and after N number of pages rework the next link to incorporate the condition id > <last id> and reset the offset used to 0?

If I'm not misremembering this, the next link doesn't have to be interpretable as a standard OPTIMADE query, one is free to inject ones own query parameters to guide the paging. I.e., just append separate query parameters &_exmpl_min_id=42&_exmpl_offset_page=8 and ignore the regular page if they are present.

About data consistency: OPTIMADE doesn't impose any requirements here (but maybe it should), so I expect many providers not to handle this consistently even when their paging "works". I suspect it can never be a MUST requirement to handle both paging of huge result sets and data consistency, but if your database can do it, you can of course also append &_exmpl_data_state=2023-02-07T-15:50+00 to the next link and keep it even as you "re-page" the query.

ml-evs commented 1 year ago

Since there seems to exist a general solution for this problem, is there a reason to put it on clients to deal with it?

It's not so much "putting it on them" but making it possible full stop (in an automatic way across databases), which I think requires the error.

Couldn't those with backends that work like just rewrite the next link so it works? It would require them to always impose an outermost sort order for results, e.g. on 'id' even if the client doesn't ask for it, and after N number of pages rework the next link to incorporate the condition id > <last id> and reset the offset used to 0?

I guess they could, but we either enforce that you should always be able to get all data for a query through next links or we provide a special error message that allows providers to opt out and lets the clients handle it how they please.

Markus has been responding in https://github.com/nomad-coe/nomad/issues/45 about ways we can get around the root problem in ES.

rartino commented 1 year ago

I don't see any reason to object against a "new" error code as long as there is one provider who says "this would be useful for us", so feel free to PR this if this is the case.

Nevertheless, to "make it possible" to use for clients, you still need the database to implement the error code. When they do that, I'd think most databases would realize they could as well fix the pagination issue in their implementation so the client doesn't have to bother with that error code.

Markus has been responding in https://github.com/nomad-coe/nomad/issues/45 about ways we can get around the root problem in ES.

So, it seems in this case the change may just be to swap out what ES feature is used for paging the results, so it seems at least for that implementation, no error code will be needed.

ml-evs commented 1 year ago

I don't see any reason to object against a "new" error code as long as there is one provider who says "this would be useful for us", so feel free to PR this if this is the case.

Okay, lets see how it goes then.

Quick follow-up Q before I close this as I'm now implementing page_below/above in optimade-python-tools -- the spec currently says:

fetch up to 100 structures above sort-field value 4000 (in this example, server chooses to fetch results sorted by increasing id, so page_above value refers to an id value): /structures?page_above=4000&page_limit=100.

I think the page_above value should be in quotes (as pointed about by @JPBergsma) as the value is not an identifier (although it is an id... confusingly!). If we agree I'll make a PR.

rartino commented 1 year ago

... no?

These are REST query parameters. They are always "strings" when they arrive through your framework's handling of GET parameters, so quoting them would be strange, it would mean you specify the page to be the string "4000" including the quotes.

ml-evs commented 1 year ago

These are REST query parameters. They are always "strings" when they arrive through your framework's handling of GET parameters, so quoting them would be strange, it would mean you specify the page to be the string "4000" including the quotes.

But page_above does not refer to a page number, it refers to a database ID, e.g., page_above=mp-102, my concern is that IDs are allowed to have non-URL-encodable chars that could break URLs here unless they are quoted, e.g., page_above=odbx/1.

rartino commented 1 year ago

non-URL-encodable

You mean non-url safe? As far as I know, all characters can be represented via url-encoding, and are then url-decoded as part of resolving the GET parameters. Enclosing the URL query parameters with quotes will neither help or prevent that; you'll just end up with a string with quotes around it in the end.

ml-evs commented 1 year ago

non-URL-encodable

You mean non-url safe? As far as I know, all characters can be represented via url-encoding, and are then url-decoded as part of resolving the GET parameters. Enclosing the URL query parameters with quotes will neither help or prevent that; you'll just end up with a string with quotes around it in the end.

Yes, sorry, sloppy language. I'm not sure I'm making myself clear. Its only the value that needs to be in quotes, in the same way that it would be for a filter query, e.g.,

?page_above="odbx/1" w.r.t. ?filter=id="odbx/1"...

I don't think there is anything stopping an ID containing e.g., & (and in fact some random IDs might) so my_id&response_fields=cartesian_site_positions would be a valid OPTIMADE id... so the automatically generated pagination link for this would be ?page_above=my_id&response_fields=cartesian_site_positions... I'm not sure how servers are meant to deal with this in a sane way.

Likewise, we seem to be saying the definition above that the page_above value can be relative to whatever the server wants (i.e., a default sort, or a provided one) so we are implicitly constraining all string fields served OPTIMADE to be URL safe, e.g., if someone decided to index over their own _custom_id field that already contains & etc. then they would not be able to use value based pagination.

rartino commented 1 year ago

I think it may just be me lousy at explaining :-)

A properly URL encoded version of the query part of a REST request communicating that the parameter page_above should be set to this is&a strange/id and response_fields should be set to cartesian_site_positions is:

`?page_above=this%20is%26a%20strange%2Fid&response_fields=cartesian_site_positions

A reasonable REST framework will url-decode this and have no trouble parsing it into some form of GET key-value data structure corresponding to the following JSON:

{ 
  "page_above": "this is&a strange/id",
  "response_fields": "cartesian_site_positions"
}

If you instead put the query parameter value in quotes before you URL-encode the request, the URL-encoded version is:

?page_above=%22this%20is%26a%20strange%2Fid%22&response_fields=cartesian_site_positions

and the REST framework would parse it into (also JSON, with \ escaping the double quotes present in the string):

{ 
  "page_above": "\"this is&a strange/id\"",
  "response_fields": "cartesian_site_positions"
}

which seems to provide little benefit over not including the quotes.

ml-evs commented 1 year ago

Right, I'm with you now, thanks for the full explanation. I have been handcrafting queries for too long to appreciate that & would not be valid but %26% is...

JPBergsma commented 1 year ago

I find this very strange. When I use a filter, I have to use quotes like: filter=chemical_formula_reduced="F2O" But when I use the same property in combination with page_above I should not use quotes ? e.g. ?sort=chemical_formula_reduced, page_limit=20, page_above=F2O

ps. Technicaly we could do it as described by @rartino, but I think it would be inconsistent to use one kind of formatting for one ~field~ query parameter and a different kind of formatting for another ~~field~~ query parameter.

rartino commented 1 year ago

The contexts and respective left hand sides in your examples refer to different things. In the filter case, the identifiers are property names, which refer to things with data types; whereas HTTP GET parameter names usually aren't seen as things with data types ("they are all strings").

Sure, an API specification could still decide to say that the REST parameters are to be quoted. The outcome is that clients need to do an unnecessary extra layer of escaping (page_above="id-with-a-double-quote-\"-for-some-reason") to be unescaped on the server side, for no practical benefit.

By the way, for OPTIMADE, wouldn't this also mean the whole filter expression has to be enclosed in quotes? So, then one needs to escape the quotes inside the filter string, and double escape (i.e., three backslashes) double quotes inside strings inside the filter expression. I can't believe anyone wants that.

Anyway, aside all this, we already defined - at least via examples - these things to not have quotes. Enforcing quotes now is a client breaking change (widely so for the filter, unless you mean that is for some reason exempt from quoting?), so, if you genuinely mean to propose this behavior, it would IMO have to be for OPTIMADE V2.

JPBergsma commented 1 year ago

I am not sure if any of the current clients are using page_above or page_below, but if they do, it would indeed be a breaking change.
Perhaps we can still clarify the example by throwing some characters in the id. Other than that, we do not need to take any action here.

Materials-Consortia / OPTIMADE

Add special response/error for case where deep pagination is not accessible #458