ga4gh / task-execution-schemas

Apache License 2.0

Pagination support #193

Open JaeAeich opened 1 year ago

JaeAeich commented 1 year ago

The current implementation of the API responses lacks adequate support for comprehensive pagination. The API offers only a next_page_token as a cursor to navigate to the next page; to access the previous page, the client must cache the cursors it has already seen. Additionally, essential pagination elements, such as the total number of tasks and pages, are missing but crucial for effective pagination.
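To illustrate the point, here is a minimal sketch of the caching burden a forward-only cursor places on the client. Only next_page_token mirrors the actual TES ListTasks response; the PagingClient itself is hypothetical:

```python
# Hypothetical client wrapper around a forward-only cursor API.
# fetch_page is a callable: token -> (items, next_token), e.g. a thin
# wrapper over GET /tasks?page_token=...

class PagingClient:
    def __init__(self, fetch_page):
        self.fetch_page = fetch_page
        self.history = []          # tokens of pages already fetched
        self.current_token = None  # token for the next page to fetch

    def next(self):
        items, next_token = self.fetch_page(self.current_token)
        self.history.append(self.current_token)
        self.current_token = next_token
        return items

    def prev(self):
        # The API offers no prev_page_token, so the client must replay
        # a token it cached earlier -- the caching objected to above.
        if len(self.history) < 2:
            raise IndexError("already at the first page")
        self.history.pop()                       # token of the page shown now
        self.current_token = self.history.pop()  # token of the page before it
        return self.next()
```

With a prev_page_token in the response, `history` and the whole replay dance would disappear.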

uniqueg commented 11 months ago

Strongly agree with this. The current pagination is really only useful for limiting the number of tasks returned by GET /tasks. For effective pagination in a GUI client it's not very helpful.

Perhaps we could look at the GA4GH pagination style guide that is currently being drafted by TASC in response to this issue: https://github.com/ga4gh/TASC/issues/29. It lists a couple of solutions and discusses their pros and cons.

JaeAeich commented 8 months ago

It'd be nice to see pagination where the user can navigate in both directions and jump to arbitrary pages.

Maybe adding an extra cursor page_token_prev and a page_num field could be a good place to start. The API should also allow users to fetch a specific page directly, so that a user can, let's say, jump to page 4, skipping all the pages in between. Apart from that, total_results and total_page_number would also be essential, as they would allow the user to jump straight to the last page.
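For concreteness, a response carrying the proposed fields might look like the sketch below. None of these field names exist in the current TES spec; they are the suggestions from this comment:

```python
import math

# Hypothetical extended ListTasks response with the proposed fields.
response = {
    "tasks": ["..."],
    "next_page_token": "tok-5",
    "page_token_prev": "tok-3",  # proposed: cursor for the previous page
    "page_num": 4,               # proposed: 1-based index of this page
    "total_results": 100,        # proposed: size of the full result set
}

# total_page_number is derivable from total_results and the page size,
# so it could be computed by either side:
page_size = 25
total_page_number = math.ceil(response["total_results"] / page_size)
```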

uniqueg commented 2 weeks ago

@andrewyatz: Do you think these use cases are covered in the TASC pagination guide?

andrewyatz commented 2 weeks ago

The pagination guide would suggest not using tokens, for exactly these reasons, and instead paging using counts/offsets. Tokens have been written up as the preferred option for streaming; it's not that you cannot replay paginated streams (as in Kafka), but whether you can is implementation specific.

I'd like to know if this is a "good" solution to the above
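Under a counts/offsets scheme, any page is addressable without cursors. A minimal sketch of the client side (the parameter names page_size/offset are illustrative, not taken from the TES spec):

```python
# Counts/offsets paging: a 1-based page number maps directly to query
# parameters, so a client can jump straight to page 4 without walking
# through intermediate cursors.
def page_params(page_num: int, page_size: int) -> dict:
    return {"page_size": page_size, "offset": (page_num - 1) * page_size}
```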

uniqueg commented 2 weeks ago

Thanks @andrewyatz.

@JaeAeich: Would either of the two solutions described in the GA4GH pagination guide draft address your use case? If not, please leave feedback :pray:

JaeAeich commented 2 weeks ago

> When datasets may change between requests, clients SHOULD be tolerant of skipped or redundant entries between pages and servers SHOULD consider pagination stability requirements when choosing to offer a page-based pagination scheme.

How would the client be tolerant of data it's unaware of? Let's say I fetched page n, and while I'm fetching page n + 1 the data set changes and an entry gets added to page n; this means page n + 1 will contain a redundant entry that my client has already seen on page n. We can work around this by implementing some kind of cache invalidation on the client, but I think this should be the server's job.
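The client-side workaround described above could be sketched as follows, assuming each entry carries an `id` (as TES tasks do):

```python
# Tolerate redundant entries across pages by tracking IDs already seen.
# This is the client-side cache the comment argues should not be needed.
def dedupe_pages(pages):
    """Yield tasks across an iterable of pages, skipping repeats."""
    seen = set()
    for page in pages:
        for task in page:
            if task["id"] in seen:
                continue  # entry shifted onto this page by an upstream insert
            seen.add(task["id"])
            yield task
```

Note this only handles redundant entries; skipped entries (an item deleted between requests) are invisible to the client and cannot be recovered this way.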

Page based

> total (optional) - The total number of results available in the result set

Clients typically show the range of "pages" available to a user, i.e. if I request 5 items per page out of, say, 100 total, I expect 100 / 5 = 20 "pages", and when on page n the paginator shows something like 1, 2 ... ->n<- ... 19, 20. To render this, I think the client would need the total number of items in this approach. The client can then calculate the offset and paginate properly.
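For illustration, here is one way a GUI could derive that abbreviated page strip from a server-provided `total`; the function and its exact behavior are assumptions, not part of any spec:

```python
import math

# Build the "1, 2 ... n-1, n, n+1 ... 19, 20" strip a paginator renders,
# given the total result count, the client's page size, and the current page.
def page_strip(total, page_size, current):
    last = math.ceil(total / page_size)
    pages = sorted({1, 2, last - 1, last, current - 1, current, current + 1})
    pages = [p for p in pages if 1 <= p <= last]
    out = []
    for p in pages:
        if out and isinstance(out[-1], int) and p - out[-1] > 1:
            out.append("...")  # collapse the gap between shown page numbers
        out.append(p)
    return out
```

Everything here hinges on `total` (or an equivalent) being present in the response; without it the client cannot know where the strip ends.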

> Servers MAY return less than the page_size, even if it is not the last page

I understand this might not be a problem for a GUI, as the requested page_size would never exceed 100, but just out of curiosity... why not wait until the page_size threshold is met, and return an error if doing so is too resource intensive?

Token based

The response doesn't have a prev_page_token, so the client would have to cache these tokens, which shouldn't be its job.


Use case visualization

[image attachment]

PS: I think page-based pagination would be preferred, but it would require breaking changes in cloud-component.

andrewyatz commented 2 weeks ago

Thanks for the insights here. It might be better to move some of these to the TASC repo, but taking the points in turn:

Data tolerance: This is a key issue in any pagination scheme. When researching this I found that many technologies attempt to use temporal pagination, or paginate through a result cached server side, to avoid the issue altogether. I think the impact depends on how much you can or cannot tolerate repeats in data responses, and it does not feel like the pagination guide's place to dictate this.

Pages available: I didn't see anything in the examples we reviewed that provided a page count, and as you said it's typically calculated by the client, so we didn't include it.

prev_page_token: The availability of that token would depend on the state the server keeps. The server might have no idea what the previous page was, which is why the token was omitted from the recommended returns. If the token were predictable this would be a different matter, but we pushed for obfuscated tokens (which could, of course, be based on known data).

Do you think this is an issue?

uniqueg commented 2 weeks ago

I agree that the first point should be left to implementers and that the second point is unusual and can easily be calculated by the client from the total number of records and the page size.

I also agree that a token for the previous page may or may not be useful, sensible, or possible for a given implementation, and so should not be recommended as required. It could, however, be mentioned in a note or as an optional extension (it would be cool if all the implementations that make use of it used it in the same manner).

andrewyatz commented 2 weeks ago

This is great, thank you. I've added these suggestions. Please see what you think, especially as your comments are about promoting consistency. Really appreciate this.