feat(jobs): pagination

aldbr commented 3 months ago

Here I provide a first implementation of the pagination mechanism for the jobs, mostly based on https://github.com/DIRACGrid/diracx/pull/6.

As I explored different examples, it became clear to me that there is no one-size-fits-all solution for pagination. Instead, various implementation possibilities exist, each offering unique approaches at different stages.

Pagination Strategies:

I considered two primary pagination strategies:

Cursor-based pagination relies on opaque tokens (cursors) to navigate through results. Clients provide a cursor to fetch the next set of results, which makes it suitable for real-time data and keeps pagination stable when the dataset changes. Such a strategy requires additional logic to manage cursors and handle edge cases, and it does not make it easy to jump to a specific page.
Page-based pagination divides results into fixed-size pages, allowing clients to request specific pages using page numbers. This approach is simpler to implement but may suffer from inconsistencies if the dataset changes frequently. It is also less efficient for large datasets, since it is typically implemented with an offset and the database has to step over all the skipped rows to reach a deep page.
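
To make the difference between the two strategies concrete, here is a minimal sketch on an in-memory list. The `JOBS` list, the `job_id` key, and the `get_page` / `get_after` helpers are all hypothetical and only stand in for a real query; this is not the diracx implementation.

```python
from __future__ import annotations

# Hypothetical in-memory stand-in for a jobs table, ordered by primary key.
JOBS = [{"job_id": i} for i in range(1, 26)]


def get_page(jobs: list[dict], page: int, per_page: int) -> list[dict]:
    """Page-based: skip (page - 1) * per_page rows, then take per_page rows."""
    start = (page - 1) * per_page
    return jobs[start : start + per_page]


def get_after(jobs: list[dict], after: int | None, per_page: int) -> tuple[list[dict], int | None]:
    """Cursor-based: take per_page rows whose key is greater than the cursor."""
    rows = [job for job in jobs if after is None or job["job_id"] > after][:per_page]
    # A full batch may have more rows behind it, so hand back a cursor.
    next_cursor = rows[-1]["job_id"] if len(rows) == per_page else None
    return rows, next_cursor


# Simulate a job being deleted between two requests.
first_page = get_page(JOBS, page=1, per_page=10)  # ids 1..10
remaining = [job for job in JOBS if job["job_id"] != 1]
second_page = get_page(remaining, page=2, per_page=10)  # starts at id 12, id 11 is skipped
second_batch, cursor = get_after(remaining, after=10, per_page=10)  # starts at id 11
```

With the offset approach, removing one row from the first page shifts every later row, so job 11 is never returned. The cursor variant resumes from the last key it saw and is not affected, at the cost of not being able to jump straight to an arbitrary page.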

My opinion: we should prioritize simplicity for the time being. Our primary requirement is to retrieve the last 10, 100, or 1,000 jobs. Occasionally, it's useful to jump to a different page, such as when checking if a particular issue has occurred before. While there may be some cases where we need to fetch a large number of jobs at once, such instances are rare. Therefore, minor inconsistencies should not pose a significant problem. I would rather choose the page-based pagination.

Metadata Conveyance:

Conveying metadata to clients is essential for effective navigation as we return partial results. Common methods include Web Linking, Content-Range headers, and embedding metadata within the JSON response.

Web Linking (used by GitHub): https://datatracker.ietf.org/doc/html/rfc8288. This method provides links to the first, previous, current, next, and last pages in the Link header. While convenient for navigation, it requires clients to parse HTTP headers.

Link: <https://dirac/api/jobs?page=2&per_page=100>; rel="prev", <https://dirac/api/jobs?page=4&per_page=100>; rel="next", <https://dirac/api/jobs?page=515&per_page=100>; rel="last", <https://dirac/api/jobs?page=1&per_page=100>; rel="first"
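
As an illustration of the client-side parsing this implies, here is a rough sketch using only the standard library. The `parse_link_header` helper is hypothetical, and the regular expression only handles the simple `<url>; rel="name"` form shown above, not every variant allowed by RFC 8288.

```python
import re

# Matches one `<url>; rel="name"` entry of a Link header.
LINK_ENTRY = re.compile(r'<([^>]*)>\s*;\s*rel="([^"]*)"')


def parse_link_header(value: str) -> dict:
    """Map each rel name ("next", "prev", ...) to its target URL."""
    return {rel: url for url, rel in LINK_ENTRY.findall(value)}


header = (
    '<https://dirac/api/jobs?page=2&per_page=100>; rel="prev", '
    '<https://dirac/api/jobs?page=4&per_page=100>; rel="next", '
    '<https://dirac/api/jobs?page=515&per_page=100>; rel="last", '
    '<https://dirac/api/jobs?page=1&per_page=100>; rel="first"'
)
links = parse_link_header(header)
```

A client iterating over all pages would then simply follow `links.get("next")` until it is absent.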

Content-Range Header: https://www.rfc-editor.org/rfc/rfc9110.html#section-14.4. This approach is similar to Web Linking but much more compact: the first and last items, along with the total number of items, are provided through the Content-Range header.

Content-Range: <unit> <first item>-<last item>/<total>
Content-Range: jobs 1-10/100
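
A possible way to produce and consume such a header is sketched below. The helper names are hypothetical, and the 1-based item numbering is an assumption taken from the `jobs 1-10/100` example above; an empty page is not handled.

```python
import re

CONTENT_RANGE = re.compile(r"^(\w+) (\d+)-(\d+)/(\d+)$")


def page_to_range(page: int, per_page: int, total: int) -> tuple:
    """Return the 1-based (first, last) item numbers covered by a page."""
    first = (page - 1) * per_page + 1
    last = min(page * per_page, total)
    return first, last


def format_content_range(unit: str, first: int, last: int, total: int) -> str:
    """Server side: build the header value."""
    return f"{unit} {first}-{last}/{total}"


def parse_content_range(value: str) -> tuple:
    """Client side: split the header value into (unit, first, last, total)."""
    match = CONTENT_RANGE.match(value)
    if match is None:
        raise ValueError(f"Unexpected Content-Range value: {value!r}")
    unit, first, last, total = match.groups()
    return unit, int(first), int(last), int(total)


first, last = page_to_range(page=1, per_page=10, total=100)
header_value = format_content_range("jobs", first, last, 100)
```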

Envelope in the JSON: https://jsonapi.org/format/#fetching-pagination. This method involves directly including metadata along with the data in the response. While this ensures that all clients have easy access to pagination details, it also increases the complexity of the payload.
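
For comparison, an envelope could look roughly like the sketch below. The field names (`data`, `meta`, `page`, `per_page`, `total`, `last_page`) are illustrative assumptions, not something prescribed by the PR or copied from the JSON:API specification.

```python
import json


def make_envelope(items: list, page: int, per_page: int, total: int) -> dict:
    """Wrap a page of results together with its pagination metadata."""
    last_page = max(1, -(-total // per_page))  # ceiling division
    return {
        "data": items,
        "meta": {
            "page": page,
            "per_page": per_page,
            "total": total,
            "last_page": last_page,
        },
    }


envelope = make_envelope([{"job_id": 201}], page=3, per_page=100, total=51423)
body = json.dumps(envelope)
```

The client then has to unwrap `data` before using the results, which is the extra parsing step on the payload side.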

My opinion: I initially implemented the Content-Range approach based on https://github.com/DIRACGrid/diracx/pull/6. However, I believe that web linking could also be beneficial as it would fit perfectly with the pagination parameters. Including metadata directly in the JSON is straightforward, but it would require additional parsing.

aldbr commented 3 months ago

After trying to link the pages to diracx-web, I realized that page-based pagination combined with the Content-Range header was enough for such a use case.

Now for the agents needing to fetch a large number of items while guaranteeing consistency, we could still tweak the per-page parameter by setting a very large number. There are a few cons of course: it would take a few seconds to fetch a large number of items and we would need to make sure that per-page is large enough to cover all the needed items.

UPDATE: As a counter-argument, now that we can sort items using any parameter in any order, pagination is not as essential as it is within the current DIRAC implementation. Instead of going to the last page to examine old items, we could just sort the items differently (and thus rely on a cursor-based pagination).

fstagni commented 3 months ago

> I would rather choose the page-based pagination.

On a first look, I agree.

chrisburr commented 3 months ago

I think we should use page-based pagination for most things but have the option of having a reliable pagination for critical operations. Perhaps we could make it simpler by having the option of:

passing `after` instead of `page`
`after` is only supported if sorting by the primary key of the table, to keep the implementation simple
the recommended way of iterating programmatically would be with Web Linking, so the use of `after` becomes an implementation detail

For inspiration:

aldbr commented 3 months ago

> I think we should use page-based pagination for most things but have the option of having a reliable pagination for critical operations. Perhaps we could make it simpler by having the option of:
>
> passing `after` instead of `page`
>
> `after` is only supported if sorting by the primary key of the table to keep the implementation simple

@chrisburr what you are describing is a (simple) cursor-based pagination if I understand correctly. Do you suggest we should implement both methods (at least for the critical operations)?

> the recommended way of iterating programmatically would be with Web Linking so the use of `after` becomes an implementation detail

Indeed, if we go with cursor-based pagination, this would be a nice way of iterating through the previous and next pages without including the cursor within the JSON result.
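
To picture how the cursor could stay hidden behind Web Linking, here is a small sketch of how a server might build the `rel="next"` entry from the last row it returned. The `next_link` helper, the `JobID` key, and the `after` / `per_page` query parameter names are assumptions for illustration only.

```python
from __future__ import annotations

from urllib.parse import urlencode


def next_link(base_url: str, rows: list[dict], per_page: int, key: str = "JobID") -> str | None:
    """Build a Link header entry pointing at the rows after the last one returned."""
    if len(rows) < per_page:
        # A short batch means there is nothing left to fetch.
        return None
    query = urlencode({"after": rows[-1][key], "per_page": per_page})
    return f'<{base_url}?{query}>; rel="next"'


link = next_link("https://dirac/api/jobs", [{"JobID": 1}, {"JobID": 2}], per_page=2)
```

A client that only follows `rel="next"` never has to know whether the URL carries `page` or `after`.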

chaen commented 3 months ago

@chrisburr what would be the use case? I can hardly imagine any use case other than getting the first N elements, so basically the first page with a given length. Also, we have the problem that we may not have a stable order, so I don't think it is achievable. If we have such a need for a very specific case, then we may want to add that behavior where it's needed. But having it in a generic way seems out of reach and not worth it to me.

chaen commented 3 months ago

@chrisburr and I had a chat and I think we are on the same page: go for the page-based approach.

aldbr commented 3 months ago

Alright, then you can start reviewing the PR

DIRACGrid / diracx

feat(jobs): pagination #243
