Investigate using CursorPagination for big (or all) tables in the API

mlissner commented 5 years ago

We have two problems with our API right now performance wise (that I want to investigate here):

Looking briefly at our performance highlights in AWS, it looks like our API requests spend a lot of time doing SELECT Count(*) queries. These queries are needed to do pagination of the API so you can know how many pages there are.
Doing deep pagination (onto page 1,000, say) causes performance degradation. We currently have a warning about this in the docs:

When doing deep crawls of the data, going to very high page numbers will incur increasingly bad performance. This is common in databases because to go to a high page number means sorting the entire data set, then counting to the correct location. Page 50 isn't a big deal. Page 2,000 starts getting slower.

The solution to both of these issues appears to be to use cursor-based pagination, as described here:

https://www.django-rest-framework.org/api-guide/pagination/#cursorpagination

There are a few non-backwards compatible limitations though:

It only supports fields that don't change and are unique, so some of our ordering fields won't be possible.
You can't go to an arbitrary page number of the results, and can only go to the next or previous page.
Obviously, perhaps: Our current page numbering scheme (&page=xx) won't work anymore.

So I think this is probably worth doing, but probably worth calling API version 4.0 (we're at 3.7 now). I want to do some more investigation to see if there's a way we could enable more than one pagination type for an endpoint, as a way of phasing out the old version, but I'm not sure how that'd work.

mlissner commented 5 years ago

OK, turns out this doesn't use DB cursors, but uses DRF "cursors", that work like this: https://use-the-index-luke.com/no-offset

mlissner commented 1 year ago

@albertisfu, when we do the API v4, I think this is probably important to do as well.

mlissner commented 7 months ago

Well, the docket-entries and recap-documents API endpoints are now so slow you can't query their root URLs and they only work when filtered. Good thing we're working on API v4.

albertisfu commented 5 months ago

I've been testing cursor pagination for its use in DB-based endpoints. Here are my findings so we can decide how to implement it.

The DRF CursorPagination class works only with one or a combination of sorting keys. These fields are used to navigate across documents, ensuring consistent pagination.
To get all the benefits of cursor pagination, such as performance and avoiding duplicated items, it is required to use a unique or almost unique field for sorting. This can be the document ID or the date_created field.
It is possible to use non-unique fields for sorting; however, consistency across pagination is not guaranteed. When using these fields, an offset is introduced to the cursor to handle documents with the same sorting field, but this offset doesn't guarantee that a document won't be duplicated on a previous or next page.
I found that None fields don't work with cursor pagination. For example, date_filed can be None in dockets.
- When sorting by date_filed desc, None values are shown first, similar to regular pagination.
- Since None values are treated as equal, forward pagination uses an offset to navigate forward. However, if you try to go back to the previous page, the query fails because it can't handle date_filed__gt: None, which is used in backward pagination.
- When sorting by date_filed asc, older dockets are shown first, which is correct. But when navigating forward, the latest object with a unique date_filed is used to retrieve the next page. A query like date_filed__gt: '2024-05-21' filters out all dockets with date_filed=None, so only dockets with a date_filed are retrieved in results.

I tested some possible workarounds to handle this problem.

We can exclude all dockets with a None date_filed (or other nullable sorting fields) by default. However, I don't think it is viable to exclude this content.
Another option is to obfuscate None values in dockets with a value that can be handled for sorting. I tried:

queryset = (
        Docket.objects.annotate(
            date_filed_non_null=Coalesce('date_filed', Value('9999-12-31', output_field=DateField()))
        ).select_related(
            "court",
            "assigned_to",
            "referred_to",
            "originating_court_information",
            "idb_data",
        )
        .prefetch_related("panel", "clusters", "audio_files", "tags")
    )

In this case, if the date_filed is None, it returns 9999-12-31, which will keep the behavior of showing dockets with None values first when sorting DESC and at the end when sorting ASC. However, there is a problem when going backwards. Since the actual values (None) are displayed in the results, the cursor pagination can't handle proper backward navigation because it looks for objects with date_filed < 9999-12-31, so all the objects with None date_filed from previous pages are missed. It is possible to handle this problem by rendering the obfuscated date_filed in results or overriding the CursorPagination class to handle those values internally without rendering the actual value. However, even with this workaround, pagination won't be 100% consistent when navigating objects with None values since pagination will be based on an offset.

In brief:

Initially, it will be easy to implement cursor pagination for all the DB-based endpoints by only adding support for a couple of sorting keys: ID and date_created.
Adding cursor pagination support for other keys is more complex since each model's keys should be revised and a fix like the ones proposed above applied. For instance, in Dockets:

"date_created",
"date_modified",
"date_blocked",
"date_filed",
"date_terminated",
"date_last_filing",

All of these sorting keys can be nullable. For each model, we'd need to check which fields can be nullable and apply a workaround.

@mlissner What do you think?

Additionally, regarding the V4 migration on these endpoints:

We want to keep V3 working exactly the same for all DB-based endpoints so users can decide whether to move to V4 or continue using V3, correct?
I was thinking of using middleware that checks the API requested version and changes the default pagination class according to the API version (normal for V3 and cursor pagination for V4). This way, we might avoid creating new Viewsets for V4. However, this will depend on our decision about cursor pagination and if it can be handled via middleware without affecting V3 behavior.

mlissner commented 5 months ago

Hm, well, all of these solutions aren't that great. Two thoughts:

Is it possible to add a second sort parameter to the pagination so that when it's null it uses the second one? We could use id for that, right?
If that's not possible, what about just using cursor pagination on some of the fields while maintaining the current ones? Does DRF allow multiple types of pagination on a view?

I like the idea of a middleware. Nice idea.

albertisfu commented 5 months ago

Is it possible to add a second sort parameter to the pagination so that when it's null it uses the second one? We could use id for that, right?

Yeah, this is possible and actually this is mandatory for sorting keys that are not unique. All the tests I performed here, including the main sorting key (e.g: date_filed) + ID

However, this doesn't solve the issue of null values in the main sorting key, since the null + ID value is always passed to the query, leading to the behavior described above.

If that's not possible, what about just using cursor pagination on some of the fields while maintaining the current ones? Does DRF allow multiple types of pagination on a view?

By default, DRF doesn't support this. However, I think we can achieve it by tweaking the view. If the sorting keys don't meet the requirements for cursor pagination, the normal pagination will be used. Cursor pagination can only be used on unique non-null values, which I believe can only apply to the ID and date_created sorting. All other sorting fields will use normal pagination.

mlissner commented 5 months ago

That sounds like a decent solution to me. I think filtering and deep pagination are more important than sorting, so if only some of our fields can be used to do deep pagination, we can just document that and that should work well for folks.

mlissner commented 4 months ago

Fair to close this now, @albertisfu?

albertisfu commented 4 months ago

Yeah, closing it!

freelawproject / courtlistener

Investigate using CursorPagination for big (or all) tables in the API #930