freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com
Other
546 stars 151 forks source link

Investigate using CursorPagination for big (or all) tables in the API #930

Closed mlissner closed 4 months ago

mlissner commented 5 years ago

We have two problems with our API right now performance wise (that I want to investigate here):

  1. Looking briefly at our performance highlights in AWS, it looks like our API requests spend a lot of time doing SELECT Count(*) queries. These queries are needed to do pagination of the API so you can know how many pages there are.

  2. Doing deep pagination (onto page 1,000, say) causes performance degradation. We currently have a warning about this in the docs:

    When doing deep crawls of the data, going to very high page numbers will incur increasingly bad performance. This is common in databases because to go to a high page number means sorting the entire data set, then counting to the correct location. Page 50 isn't a big deal. Page 2,000 starts getting slower.

The solution to both of these issues appears to be to use cursor-based pagination, as described here:

https://www.django-rest-framework.org/api-guide/pagination/#cursorpagination

There are a few non-backwards compatible limitations though:

  1. It only supports fields that don't change and are unique, so some of our ordering fields won't be possible.

  2. You can't go to an arbitrary page number of the results, and can only go to the next or previous page.

  3. Obviously, perhaps: Our current page numbering scheme (&page=xx) won't work anymore.

So I think this is probably worth doing, but probably worth calling API version 4.0 (we're at 3.7 now). I want to do some more investigation to see if there's a way we could enable more than one pagination type for an endpoint, as a way of phasing out the old version, but I'm not sure how that'd work.

mlissner commented 5 years ago

OK, turns out this doesn't use DB cursors, but uses DRF "cursors", that work like this: https://use-the-index-luke.com/no-offset

mlissner commented 1 year ago

@albertisfu, when we do the API v4, I think this is probably important to do as well.

mlissner commented 7 months ago

Well, the docket-entries and recap-documents API endpoints are now so slow you can't query their root URLs and they only work when filtered. Good thing we're working on API v4.

albertisfu commented 5 months ago

I've been testing cursor pagination for its use in DB-based endpoints. Here are my findings so we can decide how to implement it.

I tested some possible workarounds to handle this problem.

queryset = (
        Docket.objects.annotate(
            date_filed_non_null=Coalesce('date_filed', Value('9999-12-31', output_field=DateField()))
        ).select_related(
            "court",
            "assigned_to",
            "referred_to",
            "originating_court_information",
            "idb_data",
        )
        .prefetch_related("panel", "clusters", "audio_files", "tags")
    )

In this case, if the date_filed is None, it returns 9999-12-31, which will keep the behavior of showing dockets with None values first when sorting DESC and at the end when sorting ASC. However, there is a problem when going backwards. Since the actual values (None) are displayed in the results, the cursor pagination can't handle proper backward navigation because it looks for objects with date_filed < 9999-12-31, so all the objects with None date_filed from previous pages are missed. It is possible to handle this problem by rendering the obfuscated date_filed in results or overriding the CursorPagination class to handle those values internally without rendering the actual value. However, even with this workaround, pagination won't be 100% consistent when navigating objects with None values since pagination will be based on an offset.

In brief:

"date_created",
"date_modified",
"date_blocked",
"date_filed",
"date_terminated",
"date_last_filing",

All of these sorting keys can be nullable. For each model, we'd need to check which fields can be nullable and apply a workaround.

@mlissner What do you think?

Additionally, regarding the V4 migration on these endpoints:

  1. We want to keep V3 working exactly the same for all DB-based endpoints so users can decide whether to move to V4 or continue using V3, correct?
  2. I was thinking of using middleware that checks the API requested version and changes the default pagination class according to the API version (normal for V3 and cursor pagination for V4). This way, we might avoid creating new Viewsets for V4. However, this will depend on our decision about cursor pagination and if it can be handled via middleware without affecting V3 behavior.
mlissner commented 5 months ago

Hm, well, all of these solutions aren't that great. Two thoughts:

  1. Is it possible to add a second sort parameter to the pagination so that when it's null it uses the second one? We could use id for that, right?

  2. If that's not possible, what about just using cursor pagination on some of the fields while maintaining the current ones? Does DRF allow multiple types of pagination on a view?

I like the idea of a middleware. Nice idea.

albertisfu commented 5 months ago

Is it possible to add a second sort parameter to the pagination so that when it's null it uses the second one? We could use id for that, right?

Yeah, this is possible and actually this is mandatory for sorting keys that are not unique. All the tests I performed here, including the main sorting key (e.g: date_filed) + ID

However, this doesn't solve the issue of null values in the main sorting key, since the null + ID value is always passed to the query, leading to the behavior described above.

If that's not possible, what about just using cursor pagination on some of the fields while maintaining the current ones? Does DRF allow multiple types of pagination on a view?

By default, DRF doesn't support this. However, I think we can achieve it by tweaking the view. If the sorting keys don't meet the requirements for cursor pagination, the normal pagination will be used. Cursor pagination can only be used on unique non-null values, which I believe can only apply to the ID and date_created sorting. All other sorting fields will use normal pagination.

mlissner commented 5 months ago

That sounds like a decent solution to me. I think filtering and deep pagination are more important than sorting, so if only some of our fields can be used to do deep pagination, we can just document that and that should work well for folks.

mlissner commented 4 months ago

Fair to close this now, @albertisfu?

albertisfu commented 4 months ago

Yeah, closing it!