Closed mlissner closed 4 months ago
OK, turns out this doesn't use DB cursors, but uses DRF "cursors", that work like this: https://use-the-index-luke.com/no-offset
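For anyone following along, here is a minimal sketch of the "no offset" idea from that article, using an in-memory SQLite table rather than our real models. The table name, columns, and helper functions are made up for illustration; this is not how DRF implements it internally.

```python
import sqlite3

# Hypothetical toy table standing in for a real model.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docket (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO docket (id, name) VALUES (?, ?)",
    [(i, f"docket-{i}") for i in range(1, 101)],
)


def offset_page(page, page_size=10):
    # OFFSET pagination: the database still has to walk past every skipped row.
    return conn.execute(
        "SELECT id FROM docket ORDER BY id LIMIT ? OFFSET ?",
        (page_size, (page - 1) * page_size),
    ).fetchall()


def keyset_page(after_id=0, page_size=10):
    # Keyset ("cursor") pagination: seek past the last id the client has seen.
    return conn.execute(
        "SELECT id FROM docket WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    ).fetchall()


first = keyset_page()
second = keyset_page(after_id=first[-1][0])
```

Both approaches return the same rows here; the difference is that the keyset query can use the index on the sort key to jump straight to the right place, however deep the page is.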
@albertisfu, when we do the API v4, I think this is probably important to do as well.
Well, the docket-entries and recap-documents API endpoints are now so slow you can't query their root URLs and they only work when filtered. Good thing we're working on API v4.
I've been testing cursor pagination for its use in DB-based endpoints. Here are my findings so we can decide how to implement it.
- The CursorPagination class works only with one or a combination of sorting keys. These fields are used to navigate across documents, ensuring consistent pagination.
- Sorting keys need to be unique and non-null, such as ID or the date_created field.
- Fields that can be None don't work with cursor pagination. For example, date_filed can be None in dockets.
  - When sorting by date_filed desc, None values are shown first, similar to regular pagination. Since None values are treated as equal, forward pagination uses an offset to navigate forward. However, if you try to go back to the previous page, the query fails because it can't handle date_filed__gt: None, which is used in backward pagination.
  - When sorting by date_filed asc, older dockets are shown first, which is correct. But when navigating forward, the latest object with a unique date_filed is used to retrieve the next page. A query like date_filed__gt: '2024-05-21' filters out all dockets with date_filed=None, so only dockets with a date_filed are retrieved in results.

I tested some possible workarounds to handle this problem.
- Exclude objects with a None date_filed (or other nullable sorting fields) by default. However, I don't think it is viable to exclude this content.
- Replace None values in dockets with a value that can be handled for sorting. I tried:

    from django.db.models import DateField, Value
    from django.db.models.functions import Coalesce

    queryset = (
        # Annotate a sentinel date wherever date_filed is NULL.
        Docket.objects.annotate(
            date_filed_non_null=Coalesce(
                "date_filed", Value("9999-12-31", output_field=DateField())
            )
        )
        .select_related(
            "court",
            "assigned_to",
            "referred_to",
            "originating_court_information",
            "idb_data",
        )
        .prefetch_related("panel", "clusters", "audio_files", "tags")
    )

In this case, if date_filed is None, it returns 9999-12-31, which keeps the behavior of showing dockets with None values first when sorting DESC and at the end when sorting ASC. However, there is a problem when going backwards. Since the actual values (None) are displayed in the results, cursor pagination can't handle proper backward navigation because it looks for objects with date_filed < 9999-12-31, so all the objects with a None date_filed from previous pages are missed. It is possible to handle this problem by rendering the obfuscated date_filed in results or overriding the CursorPagination class to handle those values internally without rendering the actual value. However, even with this workaround, pagination won't be 100% consistent when navigating objects with None values since pagination will be based on an offset.
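The root of the None problem can be reproduced in plain SQL. This is a rough sketch against an in-memory SQLite table with made-up rows, not the DRF code path, and it deliberately says nothing about where NULLs sort, since that differs between databases.

```python
import sqlite3

# Hypothetical rows: two dockets have no date_filed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docket (id INTEGER PRIMARY KEY, date_filed TEXT)")
conn.executemany(
    "INSERT INTO docket (id, date_filed) VALUES (?, ?)",
    [
        (1, "2024-05-20"),
        (2, "2024-05-21"),
        (3, None),
        (4, "2024-05-22"),
        (5, None),
    ],
)

# Forward step from a cursor positioned at 2024-05-21: any comparison with
# NULL is never true, so rows 3 and 5 silently drop out of the results.
rows_after_cursor = conn.execute(
    "SELECT id FROM docket WHERE date_filed > ? ORDER BY date_filed, id",
    ("2024-05-21",),
).fetchall()

# A cursor position that is itself NULL matches nothing at all.
rows_after_null = conn.execute(
    "SELECT id FROM docket WHERE date_filed > ?",
    (None,),
).fetchall()
```

Here rows_after_cursor contains only docket 4 and rows_after_null is empty, which matches the two failure modes described above: dockets without a date_filed disappear when paging forward, and a None position can't be used to page at all.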
In brief:
- Cursor pagination is only reliable when sorting by ID and date_created.
- Sorting keys:
    "date_created",
    "date_modified",
    "date_blocked",
    "date_filed",
    "date_terminated",
    "date_last_filing",
All of these sorting keys can be nullable. For each model, we'd need to check which fields can be nullable and apply a workaround.
@mlissner What do you think?
Additionally, regarding the V4 migration on these endpoints:
Viewsets for V4. However, this will depend on our decision about cursor pagination and if it can be handled via middleware without affecting V3 behavior.

Hm, well, all of these solutions aren't that great. Two thoughts:
Is it possible to add a second sort parameter to the pagination so that when it's null it uses the second one? We could use id for that, right?
If that's not possible, what about just using cursor pagination on some of the fields while maintaining the current ones? Does DRF allow multiple types of pagination on a view?
I like the idea of a middleware. Nice idea.
> Is it possible to add a second sort parameter to the pagination so that when it's null it uses the second one? We could use id for that, right?
Yeah, this is possible, and it's actually mandatory for sorting keys that are not unique. All the tests I performed here included the main sorting key (e.g., date_filed) + ID.
However, this doesn't solve the issue of null values in the main sorting key, since the null + ID value is always passed to the query, leading to the behavior described above.
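To make that concrete, here is a small sketch of a generic keyset query with an id tie-breaker, again on a toy SQLite table with invented rows. It is a plain-SQL illustration of the idea, not what DRF's CursorPagination does internally (as noted above, DRF falls back to an offset for equal values).

```python
import sqlite3

# Hypothetical rows: three dockets share a date, one has no date_filed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docket (id INTEGER PRIMARY KEY, date_filed TEXT)")
conn.executemany(
    "INSERT INTO docket (id, date_filed) VALUES (?, ?)",
    [
        (1, "2024-05-21"),
        (2, "2024-05-21"),
        (3, "2024-05-21"),
        (4, "2024-05-22"),
        (5, None),
    ],
)


def next_page(cursor, page_size=2):
    # cursor is the (date_filed, id) pair of the last row already returned.
    last_date, last_id = cursor
    return conn.execute(
        "SELECT id, date_filed FROM docket "
        "WHERE date_filed > ? OR (date_filed = ? AND id > ?) "
        "ORDER BY date_filed, id LIMIT ?",
        (last_date, last_date, last_id, page_size),
    ).fetchall()


page_a = next_page(("2024-05-21", 1))  # ties on date_filed are resolved by id
page_b = next_page(("2024-05-21", 3))  # moves on to the next date
page_c = next_page((None, 5))          # a NULL position matches nothing
```

The id tie-breaker makes rows with equal dates page correctly, but docket 5 is never returned by any forward step, and a cursor whose date is None yields an empty page. So the second sort key fixes duplicates, not nulls.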
> If that's not possible, what about just using cursor pagination on some of the fields while maintaining the current ones? Does DRF allow multiple types of pagination on a view?
By default, DRF doesn't support this. However, I think we can achieve it by tweaking the view. If the sorting keys don't meet the requirements for cursor pagination, the normal pagination will be used. Cursor pagination can only be used on unique non-null values, which I believe can only apply to the ID and date_created sorting.
All other sorting fields will use normal pagination.
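The decision itself could look something like the sketch below. It is pure Python so it stays independent of how the view ends up being tweaked; the function name, the CURSOR_SAFE_ORDERINGS set, and the comma-separated ordering format are all assumptions for illustration.

```python
# Assumption: these are the only sort keys that are unique and never null.
CURSOR_SAFE_ORDERINGS = {"id", "date_created"}


def pagination_kind(ordering_param, default="id"):
    """Return "cursor" if every requested sort key is cursor-safe, else "page".

    ordering_param is assumed to be a comma-separated string such as
    "-date_created" or "date_filed,-id"; a leading "-" means descending.
    """
    raw = ordering_param or default
    keys = [key.strip().lstrip("-") for key in raw.split(",") if key.strip()]
    if keys and all(key in CURSOR_SAFE_ORDERINGS for key in keys):
        return "cursor"
    return "page"


kind_default = pagination_kind(None)             # falls back to "id"
kind_created = pagination_kind("-date_created")
kind_filed = pagination_kind("date_filed")
```

The view would then pick the cursor paginator when this returns "cursor" and the current page-number paginator otherwise; that wiring is left out here because it depends on how we override the view.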
That sounds like a decent solution to me. I think filtering and deep pagination are more important than sorting, so if only some of our fields can be used to do deep pagination, we can just document that and that should work well for folks.
Fair to close this now, @albertisfu?
Yeah, closing it!
We have two problems with our API right now performance-wise (that I want to investigate here):
- Looking briefly at our performance highlights in AWS, it looks like our API requests spend a lot of time doing SELECT Count(*) queries. These queries are needed to do pagination of the API so you can know how many pages there are.
- Doing deep pagination (onto page 1,000, say) causes performance degradation. We currently have a warning about this in the docs:
The solution to both of these issues appears to be to use cursor-based pagination, as described here:
https://www.django-rest-framework.org/api-guide/pagination/#cursorpagination
There are a few non-backwards compatible limitations though:
- It only supports fields that don't change and are unique, so some of our ordering fields won't be possible.
- You can't go to an arbitrary page number of the results, and can only go to the next or previous page.
- Obviously, perhaps: Our current page numbering scheme (&page=xx) won't work anymore.

So I think this is probably worth doing, but probably worth calling API version 4.0 (we're at 3.7 now). I want to do some more investigation to see if there's a way we could enable more than one pagination type for an endpoint, as a way of phasing out the old version, but I'm not sure how that'd work.
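As a rough illustration of why cursor-style pagination can drop the SELECT Count(*) query, the sketch below fetches one extra row to learn whether a next page exists. The table and helper name are invented for the example, and it runs on an in-memory SQLite database rather than our real schema.

```python
import sqlite3

# Hypothetical toy table with 25 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docket (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO docket (id) VALUES (?)", [(i,) for i in range(1, 26)])


def page_without_count(after_id=0, page_size=10):
    # Ask for one row more than the page size instead of counting everything.
    rows = conn.execute(
        "SELECT id FROM docket WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size + 1),
    ).fetchall()
    has_next = len(rows) > page_size
    return rows[:page_size], has_next


first_rows, first_has_next = page_without_count()
last_rows, last_has_next = page_without_count(after_id=20)
```

The trade-off is the one listed above: the client learns only whether there is a next page, not how many pages exist, so page-number links can't be generated.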