Thanks for putting this together! I think a new ThrottledApplication concept which enforces result page size and page depth is a great idea. The approaches to avoiding breaking existing integrations also sound good.
It sounds like this will require a few issues. Do you think it warrants a dedicated project or milestone?
I think it only requires one issue for implementation, with a follow-up issue of non-code work to check on the monitoring.
The Django admin changes should be minimal, just configuring the new field to be editable.
We already have facilities for adding warnings to searches, which are the only relevant requests for this work. Likewise, we have established logging practices, adding a new log line is trivial (though we'll want to be smart about the properties of that line).
Adding a note to the documentation doesn't seem like a separate issue either.
Please let me know if I've missed something you think would warrant a separate issue.
Nope, thanks @sarayourfriend
Great, I'll update the PR description with specific implementation details.
Updated with details for how to implement. I'll go ahead and run a logs query in production to see if I can identify any requesters using these large query parameters that should be able to continue doing so (I'm pretty sceptical!).
Thanks for bringing all this information together, and for sharing the queries made to gather this information - that's awesome!
One question about page depth: I do sometimes find myself going past 5 pages in the frontend when searching for images. However, 5 pages at the default of 20 is only 100 images. Should we consider the anonymous limit as a total image limit, rather than discrete page size/page depth limits? Or maybe setting the max page depth to 10 would be the best compromise, especially given how many of the high-requesters you posted above are above 5 for max page depth.
Limiting max result depth, in addition to page size, rather than max page number, sounds good to me! It also simplifies the rationale of the limit, at least to me: we're limiting to 250 results, rather than, technically, a variety of limits that depend on the page size used.
Combined with rate limiting, sounds good to me!
Problem
While the Openverse ToS disallows scraping, our current authentication scheme and API parameter management are not conducive to preventing the easiest classes of abuse. Right now it is trivial for someone to authenticate at the most basic level and scrape the Openverse API, and even banning a single client application is easy to get around.
Description
My proposed solution is to reduce the allowance of page size and page depth for the default authentication scope.
To do this, add a field to `ThrottledApplication`, similar to `rate_limit_model`, named `privileges`[^alternative-names]. The field should be an `ArrayField` with a `base_field` of `CharField` with `choices` set to a list of privileges. The choices to start with should be `increased_page_size` and `increased_query_depth`. The default of the field should be an empty list (i.e., no special privileges).

[^alternative-names]: I also considered `api_scopes`, `query_scope`, or `parameter_allowances` but rejected them. `api_scopes` has too much potential for confusion with OAuth scopes, which isn't really the question here. The latter two have an inherently narrow application to just the search query. This field could represent access to other non-search endpoints, so a non-search-specific name is more appropriate.

Make the field editable in the Django admin so maintainers can add or remove privileges for individual client apps.
Once the new field is available, update the serializer to check the relevant entries in the new field. If the request parameters exceed any given privilege, add a warning to the request following the example of the source parameter warning:
https://github.com/WordPress/openverse/blob/7dd355000f4d1d4bd597e3256b1f804d43f13f18/api/api/serializers/media_serializers.py#L407-L417
The warning text should link to a section of the API documentation that explains the privileges and how to request increased access. This would be new documentation, and the implementing PR must add it. Unauthenticated requests should continue to return an error response as they do today.
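For illustration, a minimal sketch of the kind of check this implies, assuming a helper with access to the application's `privileges` list and an `add_warning` callable for attaching warnings to the response; the constants and names are placeholders, not existing Openverse API symbols:

```python
# Hypothetical limits: 50 is the default proposed below for authenticated apps,
# and 500 reflects the largest page size seen in the logs analysis.
DEFAULT_MAX_PAGE_SIZE = 50
INCREASED_MAX_PAGE_SIZE = 500


def resolve_page_size(requested: int, privileges: list[str], add_warning) -> int:
    """Return the page size to use, warning when the request exceeds its privileges."""
    allowed = (
        INCREASED_MAX_PAGE_SIZE
        if "increased_page_size" in privileges
        else DEFAULT_MAX_PAGE_SIZE
    )
    if requested > allowed:
        add_warning(
            "page_size exceeds the maximum allowed for this application and has "
            "been clamped. See the API documentation for how to request "
            "increased access."
        )
        return allowed
    return requested
```

The same pattern would apply to `page` with the `increased_query_depth` privilege.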
We also need to evaluate existing usage patterns and ensure that any integrations we know to be safe, and that rely on these higher-trust features, will continue to work. We should also ensure that existing requests, regardless of trust level, do not 403. Instead, I propose that we clamp parameters to the maximum value of the application's scope and, along with the warning mentioned above, monitor the frequency of requests exceeding granted privileges on a per-application basis. A new log line for each request that exceeds its privileges, including the ungranted privilege and the client application name, would be sufficient for analysis after deployment.
When a request exceeds the privileges, clamp the parameter down to the anonymous maximum.
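As a sketch of what that log line might look like (the message and field names here are assumptions for illustration, not an existing Openverse log schema):

```python
import logging

logger = logging.getLogger(__name__)


def log_exceeded_privilege(application_name: str, privilege: str, parameter: str) -> None:
    """Record a request that exceeded its granted privileges for per-application analysis."""
    logger.info(
        "Request exceeded granted privileges",
        extra={
            "application_name": application_name,
            "ungranted_privilege": privilege,  # e.g. "increased_page_size"
            "parameter": parameter,  # e.g. "page_size"
        },
    )
```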
Specifics for each parameter
Based on the table below, I've made suggestions for the behaviour of each specific parameter.
The table was pulled from a logs analysis using this query, analysing the last month of API nginx logs for paginated requests:

```
filter @logStream like "nginx" | parse request /(?
```

For privacy, I've passed the client application names through sha1.
Of these, the two with a maxPageSize of 500 are known to be unwanted requesters, and almost certainly scrapers.
page_size
Authenticated requesters should get a default maximum of 50. This is based on usage from known integrations: the integrations we are aware of and have relationships with cap out their page size at 50. There is one integration that made over 100 requests in the last month with a page size up to 50. I don't see a problem with that, and 50 seems as good a number as any for the somewhat arbitrary cut-off of default privileges granted to authenticated requests.
page
If page size maxes out at 50, a page maximum of 20 would allow 1,000 images. I don't necessarily feel that's excessive, but I think we should reduce the default privilege to max out at 10 pages, if not 5. Most requesters do not go above 5, and all requesters should be respecting the `page_count` parameter on responses anyway, so reducing the maximum will not cause integration problems. Openverse search relevancy beyond the first few pages is not that useful in most cases anyway.
I ran another query to get the average page, and most are below 5 on average, even for large integrations.
I propose we start with 5 given the average case will not be affected, and given what we know about Openverse search relevancy. That creates a maximum result count of 250.
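Expressed as illustrative constants (the names are placeholders; the numbers come from this discussion), the default limits and the resulting cap would be:

```python
# Default limits for authenticated applications without extra privileges.
DEFAULT_MAX_PAGE_SIZE = 50   # per-page result cap
DEFAULT_MAX_PAGE_DEPTH = 5   # deepest page reachable without increased_query_depth

# Together these cap the reachable result set at 50 * 5 = 250 results.
DEFAULT_MAX_RESULTS = DEFAULT_MAX_PAGE_SIZE * DEFAULT_MAX_PAGE_DEPTH
assert DEFAULT_MAX_RESULTS == 250
```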
Alternatives
Continue to have extremely broad and generous API allowances that make ToS abuse trivial and impossible to reasonably regulate. We can just "ignore" this issue. If we do, I question whether our ToS would ever be enforceable or how transparent our application of the policy really could ever be.