WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License

Reduce permissions of default authentication scope #4324

Closed sarayourfriend closed 5 months ago

sarayourfriend commented 5 months ago

Problem

While the Openverse ToS disallows scraping, our current authentication scheme and API parameter management are not conducive to preventing the easiest classes of abuse. Right now it is trivial for someone to authenticate at the most basic level and scrape the Openverse API, and even a ban on a single client application is easy to circumvent.

Description

My proposed solution is to reduce the allowed page size and page depth for the default authentication scope.

To do this, add a field to ThrottledApplication similar to rate_limit_model named privileges[^alternative-names]. The field should be an ArrayField with a base_field of CharField with choices set to a list of privileges. The choices to start with should be increased_page_size and increased_query_depth. The default of the field should be an empty list (i.e., no special privileges).

This data model gives us the flexibility to add or remove different types of privileges for a given ThrottledApplication, without needing to automatically opt everyone with any privilege into all privileges or to create a confusing hierarchy of combinations. If we used a boolean or a single CharField, then True would need to imply all expanded parameter allowances, or each individual choice would need to represent one or a mix of the choices. It's much cleaner to use the ArrayField. ThrottledApplication is pulled in full for every request with Bearer auth, so the list will be immediately available for the serializer to check permissions on the request.
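
A minimal sketch of what the field could look like, assuming the API's Postgres database (ArrayField comes from django.contrib.postgres); the field and choice names follow the proposal above, everything else is illustrative:

```python
from django.contrib.postgres.fields import ArrayField
from django.db import models


class ThrottledApplication(models.Model):
    # ... existing fields such as rate_limit_model ...

    # Privileges that expand the default parameter allowances for this application.
    PRIVILEGES = [
        ("increased_page_size", "Increased page size"),
        ("increased_query_depth", "Increased query depth"),
    ]

    privileges = ArrayField(
        base_field=models.CharField(max_length=50, choices=PRIVILEGES),
        default=list,  # no special privileges by default
        blank=True,
    )
```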

[^alternative-names]: I also considered api_scopes, query_scope, and parameter_allowances but rejected them. api_scopes has too much potential for confusion with OAuth scopes, which aren't really the question here. The latter two are inherently narrow, applying only to search queries; this field could represent access to other, non-search endpoints, so a non-search-specific name is more appropriate.

Make the field editable in Django admin so maintainers can add or remove privileges for individual client apps.
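
The admin change could be as small as adding the field to an admin class for ThrottledApplication; this is a hypothetical registration, since the real admin configuration may already exist and differ in its fields:

```python
from django.contrib import admin

from .models import ThrottledApplication  # wherever the model actually lives


@admin.register(ThrottledApplication)
class ThrottledApplicationAdmin(admin.ModelAdmin):
    list_display = ("name", "privileges")
    # Listing the field here (and not marking it read-only) makes it editable.
    fields = ("name", "privileges")
```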

Once the new field is available, update the serializer to check its relevant entries. If the request parameters exceed any granted privilege, add a warning to the request, following the example of the source parameter warning:

https://github.com/WordPress/openverse/blob/7dd355000f4d1d4bd597e3256b1f804d43f13f18/api/api/serializers/media_serializers.py#L407-L417

The warning text should link to a section of the API documentation that explains the privileges and how to request increased access. This would be new documentation, and the implementing PR must add it. Unauthenticated requests should continue to return an error response as they do today.
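
Roughly, the serializer check could look like the following. `add_warning` stands in for whatever mechanism the linked source-parameter warning uses, and the class name, constants, and attribute access on the auth token are illustrative assumptions, not the final implementation:

```python
from rest_framework import serializers

DEFAULT_MAX_PAGE_SIZE = 20  # assumed baseline allowance for the default scope


class MediaSearchRequestSerializer(serializers.Serializer):
    # ... existing fields and the existing warning machinery ...

    def validate_page_size(self, value):
        token = self.context["request"].auth
        application = getattr(token, "application", None)
        privileged = application is not None and "increased_page_size" in application.privileges

        if value > DEFAULT_MAX_PAGE_SIZE and not privileged:
            # Warn (don't reject) and point at the new privileges documentation.
            self.add_warning(
                "page_size exceeds the allowance for this authentication level. "
                "See the API documentation for how to request increased access."
            )
        return value
```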

We also need to evaluate existing usage patterns and ensure that any integrations known to be safe, and that rely on features requiring a higher level of trust, will continue to work. We should also ensure that existing requests, regardless of trust level, do not 403. Instead, I propose that we clamp parameters to the maximum value of their application scope and, along with the warning mentioned above, monitor the frequency of requests exceeding granted privileges on a per-application basis. A new log line for each request that exceeds its privileges, including the ungranted privilege and the client application name, would be sufficient for analysis after deployment.

When a request exceeds the privileges, clamp the parameter down to the anonymous maximum.
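
Combining the clamp and the monitoring log line, a sketch (names and the baseline constant are illustrative; the log uses standard structured logging via `extra`):

```python
import logging

logger = logging.getLogger(__name__)

DEFAULT_MAX_PAGE_SIZE = 20  # assumed baseline allowance for the default scope


def clamp_page_size(requested: int, application) -> int:
    """Clamp page_size to the default maximum unless the application holds the privilege."""
    if application is not None and "increased_page_size" in application.privileges:
        return requested
    if requested > DEFAULT_MAX_PAGE_SIZE:
        logger.info(
            "request_exceeded_privilege",
            extra={
                "ungranted_privilege": "increased_page_size",
                "client_application_name": getattr(application, "name", None),
                "requested_page_size": requested,
            },
        )
        return DEFAULT_MAX_PAGE_SIZE
    return requested
```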

Specifics for each parameter

Based on the table below, I've made suggestions for the behaviour of each specific parameter.

The table was pulled from a logs analysis of the last month of API nginx logs for paginated requests, using this query:

```
filter @logStream like "nginx"
| parse request /(?<method>(GET|POST|PUT|DELETE)) \/v1\/(?<mediaType>(images|audio))\/((?<resource>[0-9a-z\-\/]*)\/)?(?<queryParams>\?.*)? /
| parse queryParams /page=(?<page>\d+)/
| parse queryParams /page_size=(?<pageSize>\d+)/
| fields isempty(resource) and not isempty(mediaType) as isSearch, page * least(20, pageSize) as paginationDepth
| filter isSearch and page > 1 and status >= 200 and status <= 300
| stats count() as paginatedQueryCount,
    max(greatest(1, page)) as maxPage,
    min(least(1, page)) as minPage,
    max(greatest(20, pageSize)) as maxPageSize,
    min(least(20, pageSize)) as minPageSize,
    avg(least(20, pageSize)) as averagePageSize
    by client_application_name
```

For privacy, I've passed the client application names through sha1.

| client_application_name | paginatedQueryCount | maxPage | minPage | maxPageSize | minPageSize | averagePageSize |
| --- | --- | --- | --- | --- | --- | --- |
| | 666581 | 100 | 1 | 20 | 1 | 19.992 |
| 5dceabe8e24ce99c33a4e2c82b44e7d7c04df303 | 12167 | 20 | 1 | 24 | 20 | 20 |
| 235e9ac0fe5d168ed2c8245440cbce8c3deca272 | 153 | 11 | 1 | 50 | 20 | 20 |
| 09323e87079447d33d3564d1111d060693310236 | 246 | 12 | 1 | 25 | 20 | 20 |
| 46dc23b46b51e47c710bdae407469947973b7e85 | 106 | 11 | 1 | 500 | 1 | 18.8962 |
| 76e45af6b84391eb0a74170a0c281a9f33bcf3e3 | 447 | 20 | 1 | 20 | 20 | 20 |
| da0e53723b9999ce1d916823b2af6f732d458bb6 | 190 | 20 | 1 | 24 | 20 | 20 |
| 101c63b2f5d652671f29898455f17da436074a2b | 123 | 20 | 1 | 500 | 10 | 16.9106 |
| 3ea411a77707d2551bee90c10260d4f2ab1bad0c | 87 | 6 | 1 | 25 | 20 | 20 |
| 912a6241f10c3faa0460b71511259e1f15109b7d | 19 | 6 | 1 | 25 | 20 | 20 |
| 2a3aec6e30b424572565e331fa566fa1615d249b | 17 | 5 | 1 | 25 | 20 | 20 |
| 3afac6d593a6457ca512f6664e135174a03f7c1f | 2 | 2 | 1 | 25 | 20 | 20 |
| 84023988640fd26e8f592a42160a2cfb536aa739 | 41 | 5 | 1 | 500 | 20 | 20 |
| 0854151fc8312612081a246d55ef8b3fe731cc61 | 17 | 3 | 1 | 100 | 20 | 20 |
| b8d84862bbaed24eba05fa85cc77373f75cac31b | 4 | 5 | 1 | 20 | 20 | 20 |
| a3213452eee0959445e8828c50637f9fd043ab18 | 5 | 2 | 1 | 20 | 20 | 20 |
| 8690a4e435333974f3d632c25351560abcac546e | 6 | 3 | 1 | 25 | 20 | 20 |
| b1c7ddb28dcbf9fe1c873c57fce8147db97d1c8a | 1 | 3 | 1 | 20 | 20 | 20 |
| 096fdac8d366831ded024ebd947369febb34472e | 2 | 2 | 1 | 20 | 20 | 20 |
| 9040e29a872a36dc4b6a5efc0e86e65328947469 | 1 | 2 | 1 | 30 | 20 | 20 |
| 2e2eb310033a0b99997c59a981197678d7b82991 | 10 | 5 | 1 | 20 | 20 | 20 |

Of these, the two with a maxPageSize of 500 are known to be unwanted requesters, and almost certainly scrapers.

page_size

Authenticated requesters should get a default maximum of 50. This is based on usage from known integrations: the integrations we are aware of and have relationships with cap out their page size at 50. One integration made over 100 requests in the last month with a page size up to 50. I don't see a problem with that, and 50 seems as good a number as any for the somewhat arbitrary cut-off of default privileges granted to authenticated requests.

page

If the page size maxes out at 50, a page maximum of 20 would allow 1000 images. I don't necessarily feel that's excessive, but I think we should reduce the default privilege to max out at 10 pages, if not 5. Most requesters do not go above 5, and all requesters should be respecting the page_count parameter on responses anyway, so reducing the maximum will not cause integration problems. Openverse search relevancy beyond the first few pages is not that useful in most cases anyway.

I ran another query to get the average page, and most are below 5 on average, even for large integrations.

I propose we start with 5 given the average case will not be affected, and given what we know about Openverse search relevancy. That creates a maximum result count of 250.

Alternatives

Continue to have extremely broad and generous API allowances that make ToS abuse trivial and impossible to reasonably regulate. We could simply ignore this issue, but if we did, I question whether our ToS would ever be enforceable or how transparently we could ever apply the policy.

zackkrida commented 5 months ago

Thanks for putting this together! I think a new ThrottledApplication concept which enforces result page size and page depth is a great idea. The approaches to avoiding breaking existing integrations also sound good.

It sounds like this will require a few issues. Do you think it warrants a dedicated project or milestone?

sarayourfriend commented 5 months ago

I think it only requires one issue for implementation, with a follow-up issue of non-code work to check on the monitoring.

The Django admin changes should be minimal, just configuring the new field to be editable.

We already have facilities for adding warnings to searches, which are the only relevant requests for this work. Likewise, we have established logging practices; adding a new log line is trivial (though we'll want to be smart about the properties of that line).

Adding a note to the documentation doesn't seem like a separate issue either.

Please let me know if I've missed something you think would warrant a separate issue.

zackkrida commented 5 months ago

Nope, thanks @sarayourfriend

sarayourfriend commented 5 months ago

Great, I'll update the PR description with specific implementation details.

Updated with details for how to implement. I'll go ahead and run a logs query in production to see if I can identify any requesters using these large query parameters that should be able to continue doing so (I'm pretty sceptical!).

AetherUnbound commented 5 months ago

Thanks for bringing all this information together, and for sharing the queries made to gather this information - that's awesome!

One question about page depth: I do sometimes find myself going past 5 pages in the frontend when searching for images. However, 5 pages at the default of 20 is only 100 images. Should we consider the anonymous limit as a total image limit, rather than discrete page size/page depth limits? Or maybe setting the max page depth to 10 would be the best compromise, especially given how many of the high-requesters you posted above are above 5 for max page depth.

sarayourfriend commented 5 months ago

Limiting max result depth (in combination with page size), rather than max page number, sounds good to me! It simplifies the rationale of the limit too, at least to me: we're limiting to 250 results, rather than, technically, a variety of limits that depend on the page size used.
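
For reference, a tiny sketch of that result-depth framing, assuming the 50 × 5 = 250 budget proposed above (the constant and helper name are illustrative):

```python
DEFAULT_MAX_RESULT_DEPTH = 250  # 50 page_size * 5 pages from the proposal above


def clamp_page(page: int, page_size: int, has_depth_privilege: bool) -> int:
    """Cap the deepest reachable result rather than the page number itself."""
    if has_depth_privilege:
        return page
    deepest_allowed_page = max(1, DEFAULT_MAX_RESULT_DEPTH // max(1, page_size))
    return min(page, deepest_allowed_page)
```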

Combined with rate limiting, sounds good to me!