WordPress / openverse-api

The Openverse API allows programmatic access to search for CC-licensed and public domain digital media.
https://api.openverse.engineering/v1
MIT License

Create pre-filtered secondary indexes and add ability to automatically filter sensitive terms at query time #1108

Closed · sarayourfriend closed this 1 year ago

sarayourfriend commented 1 year ago

Fixes

Related to https://github.com/WordPress/openverse/issues/721 by @zackkrida

Fixes https://github.com/WordPress/openverse/issues/750 by @obulat (at least potentially, based on loose decisions we made last week during offline chats)

Description

Adds a new settings variable, SENSITIVE_TERMS. It should be a comma-separated list of terms to exclude, parsed on application startup. This variable exists for both the ingestion server and the Django API.
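For illustration, here is a minimal sketch of how such a setting could be parsed into a tuple at startup; the variable handling in the actual code may differ:

```python
# Minimal sketch of parsing a comma-separated SENSITIVE_TERMS environment
# variable into a tuple at startup; the real settings code may differ.
import os

SENSITIVE_TERMS: tuple[str, ...] = tuple(
    term.strip()
    for term in os.environ.get("SENSITIVE_TERMS", "").split(",")
    if term.strip()
)
```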

In the ingestion server, the terms are used to create a filtered index via the reindex API. I've updated load_sample_data.sh to create this filtered index and allow for easy local testing. Note that the filtered index uses the terms "dog" and "water".
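Roughly, the filtered index can be produced with a `_reindex` request whose source query excludes documents matching any sensitive term. The following is a sketch under that assumption; the field list and function name are illustrative, not necessarily what the ingestion server uses:

```python
# Sketch of a _reindex request body that copies only documents NOT matching
# any sensitive term. Field names here are illustrative.
def build_filtered_reindex_body(source_index, dest_index, sensitive_terms):
    return {
        "source": {
            "index": source_index,
            "query": {
                "bool": {
                    "must_not": [
                        {
                            "multi_match": {
                                "query": term,
                                "fields": ["title", "description", "tags.name"],
                            }
                        }
                        for term in sensitive_terms
                    ]
                }
            },
        },
        "dest": {"index": dest_index},
    }

# e.g. POST /_reindex with
# build_filtered_reindex_body("image", "image-filtered", ("dog", "water"))
```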

In the Django API, the terms are excluded from search via an inverted MultiMatch query. The approach is naive and may perform poorly in production. It may be necessary to connect a local box to the production Elasticsearch to try a query with "dog" excluded, for example, to see how it performs (or to test some other way, maybe using the staging Elasticsearch cluster; cc @AetherUnbound @obulat @krysal, who may all have better ideas for how to safely test). I also don't know if this will scale if we have, say, 100 or so terms. I'd wager it should be fine, but that's a completely naive guess founded on essentially nothing other than trusting Elasticsearch's ability to aggregate such queries.
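As a sketch of the query-time approach, assuming elasticsearch_dsl, each sensitive term becomes a must_not clause on the search; the field list and helper name below are illustrative, not the PR's exact code:

```python
# Sketch of excluding sensitive terms at query time with elasticsearch_dsl.
# Field list and function name are illustrative.
from elasticsearch_dsl import Search


def exclude_sensitive_terms(s: Search, sensitive_terms: tuple[str, ...]) -> Search:
    for term in sensitive_terms:
        # Each exclude() adds a must_not clause to the underlying bool query,
        # mirroring the query used to build the filtered index.
        s = s.exclude(
            "multi_match",
            query=term,
            fields=["title", "description", "tags.name"],
        )
    return s
```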

I don't know if this is the right approach. The significant alternative I can imagine is storing the list of terms in Postgres and caching their retrieval for 30 minutes to an hour (potentially busting the cache immediately when the model saves). This would allow us to change the terms without needing to redeploy. It makes sharing the list slightly harder because we'd have to give out Django Admin access or export the list from Django Admin. Leaving it as an environment variable allows the sharer to copy/paste the list out of the private infrastructure repository instead. A production redeployment currently takes ~10 minutes and is tedious now, but deployments will be easier (and faster when staying on the same version) once the ECS migration is completed.
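A rough sketch of what that alternative could look like inside a hypothetical Django app; the model, cache key, and timeout are all made up for illustration:

```python
# Hypothetical sketch of the Postgres-backed alternative: a model holding the
# terms, cached for ~30 minutes and busted immediately on save.
from django.core.cache import cache
from django.db import models

CACHE_KEY = "sensitive_terms"


class SensitiveTerm(models.Model):
    name = models.CharField(max_length=255, unique=True)

    def save(self, *args, **kwargs):
        super().save(*args, **kwargs)
        cache.delete(CACHE_KEY)  # bust the cache as soon as the list changes


def get_sensitive_terms() -> tuple[str, ...]:
    terms = cache.get(CACHE_KEY)
    if terms is None:
        terms = tuple(SensitiveTerm.objects.values_list("name", flat=True))
        cache.set(CACHE_KEY, terms, timeout=30 * 60)  # cache for 30 minutes
    return terms
```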

Testing Instructions

By default, the local environment is set up with two excluded terms for the filtered index: "dog" and "water". For the Django API, the excluded terms are "spoiled" and "perched". I'd advise searching these terms on main first, before running this branch, and noting the result counts and such for related queries. I tested using images, so the following instructions only include specifics for images, but the same principles apply for audio.

Make a query for "dog" and "water" and pick out a separate word that would hit that document. For example, "running" will include 2 "water" documents (amongst others of people running).

Afterwards, run this branch and make the same queries with mature=True and mature=False (the latter being the default). You should see different behaviour. When mature results are excluded, for "running" you should get 2 fewer documents, equivalent to searching "running -water". For "dog" you should get no documents. When mature results are included via the query parameter, you'll receive the same results as on main.

Again, those are for the images index, but the same principles will apply for audio.

To test the API terms, follow the same pattern of testing, but use the terms listed above for the query-time feature instead of the filtered index feature.
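If it helps, the comparison described above can be scripted against the local API along these lines; the port, path, and response field names are assumptions about a default local setup:

```python
# Hypothetical helper for comparing result counts with and without the
# sensitive-term filter; the local port and response fields are assumptions.
import requests

BASE = "http://localhost:8000/v1/images/"


def result_count(query: str, include_mature: bool) -> int:
    params = {"q": query, "mature": str(include_mature).lower()}
    return requests.get(BASE, params=params).json()["result_count"]


for q in ("running", "dog", "water"):
    print(q, result_count(q, False), result_count(q, True))
```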


github-actions[bot] commented 1 year ago

API Developer Docs Preview: Ready

https://wordpress.github.io/openverse-api/_preview/1108

Please note that GitHub Pages takes a little time to deploy newly pushed code. If the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

sarayourfriend commented 1 year ago

This PR will need to include an update to the "Search Algorithm" documentation describing this new feature.

zackkrida commented 1 year ago

This is awesome! I had a random thought about how to test it (and potentially other changes like this) against production data:

What if we only applied this to searches with a "secret password" in the query? So, it'd work something like this in production:

  1. A user searches "dog". No sensitive keyword filtering is applied.
  2. An Openverse contributor searches "dog SECRET_PASSWORD". Sensitive keyword filtering is applied and the SECRET_PASSWORD is stripped from the query before searching ES.

This would allow us to test in production for some arbitrary period so we could make performance comparisons. Of course we can already do things like this with the API and user permissions, but this way allows us to quickly compare the same query against the same API or frontend instance with different functionality enabled.
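Roughly, the idea in code; the password value and parameter handling here are purely illustrative:

```python
# Purely illustrative sketch of the "secret password" toggle: strip the
# password from the query and use its presence to enable the filter.
SECRET_PASSWORD = "openverse-filter-test"  # would come from private config


def preprocess_query(q: str) -> tuple[str, bool]:
    """Return the cleaned query and whether sensitive-term filtering applies."""
    if SECRET_PASSWORD in q:
        return q.replace(SECRET_PASSWORD, "").strip(), True
    return q, False
```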

sarayourfriend commented 1 year ago

@dhruvkb I was wondering if you could help me take a look at the ingestion server tests? I made some changes to the ordering so that it was easier to put new tests in the middle without having to update every other test's order number. That appears to be working fine, but no matter where I put the tests for creating the filtered index and pointing the alias in the order, it seems to cause subsequent tests that use the promote action to fail. I'm not entirely sure why. I'd appreciate any insight you might have into this issue.

stacimc commented 1 year ago

This is so cool! I'm really excited to test this with more data 😮 The 'password' idea from @zackkrida sounds really interesting.

Is this PR meant as an exploration or do you intend to actively keep pushing this one? I have some questions but obviously not urgent if this is on the backburner.

I understand the filtered index in the ingestion server. If I search "dog photo" with the mature filter enabled, I will end up with zero results, because everything matching my query is also excluded from the index. I will not receive partial match results (meaning, things that match "photo" but not "dog"). The filtering also happens on every query, including ones that don't intentionally query on a sensitive term: so if I search just "photo", this time I'll get lots of results, but I still won't see any dog photos (or water photos, for that matter).

I am confused about the API filtering, though. It looks like it detects sensitive terms in the query params, and then excludes results that match only those terms? So, if one of the configured terms is "perched":

Is that interpretation of what it's supposed to do correct? Records matching sensitive terms defined in the API's list are only filtered when those terms actually appear in query params?

I think I'm wrong about that, but I'm not sure what I'm missing. When I query ?q="bird perched" locally with the mature filter enabled, I get 25 results including many with "perched" in the title or tags 🤔 When I add mature=False I get one additional result (I did not have time to look into what was different with the one excluded record).

sarayourfriend commented 1 year ago

Is this PR meant as an exploration or do you intend to actively keep pushing this one? I have some questions but obviously not urgent if this is on the backburner.

I thought it could be merged, but I still have not received clarification from others on the team about whether they want that to happen (I asked during our retrospective). I haven't heard anyone say to "stop" working on this or that it shouldn't move forward, so assuming there aren't big problems with it, we could try it. Then again, if it's not something we think we would enable any time soon, then I should close the PR as an unmerged proof of concept to be referred back to later. Either way is fine with me.

I am confused about the API filtering, though. It looks like it detects sensitive terms in the query params, and then excludes results that match only those terms? So, if one of the configured terms is "perched":

Your summary is incorrect but the behaviour you're seeing is reproducible. Just to clarify the behaviour first though: the code applies the sensitive word filter always to all queries, for all sensitive words, regardless of whether they appear in the query. In fact, it applies the filter in precisely the same way as the filtered index is produced, so the behaviour is (essentially) the same.

This is the part of the code that applies the filter: https://github.com/WordPress/openverse-api/pull/1108/files#diff-1f1af6f89cdc3071047abe1d692e5803c38df879f2da130bf178dcb444cb8e28R345-R347

It doesn't check any query params aside from whether the mature filter is disabled. None of its operation or implementation depends on any other query parameters or their values.

There was, however, a bug in the environment variable reading implementation. It wasn't applying any sensitive term filters at runtime because the cast was creating a generator, not a tuple. I've fixed this now and the behaviour you were seeing is no longer reproducible. If you search "bird", you won't see any results for "perched". If you search "bird perched", you will also not see results for "perched" (unless you disable the mature filter, to be clear). When you search, if you look at the logs, you'll be able to see the multimatch queries being sent at all times, regardless of what the other terms of the query are.
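For reference, the shape of that bug; the actual settings code differs, but the key point is that a generator can only be iterated once, whereas a tuple can be reused on every request:

```python
raw = "spoiled,perched"

# Broken: a generator expression is exhausted after its first iteration,
# so later uses see no terms at all.
broken_terms = (term.strip() for term in raw.split(","))

# Fixed: materialise the terms into a tuple once, at startup.
fixed_terms = tuple(term.strip() for term in raw.split(","))
```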

sarayourfriend commented 1 year ago

Closing this PR again as I don't have any idea whether anyone else wants this to move forward, and I do not feel confident about getting it reviewed and merged before the 17th when @dhruvkb will be doing the monorepo migration.

stacimc commented 1 year ago

Closing this PR again as I don't have any idea whether anyone else wants this to move forward and I do not feel confident about getting it reviewed and merged before the 17th

Noted. Leaving the comment I was working on for posterity when this is revisited. For the record I think this is really exciting.

Your summary is incorrect but the behaviour you're seeing is reproducible. Just to clarify the behaviour first though: the code applies the sensitive word filter always to all queries, for all sensitive words, regardless of whether they appear in the query. In fact, it applies the filter in precisely the same way as the filtered index is produced, so the behaviour is (essentially) the same.

That makes way more sense 😅 I was trying to make sense of my test behavior and assumed the filters must do different things and work together somehow. The context I was missing/had forgotten was in the comments of one of the linked issues: they do the same thing, but the reason for also having filtering in the API is to allow for adding additional sensitive terms in a hypothetical emergency without a redeploy.

Thank you for the explanation! I'm inclined to agree with your comment about the API filtering possibly not being needed, especially to your point on this issue about the deploys being fairly quick and getting even better 😄 That said, you've already done the work and it works great!