AetherUnbound opened 8 months ago
@AetherUnbound I was wondering if you had any thoughts about when we might prioritise this? I think it would have benefits for service reliability, particularly with index redundancy, if we had fewer indexes spread across the data nodes (we could increase the number of replicas, for example, so that no single node carries a significant amount of any one index).
I also wanted to record some thoughts I had today about how this could work, particularly regarding the new `sensitive` field on the index. I'm not sure if you used "keyword" intentionally, Madison, but it would be great to have it actually be a keyword list field, mirroring the `sensitivity_reasons` that we return from the API. So if the work is sensitive due to a confirmed user report, we would add `user_reported` to the `sensitive` keyword list. If the work has sensitive text, detected during ingestion, we would add `sensitive_text` to the `sensitive` keyword list on the work.
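As a concrete sketch of that idea (the reason values mirror the API's `sensitivity_reasons`; the exact document shape here is an illustrative assumption, not the real index mapping):

```python
# Sketch of how the proposed `sensitive` keyword-list field might look on
# indexed documents. The reason values mirror the API's `sensitivity_reasons`;
# the document shape is an assumption for illustration only.

# A work flagged both by a confirmed user report and by text detection:
flagged_work = {
    "identifier": "abcd-1234",
    "title": "Example work",
    "sensitive": ["user_reported", "sensitive_text"],
}

# A non-sensitive work simply carries an empty list:
safe_work = {
    "identifier": "efgh-5678",
    "title": "Another work",
    "sensitive": [],
}

def sensitivity_reasons(doc):
    # The API response's sensitivity reasons fall directly out of the field,
    # with no extra ES or database lookups.
    return doc["sensitive"]
```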
When filtering for non-sensitive works (the default behaviour), instead of using `mature=false` in the query, we'd query for works where `sensitive` is an empty list. This makes the sensitivity reasons on the API response trivial to construct. It would save at least one ES query for every search, and could save database queries as well for image results and audio results without waveform data.
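A rough sketch of that default filter in the query DSL (the surrounding query shape is an assumption; note that in Elasticsearch a document whose array field is empty has no indexed values for that field, so "`sensitive` is an empty list" is naturally expressed as the field not existing):

```python
# Hypothetical default filter for the single-index approach. An empty
# `sensitive` array indexes no values, so "sensitive is empty" becomes a
# must_not/exists clause rather than a comparison against [].
def non_sensitive_filter():
    return {
        "bool": {
            "must_not": [{"exists": {"field": "sensitive"}}],
        }
    }

# The two-index-era filter this would replace:
def current_filter():
    return {"term": {"mature": False}}
```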
That's a good question - I imagine it would affect the work being done as part of #383. Do you think it would make sense to include this in that effort at all, or should it be a separate prioritization?
Can you specify how you see the relation to the other project? The moderation workflows project wouldn't affect the way user-reported sensitivity is detected during ingestion (which is done by checking for an entry in the `SensitiveMedia` tables). That remains the same in the moderation workflows project.
I think this one could be done agnostic of the moderation workflow changes, but I might be missing a connection between them.
I see! I was thinking it might be involved because both (seemingly, to me) involve modifying the `mature`/`sensitive` aspects of the Elasticsearch index, so it might be better to have them occur at the same time. If they can be done distinctly, then perhaps that's best.
Gotcha. #383 actually won't change anything about the way we handle `mature`/`sensitive` in our indexes; we'll keep using the same method of reading from the sensitive media tables at indexing time. I think this IP would augment that process with additional checks on the sensitive terms list, in addition to potential changes to how we represent that information in the index (as I suggested above).
@stacimc Heads up that it would be good to keep this IP in mind during the ingestion server removal project. Mainly just want to make sure that we don't make it harder to remove the double-index approach than it is now. I'd be surprised if we managed to make it harder, but worth checking to be sure.
Just dropping a note on how to potentially reduce the subset of works that need to be scanned: it's theoretically possible for us to exclude a subset of providers from sensitive content detection, I think. Nappy and WP Photo Directory, for example, are definitely safe to skip. Their datasets are relatively small, but even then, from the perspective of trying to reduce false positives, I think it's a good idea to not use our naive text analysis method when we know that the provider is safe.
We could implement this in the existing reindex query by adding an OR clause to the boolean when creating the filtered index: `source in <trusted subset> OR <existing terms query>`
For a single index approach, it would just mean skipping whatever text analysis we end up doing when the source is in the trusted subset.
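That OR clause might look roughly like this in the query DSL; both the trusted source list and the shape of the existing terms query are illustrative assumptions here, not the actual reindex query:

```python
# Sketch of a reindex-time filter with a trusted-provider escape hatch.
# TRUSTED_SOURCES and the terms-query shape are illustrative assumptions.
TRUSTED_SOURCES = ["nappy", "wordpress"]

def filtered_index_query(sensitive_terms):
    # A work belongs in the filtered index if its source is trusted OR it
    # passes the existing "no sensitive terms" check.
    return {
        "bool": {
            "should": [
                {"terms": {"source": TRUSTED_SOURCES}},
                {"bool": {"must_not": [
                    {"match": {"title": " ".join(sensitive_terms)}},
                ]}},
            ],
            "minimum_should_match": 1,
        }
    }

def needs_text_scan(source):
    # For the single-index approach: skip text analysis entirely for
    # trusted providers.
    return source not in TRUSTED_SOURCES
```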
Description
After consulting with some Elasticsearch experts, we've determined that we may be able to improve Elasticsearch performance by using a single index with a `sensitive` keyword rather than two indices (filtered & unfiltered). The existence of two indices that are being regularly queried means that Elasticsearch must load records off disk from both indices much more frequently than it otherwise would. The difference in disk ops between querying two indices and querying only one can be seen in this graph:

The easiest way to accomplish this might be an extra step during the initial indexing when building the document to feed into Elasticsearch, which uses the sensitive terms list to determine a `sensitive` keyword field on the document.

It might also be possible to build and index the documents in a first step, defaulting to `sensitive=False`, then process all the documents in a second pass once the index is built. This pass could query the documents matching the sensitive terms list and update the `sensitive` field for those documents (since we've seen they're about 1% of the total index size). We could use the bulk update API for this, although we've been discouraged from doing so since that would be a very long-running request against the cluster.
Alternatives
We can leave the cluster as-is with the two index setup; it has not been shown to affect the periodic response time spike we've been seeing recently.
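For reference, the second pass described above could be issued as a single update-by-query request; a minimal sketch of the request body follows (the searched fields, the query clause, and the script are assumptions drawn from this thread, and this is the long-running bulk-update style request the description cautions against):

```python
# Hypothetical request body for the second-pass update: find documents
# matching the sensitive terms list and mark their `sensitive` field.
# Field names and the matching clause are illustrative only.
def second_pass_body(sensitive_terms):
    return {
        "query": {
            "simple_query_string": {
                "query": " ".join(sensitive_terms),
                "fields": ["title", "description", "tags.name"],
            }
        },
        "script": {
            "source": "ctx._source.sensitive = ['sensitive_text']",
            "lang": "painless",
        },
    }
```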
Additional context
The original outline for the two index approach can be found here: https://docs.openverse.org/projects/proposals/trust_and_safety/detecting_sensitive_textual_content/20230330-implementation_plan_filtering_and_designating_results_with_sensitive_textual_content.html