WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License

Implementation Plan: Use a single index for all API results #3336

Open AetherUnbound opened 8 months ago

AetherUnbound commented 8 months ago

Description

After consulting with some Elasticsearch experts, we've determined that we may be able to improve Elasticsearch performance by using a single index with a `sensitive` keyword field rather than two indices (filtered & unfiltered). Because both indices are queried regularly, Elasticsearch must load records off disk from both of them far more frequently than it would with a single index. The difference in disk ops between querying two indices and only one index can be seen in this graph:

[Graph: disk operations for the two-index vs. single-index query patterns]

The easiest way to accomplish this might be to add an extra step during the initial indexing: when building the document to feed into Elasticsearch, use the sensitive terms list to determine a `sensitive` keyword field on the document.
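
As a rough sketch of that idea (the terms, helper, and field names below are illustrative placeholders, not what the ingestion code actually uses):

```python
# Illustrative sketch only; the terms, field names, and matching logic are
# placeholders, not the real ingestion-server implementation.
SENSITIVE_TERMS = {"example_term_a", "example_term_b"}  # stand-in for the real list


def contains_sensitive_terms(record: dict) -> bool:
    """Naively check the record's text fields against the sensitive terms list."""
    text = " ".join(
        str(record.get(field, "")) for field in ("title", "description", "tags")
    ).lower()
    return any(term in text for term in SENSITIVE_TERMS)


def build_es_document(record: dict) -> dict:
    doc = dict(record)
    # Single-index approach: sensitivity is marked on the document itself
    # instead of being expressed by which index the document lands in.
    doc["sensitive"] = contains_sensitive_terms(record)
    return doc
```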

It might also be possible to build and index the documents in a first step and default to sensitive=False, then process all the documents in a second pass once the index is built. This could query the documents matching the sensitive terms list and update the sensitive field for those documents (since we've seen they're about 1% of the total index size). We could use the bulk update API for this, although we've been discouraged from doing so since that would be a very long-running request against the cluster.
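
For the second-pass variant, the update could look roughly like the following, assuming the Elasticsearch 8.x Python client and an illustrative index name and query shape (and keeping in mind the caveat above about this being a long-running request):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

# Second pass: flip `sensitive` to true for documents matching the
# sensitive terms list. Index name and query shape are illustrative only.
es.update_by_query(
    index="image",
    query={
        "multi_match": {
            "query": "example_term_a example_term_b",
            "fields": ["title", "description", "tags.name"],
        }
    },
    script={"source": "ctx._source.sensitive = true", "lang": "painless"},
    conflicts="proceed",  # don't abort on version conflicts during the long run
)
```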

Alternatives

We can leave the cluster as-is with the two-index setup; it has not been shown to affect the periodic response time spikes we've been seeing recently.

Additional context

The original outline for the two index approach can be found here: https://docs.openverse.org/projects/proposals/trust_and_safety/detecting_sensitive_textual_content/20230330-implementation_plan_filtering_and_designating_results_with_sensitive_textual_content.html

sarayourfriend commented 6 months ago

@AetherUnbound I was wondering if you had any thoughts about when we might prioritise this? I think it would have benefits for service reliability, particularly with index redundancy, if we had fewer indexes spread across the data nodes (we could increase the number of replicas, for example, so that no single node carries a significant amount of any one index).

I also wanted to record some thoughts I had today about how this could work, in particular regarding the new sensitive field on the index. I'm not sure if you used "keyword" intentionally, Madison, but it would be great to have it actually be a keyword list field, mirroring the "sensitivity_reasons" that we return from the API. So if the work is sensitive due to a confirmed user report, we would add user_reported to the sensitive keyword list. If the work has sensitive text, detected during ingestion, we would add sensitive_text to the sensitive keyword list on the work.
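
To illustrate, the mapping change could be something like the sketch below (field and value names are just the ones from this comment, not the actual Openverse mapping):

```python
# Sketch only: `sensitive` as a keyword field. Elasticsearch lets any field
# hold multiple values, so this behaves as a list of sensitivity reasons.
MEDIA_MAPPING = {
    "mappings": {
        "properties": {
            "sensitive": {"type": "keyword"},
            # ... the rest of the existing media properties ...
        }
    }
}

# Example documents under this mapping:
#   confirmed user report:              {"sensitive": ["user_reported"]}
#   sensitive text found at ingestion:  {"sensitive": ["sensitive_text"]}
#   both:                               {"sensitive": ["user_reported", "sensitive_text"]}
#   non-sensitive work:                 {"sensitive": []}
```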

When filtering for non-sensitive works (the default behaviour), instead of using mature=false in the query, we'd query for works where sensitive is an empty list. This makes the sensitivity reasons on the API response trivial to construct. It would save at least one ES query for every search, and could save database queries as well for image results and audio results without waveform data.
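
A sketch of that default filter is below. One detail worth noting: Elasticsearch doesn't index empty arrays, so "sensitive is an empty list" is most naturally expressed as "the field has no value".

```python
# Default (non-sensitive) filter under the single-index scheme. Since an
# empty array is not indexed, "sensitive is empty" becomes "sensitive does
# not exist". The field name follows the suggestion above.
DEFAULT_NON_SENSITIVE_FILTER = {
    "bool": {
        "must_not": [
            {"exists": {"field": "sensitive"}},
        ]
    }
}

# When sensitive results are requested, this filter is simply dropped, and
# the document's `sensitive` values can be echoed back as the API's
# sensitivity reasons without any additional queries.
```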

AetherUnbound commented 5 months ago

That's a good question - I imagine it would affect the work being done as part of #383. Do you think it would make sense to include this in that effort at all, or should it be a separate prioritization?

sarayourfriend commented 5 months ago

Can you specify how you see the relation to the other project? The moderation workflows project wouldn't affect the way user-reported sensitivity is detected during ingestion (which is done by checking for an entry in the SensitiveMedia tables). That's remaining the same in the moderation workflows project.

I think this one could be done agnostic of the moderation workflow changes, but I might be missing a connection between them.

AetherUnbound commented 5 months ago

I see! I was thinking it might be involved because both (seemingly, to me) involve modifying the mature/sensitive aspects of the Elasticsearch index, so it might be more ideal to have them occur at the same time. If they can be done distinctly, then perhaps that's best.

sarayourfriend commented 5 months ago

Gotcha. #383 actually won't change anything about the way we handle mature/sensitive in our indexes; we'll keep using the same method of reading from the sensitive media tables at indexing time. I think this IP would augment that process with additional checks against the sensitive terms list, in addition to potential changes to how we represent that information in the index (as I suggested above).

#383 doesn't touch the ingestion server at all, in fact, and doesn't modify search either, so this IP and that project can happen in any order without interrupting each other.

sarayourfriend commented 3 months ago

@stacimc Heads up that it would be good to keep this IP in mind during the ingestion server removal project. Mainly just want to make sure that we don't make it harder to remove the double-index approach than it is now. I'd be surprised if we managed to make it harder, but worth checking to be sure.

sarayourfriend commented 1 month ago

Just dropping a note on how we could potentially reduce the subset of works that need to be scanned: it's theoretically possible for us to exclude a subset of providers from sensitive content detection, I think. Nappy and WP Photo Directory, for example, are definitely safe to skip. Their datasets are relatively small, but even then, from the perspective of trying to reduce false positives, I think it's a good idea not to use our naive text analysis method when we know the provider is safe.

We could implement this in the existing reindex query by adding an OR clause to the boolean when creating the filtered index: source in <trusted subset> OR <existing terms query>
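
Expressed as a bool query, that could look something like the sketch below (source identifiers are illustrative, and the existing terms query is left as a placeholder):

```python
# Sketch of the OR clause for the filtered-index reindex query; a `should`
# list with minimum_should_match=1 acts as the OR.
EXISTING_TERMS_QUERY = {"...": "..."}  # the current filtered-index query, unchanged
TRUSTED_SOURCES = ["nappy", "wordpress_photo_directory"]  # illustrative identifiers

FILTERED_INDEX_QUERY = {
    "bool": {
        "should": [
            {"terms": {"source": TRUSTED_SOURCES}},  # source in <trusted subset>
            EXISTING_TERMS_QUERY,                    # OR <existing terms query>
        ],
        "minimum_should_match": 1,
    }
}
```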

For a single index approach, it would just mean skipping whatever text analysis we end up doing when the source is in the trusted subset.
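
In code, that could be as simple as a guard like this (again, illustrative names only):

```python
TRUSTED_SOURCES = {"nappy", "wordpress_photo_directory"}  # illustrative identifiers


def needs_sensitive_text_scan(record: dict) -> bool:
    # Single-index approach: records from trusted providers skip the naive
    # text analysis entirely, which also avoids false positives for them.
    return record.get("source") not in TRUSTED_SOURCES
```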