WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License

Sensitive terms list produces too many clauses for create filtered index call #2328

Closed · AetherUnbound closed this issue 1 year ago

AetherUnbound commented 1 year ago

Airflow log link

https://airflow.openverse.engineering/log?execution_date=2023-06-04T00%3A32%3A02.977074%2B00%3A00&task_id=wait_for_create_and_populate_filtered_index&dag_id=create_filtered_audio_index&map_index=-1

Note: Airflow is currently only accessible to maintainers & those given access. If you would like access to Airflow, please reach out to a member of @WordPress/openverse-maintainers.

Exception from the ingestion server:

2023-06-04 00:32:12,485 WARNING base.py:288 - POST http://openverse-elasticsearch-prod.private:9200/_reindex?slices=auto&wait_for_completion=true [status:400 request:0.960s]
2023-06-04 00:32:12,485 ERROR tasks.py:220 - Error processing task `CREATE_AND_POPULATE_FILTERED_INDEX` for `audio`: RequestError(400, 'search_phase_execution_exception', 'too_many_clauses: maxClauseCount is set to 1024')
    2023-06-03T17:32:12.539-07:00   Process Process-5:
    2023-06-03T17:32:12.541-07:00   Traceback (most recent call last):
    2023-06-03T17:32:12.541-07:00   File "/usr/local/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    2023-06-03T17:32:12.541-07:00   self.run()
    2023-06-03T17:32:12.541-07:00   File "/usr/local/lib/python3.11/multiprocessing/process.py", line 108, in run
    2023-06-03T17:32:12.541-07:00   self._target(*self._args, **self._kwargs)
    2023-06-03T17:32:12.541-07:00   File "/ingestion_server/ingestion_server/tasks.py", line 217, in perform_task
    2023-06-03T17:32:12.541-07:00   func(model, **kwargs) # Directly invoke indexer methods if no task function
    2023-06-03T17:32:12.541-07:00   ^^^^^^^^^^^^^^^^^^^^^
    2023-06-03T17:32:12.541-07:00   File "/ingestion_server/ingestion_server/indexer.py", line 525, in create_and_populate_filtered_index
    2023-06-03T17:32:12.541-07:00   self.es.reindex(
    2023-06-03T17:32:12.541-07:00   File "/venv/lib/python3.11/site-packages/elasticsearch/client/utils.py", line 347, in _wrapped
    2023-06-03T17:32:12.541-07:00   return func(*args, params=params, headers=headers, **kwargs)
    2023-06-03T17:32:12.541-07:00   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    2023-06-03T17:32:12.541-07:00   File "/venv/lib/python3.11/site-packages/elasticsearch/client/__init__.py", line 1467, in reindex
    2023-06-03T17:32:12.541-07:00   return self.transport.perform_request(
    2023-06-03T17:32:12.541-07:00   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    2023-06-03T17:32:12.541-07:00   File "/venv/lib/python3.11/site-packages/elasticsearch/transport.py", line 466, in perform_request
    2023-06-03T17:32:12.541-07:00   raise e
    2023-06-03T17:32:12.541-07:00   File "/venv/lib/python3.11/site-packages/elasticsearch/transport.py", line 427, in perform_request
    2023-06-03T17:32:12.541-07:00   status, headers_response, data = connection.perform_request(
    2023-06-03T17:32:12.541-07:00   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    2023-06-03T17:32:12.541-07:00   File "/venv/lib/python3.11/site-packages/elasticsearch/connection/http_requests.py", line 216, in perform_request
    2023-06-03T17:32:12.541-07:00   self._raise_error(response.status_code, raw_data)
    2023-06-03T17:32:12.541-07:00   File "/venv/lib/python3.11/site-packages/elasticsearch/connection/base.py", line 328, in _raise_error
    2023-06-03T17:32:12.541-07:00   raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
    2023-06-03T17:32:12.541-07:00   elasticsearch.exceptions.RequestError: RequestError(400, 'search_phase_execution_exception', 'too_many_clauses: maxClauseCount is set to 1024')

Description

The most recent audio data refresh that ran over the weekend was the first run using the new sensitive terms list (see https://github.com/WordPress/openverse-infrastructure/pull/522). This sensitive terms list (not linked due to its sensitive nature) is almost 2k items long.

The default maximum clause count in ES7 is 1024, but it was apparently raised to 4096 in ES8.

We will need to either find a workaround or increase this configuration value for the cluster in order to proceed.

DAG status

Left enabled since pausing it may affect the data refresh itself.

sarayourfriend commented 1 year ago

This is unfortunate. Increasing the limit even to the new ES8 default won't help us. As I understand the way the "clauses" calculation works, based on the SO link you shared @AetherUnbound, it's terms * fields. We're querying 3 fields (title, tags, and description), so we'd have to reduce the list to ~1300 for it to even fit into the new default for ES8. Right now we'll have somewhere near 6k clauses (if we have 2k terms).
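
For illustration, here is a rough sketch of the exclusion query's shape (assumed names, not the exact indexer.py code), which shows why the clause count scales as terms * fields:

```python
# Rough sketch of the filtered-index exclusion query: one multi_match clause per
# sensitive term, each expanding across the three queried fields.
sensitive_terms = ["example term one", "example term two"]  # ~1933 terms in production

filtered_index_query = {
    "bool": {
        "must_not": [
            {
                "multi_match": {
                    "query": term,
                    "fields": ["title", "description", "tags.name"],
                }
            }
            for term in sensitive_terms
        ]
    }
}

# Rough clause count: len(sensitive_terms) * 3 fields, i.e. ~5800 clauses for
# ~1933 terms, far beyond the 1024 default.
```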

I can think of at least three options: one is horrific but would work as a hacky workaround, another would be a lot of work but is probably more stable, and the last is a big unknown, especially because we don't know exactly how to manage updating the cluster settings (maybe it's time to reach out for help provisioning an updated ES8 cluster).

  1. We could reduce the number of terms. This is a lot of work because we'd need to somehow identify the terms we most definitely want to keep. One option would be to iterate through the terms list, send each one individually as a query, and then pick the top x terms that would keep us below the limit: i.e., the terms that match the most documents (while still not being ambiguous terms).
  2. We could split the reindex into multiple reindex steps. Specifically, we could split the terms list into batches and then reindex over and over again to slowly apply all the terms to the final index (see the rough sketch after this list). This is probably way easier to implement, but my intuition says it's a horrifyingly hacky workaround.
  3. We could use the calculation ES itself uses to see if our nodes would be able to handle a higher clause count. If they cannot, then we will need to pursue one of the other options no matter what. If they can, then we can reach out for help from our more expert ES friends to apply the new settings without thrashing our cluster. We could also provision a brand new cluster from scratch with ES8, switch over to it at some point, and deprovision the old one (after sending a data refresh to the new one).
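
To make option 2 concrete, here is a very rough sketch (hypothetical index names and client setup, not something we've tested) of chaining reindex calls so each pass stays under the clause limit:

```python
# Hypothetical sketch of option 2: apply the exclusion terms in batches across a
# chain of reindex calls so that each query stays under the 1024-clause limit.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed cluster address
sensitive_terms: list[str] = []  # loaded from the (private) sensitive terms list

BATCH_SIZE = 340  # roughly 1024 clauses / 3 fields

def exclusion_query(terms_batch):
    return {
        "bool": {
            "must_not": [
                {
                    "multi_match": {
                        "query": term,
                        "fields": ["title", "description", "tags.name"],
                    }
                }
                for term in terms_batch
            ]
        }
    }

source = "audio"  # hypothetical source index name
for step, start in enumerate(range(0, len(sensitive_terms), BATCH_SIZE)):
    destination = f"audio-filtered-pass-{step}"  # intermediate indices would need cleanup
    es.reindex(
        body={
            "source": {
                "index": source,
                "query": exclusion_query(sensitive_terms[start : start + BATCH_SIZE]),
            },
            "dest": {"index": destination},
        },
        wait_for_completion=True,
    )
    source = destination  # the next pass filters the already-filtered index
```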

@zackkrida and @AetherUnbound, what do y'all think? Any ideas on which y'all would prefer?

obulat commented 1 year ago

Considering that Elasticsearch 7 might go EOL by August 2023[^1], I think we should prioritize migrating to ES8 anyway.

I might not be very familiar with the context here, but I think reducing the number of terms is necessary. Creating a DAG that ingests sensitive-media information from the providers might improve the detection of sensitive media, even if it does not help with this specific issue.

[^1]: Elasticsearch EOL docs: https://www.elastic.co/support/eol

zackkrida commented 1 year ago

My initial suggestion would be that we try a test on the staging cluster where we dramatically increase indices.query.bool.max_clause_count for the relevant index.

I believe this is a per-index setting, based on the docs, so perhaps it can be updated without needing to restart the cluster?

Since this will be executed on a single, scheduled query, I suspect it won't be nearly as resource intensive as most use cases for this setting 🤞

sarayourfriend commented 1 year ago

I believe this is a per-index setting, based on the docs, so perhaps it can be updated without needing to restart the cluster?

Which part of the docs says that it is a per-index setting? indices.query.bool.max_clause_count is a global setting, not a per-index setting. Index settings are documented here and there is no per-index analogue for this setting as far as I can tell.

We may be able to use this endpoint to update the setting without restarting, though: https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-update-settings.html
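
If the setting can in fact be applied dynamically, the call would look roughly like this (illustrative value; ES may reject it if the setting turns out to be static/node-level):

```python
# Hedged sketch of the cluster settings call; ES may reject it if
# indices.query.bool.max_clause_count is a static (elasticsearch.yml) setting.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed cluster address
es.cluster.put_settings(
    body={"persistent": {"indices.query.bool.max_clause_count": 5800}}  # illustrative value
)
```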

However, I don't think updating the setting to 5799 (the current clause count of our query: 1933 terms * 3 fields) will be successful. If you read the ES PR I linked, where they changed the calculation to be dynamic, you can see it's quite precisely calculated to avoid OOM exceptions. I don't know what happens if we cause an OOM in our staging cluster.

Since this will be executed on a single, scheduled query, I suspect it won't be nearly as resource intensive as most use cases for this setting

My understanding is that this setting has to do with whether the actual query even fits into memory in a single thread, at least based on that PR I linked. I'd be surprised if being an infrequent, scheduled query makes a difference to that characteristic.

I think it would be good to consult with some folks who understand ES better than we do at this point rather than potentially trashing our staging cluster with OOMs. At least I don't know what happens to the cluster if we do that and would prefer to avoid needing to rebuild it.

zackkrida commented 1 year ago

@sarayourfriend I had a thought related to this. These indexes use the same stemmer config as the previous/main/normal indices do, right?

I think that would mean that every phrase in our sensitive terms list gets stemmed, so something like "the foxes jumping quickly" would become [ the, fox, jump, quickli ].

We could potentially leverage this to remove redundancy from the sensitive terms list and get the number of terms down. Our sensitive terms list would become a list of stems and we'd probably be able to remove many duplicates.

The only place it doesn't work well is with the numeric substitutions (d0g) and spelling variants that omit letters.

Edit: I just tested this locally by stemming our entire terms list using the /_analyze ES endpoint on the image-sensitive-text-2023-03-23 staging index.

The list of 1933 produces 1533 unique stems.
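
For reference, a rough sketch of that experiment (assumed client setup; it runs each term through the analyzer attached to the index's title field):

```python
# Rough sketch of the stemming experiment: run each term through the analyzer
# attached to the index's title field and collect the unique stems it produces.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed staging/local address

def unique_stems(index: str, terms: list[str]) -> set[str]:
    stems: set[str] = set()
    for term in terms:
        response = es.indices.analyze(index=index, body={"field": "title", "text": term})
        stems.update(token["token"] for token in response["tokens"])
    return stems

# e.g. unique_stems("image-sensitive-text-2023-03-23", sensitive_terms)
```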

Alarmingly, a bunch of these stems are 1-2 letters, like "y", "z", "w", "up". Also innocent words like "love". This makes me wonder: is the sensitive index really matching on these stems?! It seems like this would mean there are many, many false positives.

Apologies if the index or the multimatch query doesn't use stemming for some reason, and this tangent is a huge waste of time.

sarayourfriend commented 1 year ago

Alarmingly, a bunch of these stems are 1-2 letters, like "y", "z", "w", "up". Also innocent words like "love". This makes me wonder: is the sensitive index really matching on these stems?! It seems like this would mean there are many, many false positives.

We quote the terms specifically so that they do not get stemmed for this reason. There are too many opportunities for false positives.

https://github.com/WordPress/openverse/blob/35100d3c719e142da2012303a5f2790c637f737a/ingestion_server/ingestion_server/indexer.py#L522

Multimatch can stem, we just explicitly prevent it from happening by quoting the terms.

The list of 1933 produces 1533 unique stems.

If the second number was way lower, it might be useful if we could identify specific stems that did not overlap with innocuous terms.

Right now my preference is for us to try the first approach: identify the top 341 terms that match the most results (and confirm that there aren't any potential big false positives in the list) and start off by using just those. 341 comes from 1024 / 3, the clause limit over the number of fields we query.

I don't have time to run this experiment right now, but I could start working on it next week. This week I need to finish #2343 and get #2332 merged. I could delay starting on #1969 this week in favour of this. Considering this project is already >50% implemented, it probably makes sense to do so. Let me know what you think about this prioritisation @zackkrida or if you have someone else in mind who could work on this sooner.

zackkrida commented 1 year ago

@sarayourfriend ah, of course, that's what the quoting is for 😌

I think the approach of significantly reducing the list is a sound one. By nature, this project is about having the biggest possible impact on safety while understanding it isn't going to capture everything, so working within this limitation feels fine.

I do think it would make sense to wait on #1969 to focus on this project, which is closer to completion. The terms experiment seems like it'd be pretty straightforward (write a one-off Python script, maybe as a DAG, to iterate through the terms, query ES, and print the record count for each term), but is the amount of time it would take to run feasible? It would have to iterate through all the records 1933 times, right? I suppose if batches of the queries were parallelized it might be faster... Edit: wait, they're just queries, not reindexing, so it would be much faster than I was originally thinking.
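
A rough sketch of what I'm picturing (hypothetical index name and client setup, not an actual Openverse script):

```python
# Hypothetical sketch of the per-term experiment: count how many documents each
# sensitive term matches, then keep the top N (e.g. the 341 mentioned above).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed cluster address
sensitive_terms: list[str] = []  # loaded from the (private) sensitive terms list

def match_count(index: str, term: str) -> int:
    query = {
        "multi_match": {
            "query": term,
            "type": "phrase",  # exact phrase, mirroring the quoted terms
            "fields": ["title", "description", "tags.name"],
        }
    }
    return es.count(index=index, body={"query": query})["count"]

counts = {term: match_count("audio", term) for term in sensitive_terms}
top_terms = sorted(counts, key=counts.get, reverse=True)[:341]
```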

Anyway, I think that prioritization makes sense, thanks for confirming.

sarayourfriend commented 1 year ago

wait, they're just queries, not reindexing, so it would be much faster than I was originally thinking

Yup! Individual Multimatch queries should be very fast. I'll just do it in a local Python script (which I'll share in a gist here) and share the results as well.

Thanks for the help prioritising this. I'll do a bit of proofreading and then I suspect I should be able to have this in a semi-workable state by my afternoon :rocket:

zackkrida commented 1 year ago

@sarayourfriend I got a little excited and just tried an experiment myself. I wrote up some initial findings in the sensitive terms repo (warning: contains sensitive text, naturally): https://github.com/WordPress/openverse-sensitive-terms/issues/3

sarayourfriend commented 1 year ago

Thanks! I left a comment on that issue as well after doing a bit of research based on what you found, and we might be able to solve this clause issue by switching to a terms query against new, non-analysed versions of the three fields. I'll update here later if that's possible. It would also solve the problem in the issue you've linked.

sarayourfriend commented 1 year ago

After further investigation, it does look like Zack and I have found an important change that also fixes this issue. Once a text field is analysed by ES, you can only query the tokenised version. That's to say, there's no way to do "exact matches" on an analysed field. In the issue Zack linked (Issue 3 in the sensitive terms repository), we've shown that adding unanalysed versions of the three fields we query for sensitive terms and then switching to ES terms query so that we can find exact matches fixes both the issue Zack found in the sensitive terms repo (lots of false positives due to stemming) as well as fixing the clause limit issue, because terms queries have their own, separate limit.

The solution for this issue, therefore, is to add the following additional raw field to each of title, description, and tags.name:

"raw": {
    "type": "text",
    "index": true
}

e.g.,:

"title": {
    "type": "text",
    "similarity": "boolean",
    "fields": {
        "keyword": {
            "type": "keyword",
            "ignore_above": 256
        },
        "raw": {
            "type": "text",
            "index": true
        }
    },
    "analyzer": "custom_english"
},

Additionally, we need to update the query the create_and_populate_filtered_index action uses for the reindex to be:

{
  "bool": {
    "must_not": [
      {"terms": {f"{field}.raw": sensitive_terms}}
      for field in ["title", "description", "tags.name"]
    ]
  }
}

In a related issue, this would also fix exact queries in the API if we switched to querying the raw fields and using a terms query when the query string starts and ends with ".
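
As a hypothetical illustration of that last point (not the API's actual query-building code), the exact-match path could look something like:

```python
# Hypothetical sketch: when the user query is fully quoted, match it against the
# unanalysed .raw fields with a terms query instead of the analysed fields.
def build_query(q: str) -> dict:
    if len(q) > 1 and q.startswith('"') and q.endswith('"'):
        exact = q[1:-1]
        return {
            "bool": {
                "should": [
                    {"terms": {f"{field}.raw": [exact]}}
                    for field in ["title", "description", "tags.name"]
                ]
            }
        }
    # Otherwise fall back to the regular analysed-field query (omitted here).
    return {"multi_match": {"query": q, "fields": ["title", "description", "tags.name"]}}
```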