WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License

Some Metropolitan reingestion days fail due to DuplicateTable #1322

Closed · stacimc closed this 1 year ago

stacimc commented 1 year ago

Description

On each run of the metropolitan_reingestion_workflow, a handful of reingestion days fail during the create_loading_table step due to the error:

psycopg2.errors.DuplicateTable: relation "provider_data_image_metropolitan_museum_reingestion_20221204t00" already exists

In each case I checked, this is the first error in the create_loading_table task (i.e., it's not only thrown on retries).

Reproduction

See day_shift_2 or day_shift_1 for the 2022-12-04 run in production.

stacimc commented 1 year ago

Adding to the DAG Stability milestone as this is high priority.

AetherUnbound commented 1 year ago

Similar to WordPress/openverse#1417, I wonder if adding the day shift to the table name would make the value sufficiently unique to prevent this issue.

stacimc commented 1 year ago

We do already append the day_shift to the load table name, via the identifier param created here, though 🤔

I could see there being an issue specifically on day_shift 0, because that suffix ends up being appended both for a normal ingestion run and for the first run of a reingestion workflow, so it makes sense to add some logic to only append the day_shift if it's a reingestion run. But I'm not immediately sure why this is happening, much less only on certain days. From the logs for the 2022-10-30 run, create_loading_table_day_shift_4 task:

[2022-11-02, 05:03:42 UTC] {sql.py:315} INFO - Running statement: 
  CREATE TABLE public.provider_data_image_metropolitan_museum_reingestion_20221023T000000_4(
  foreign_identifier character varying(3000),
  [...]

[2022-11-02, 05:03:42 UTC] {taskinstance.py:1851} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 175, in execute
    return_value = self.execute_callable()
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 193, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/usr/local/airflow/openverse_catalog/dags/common/loader/sql.py", line 95, in create_loading_table
    postgres.run(table_creation_query)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/providers/common/sql/hooks/sql.py", line 295, in run
    self._run_command(cur, sql_statement, parameters)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/providers/common/sql/hooks/sql.py", line 320, in _run_command
    cur.execute(sql_statement)
psycopg2.errors.DuplicateTable: relation "provider_data_image_metropolitan_museum_reingestion_20221023t00" already exists

So it looks like the identifier is passed in correctly in the logged CREATE command, yet the table name in the DuplicateTable error is truncated.

That particular example is notable because it succeeded on its third attempt.
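
For illustration, a minimal sketch of the conditional append described above (hypothetical names, not the actual catalog code):

```python
def build_load_table_suffix(ts: str, day_shift: int, is_reingestion: bool) -> str:
    """Hypothetical sketch: only append the day shift for reingestion runs,
    so a normal ingestion run and reingestion day_shift 0 can't end up with
    the same load table suffix."""
    return f"{ts}_{day_shift}" if is_reingestion else ts
```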

AetherUnbound commented 1 year ago

Oh dear 😅 if it's getting truncated then we must be hitting the maximum table name length for Postgres (from the Postgres docs; emphasis mine):

The system uses no more than NAMEDATALEN-1 bytes of an identifier; longer names can be written in commands, but they will be truncated. By default, NAMEDATALEN is 64 so the maximum identifier length is 63 bytes. If this limit is problematic, it can be raised by changing the NAMEDATALEN constant in src/include/pg_config_manual.h.
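
A quick way to see this with the table name from the logged CREATE statement above (illustrative snippet only, not catalog code):

```python
# Postgres keeps NAMEDATALEN - 1 = 63 bytes of an identifier and folds
# unquoted identifiers to lowercase, so the time-of-day and day-shift
# suffix are silently dropped from this name.
NAMEDATALEN = 64

full_name = "provider_data_image_metropolitan_museum_reingestion_20221023T000000_4"
print(len(full_name))  # 69 -- over the 63-byte limit

truncated = full_name.lower()[: NAMEDATALEN - 1]
print(truncated)
# provider_data_image_metropolitan_museum_reingestion_20221023t00
```

Since the full names for the different day shifts differ only past the 63rd character, that would explain why they collide on the same relation name in the DuplicateTable error.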

Looks like we may need to figure out a shorthand for these temp tables 😬
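
One possible shape for that shorthand (purely illustrative, hypothetical helper, not a decision from this thread):

```python
import hashlib

def shorten_identifier(name: str, max_len: int = 63) -> str:
    """Illustrative only: keep identifiers within Postgres's 63-byte limit by
    replacing the overlong tail with a short hash so distinct names stay distinct."""
    if len(name) <= max_len:
        return name
    digest = hashlib.md5(name.encode()).hexdigest()[:8]
    return f"{name[: max_len - len(digest) - 1]}_{digest}"

print(shorten_identifier(
    "provider_data_image_metropolitan_museum_reingestion_20221023T000000_4"
))
# the day-shift suffix now changes the hash, so the names no longer collide
```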