Increase Metropolitan reingestion timeout

WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.

MIT License


Description

We have a number of failures in the metropolitan_reingestion_workflow caused by AirflowTaskTimeouts during the pull_data step.

Per context in this comment, it's possible this is being caused by the entire DAG timing out, rather than the individual task. We should investigate to make sure.

Possible fixes:

Increase the task timeout for pull_data, if the pull_data step is indeed the one timing out.

Increase the dagrun_timeout of the entire DAG.

Decrease the number of reingestion days for Metropolitan by adjusting the _list_length fields in provider_reingestion_workflows.py (https://github.com/WordPress/openverse-catalog/blob/main/openverse_catalog/dags/providers/provider_reingestion_workflows.py#L80).

While investigating this, I noticed that the Metropolitan reingestion flow is exceeding its dagrun_timeout of 23 hours and skipping many of its reingestion tasks. I was about to make a separate issue for this, but after looking a little closer I think it may actually be the cause of this issue.

The task timeout for the Met pull_data task is currently 16 hours, but a glance through the logs for the most recent pull_data timeout failures shows that they timed out after only 12 hours (even though 16 hours is correctly configured in the Task Instance Details). I think it's possible the tasks are timing out early because the entire DAG times out at 23 hours. We should be able to verify this, but I've removed the help wanted label as this will require investigating the logs in production.
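The hypothesis above can be checked with simple arithmetic: if a running task is stopped when the DAG run hits its dagrun_timeout, the task effectively gets whichever is shorter, its own timeout or the time left in the DAG run. A minimal sketch, assuming that behaviour (it has not yet been verified against the production logs); the function name and the 11-hour start offset are illustrative and not taken from the logs:

```python
from datetime import timedelta


def effective_task_runtime(
    task_start_offset: timedelta,
    task_timeout: timedelta,
    dagrun_timeout: timedelta,
) -> timedelta:
    """Longest a task can run before either timeout stops it.

    Assumes (hypothesis from this issue, unverified) that a running task is
    stopped when the whole DAG run exceeds its dagrun_timeout.
    """
    # Time left in the DAG run when the task starts, never negative.
    remaining_in_dagrun = max(dagrun_timeout - task_start_offset, timedelta(0))
    return min(task_timeout, remaining_in_dagrun)


# Values from the issue: 16-hour task timeout, 23-hour dagrun_timeout.
TASK_TIMEOUT = timedelta(hours=16)
DAGRUN_TIMEOUT = timedelta(hours=23)

# Hypothetical: a pull_data task that starts 11 hours into the DAG run.
late_task = effective_task_runtime(timedelta(hours=11), TASK_TIMEOUT, DAGRUN_TIMEOUT)
print(late_task)  # 12:00:00
```

Under this model, a task configured for 16 hours that times out after 12 would have started roughly 11 hours into the run. Comparing the task start times in the production logs against the DAG run start time should confirm or rule this out.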

Since the Metropolitan reingestion workflow runs @weekly, we can increase the dagrun_timeout. Alternatively, we should consider reducing the number of reingestion tasks so that the workflow can complete within the 23-hour timeframe, and then update the schedule to @daily.
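To weigh the two options, a rough budget check may help. The sketch below assumes reingestion tasks run in fixed-size waves and that each task takes about the same time. The function name and every number other than the 23-hour timeout are hypothetical placeholders, to be replaced with figures from production:

```python
from datetime import timedelta


def max_tasks_within_timeout(
    dagrun_timeout: timedelta,
    avg_task_duration: timedelta,
    concurrency: int,
) -> int:
    """Upper bound on tasks that finish before the DAG run times out.

    Simplified model: tasks run in waves of `concurrency` tasks, and every
    task takes `avg_task_duration`.
    """
    if concurrency < 1 or avg_task_duration <= timedelta(0):
        raise ValueError("concurrency and avg_task_duration must be positive")
    # Dividing two timedeltas with // yields the whole number of waves that fit.
    waves = dagrun_timeout // avg_task_duration
    return waves * concurrency


# Hypothetical inputs: 2-hour average task, 3 tasks running at once.
budget = max_tasks_within_timeout(
    dagrun_timeout=timedelta(hours=23),
    avg_task_duration=timedelta(hours=2),
    concurrency=3,
)
print(budget)  # 33
```

If the number of reingestion tasks exceeds this bound, either the dagrun_timeout has to grow or the number of reingestion days has to shrink.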


Increase Metropolitan reingestion timeout #1293
