WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License

Increase Metropolitan reingestion timeout #1293

Open stacimc opened 1 year ago

stacimc commented 1 year ago

Description

We have a number of failures in the metropolitan_reingestion_workflow caused by AirflowTaskTimeouts during the pull_data step.

Per context in this comment, it's possible this is being caused by the entire DAG timing out, rather than the individual task. We should investigate to make sure.

Possible fixes:

stacimc commented 1 year ago

While investigating this, I noticed that the Metropolitan reingestion flow is exceeding its dagrun_timeout of 23 hours and skipping many of its reingestion tasks. I was about to make a separate issue for this, but after looking a little closer I think it may actually be the cause of this issue.

The task timeout for Met's pull_data task is currently 16 hours, but the logs for the most recent pull_data timeout failures show that the tasks timed out after only 12 hours (although 16 hours is correctly configured in the Task Instance Details). I think it's possible the tasks are timing out early because the entire DAG times out at 23 hours. We should be able to verify this, but I've removed the help wanted label as this will require investigating the logs in production.
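To make the hypothesis concrete, here is a minimal sketch of the arithmetic, not Airflow's actual implementation. It assumes a task is stopped by whichever limit is reached first: its own task timeout (`execution_timeout` in Airflow) or the time left before the DAG run hits `dagrun_timeout`. The function name and the 11-hour start offset are hypothetical; the offset was picked only because it would reproduce the observed 12 hours and has not been checked against the production logs.

```python
from datetime import timedelta


def effective_task_limit(task_timeout, dagrun_timeout, started_after):
    """Return how long a task can run before either limit stops it.

    started_after is how far into the DAG run the task began.
    """
    # Time left in the DAG run when the task starts, never negative.
    remaining_in_dagrun = max(dagrun_timeout - started_after, timedelta(0))
    # Whichever limit is hit first ends the task.
    return min(task_timeout, remaining_in_dagrun)


TASK_TIMEOUT = timedelta(hours=16)    # configured pull_data timeout
DAGRUN_TIMEOUT = timedelta(hours=23)  # configured dagrun_timeout

# Hypothetical: this pull_data task started 11 hours into the DAG run.
limit = effective_task_limit(TASK_TIMEOUT, DAGRUN_TIMEOUT, timedelta(hours=11))
print(limit)  # 12:00:00
```

If the hypothesis holds, the failed tasks' start times in the production logs should all fall about 11 hours after the start of their DAG run.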

Since the metropolitan reingestion workflow runs @weekly, we can increase the dagrun timeout. Alternatively, we should consider reducing the number of reingestion tasks so that a run can complete within the 23-hour timeframe, and then update the schedule to @daily.
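The trade-off between the two options can be sketched as a simple check that a run's timeout stays shorter than the gap between scheduled runs. This is an illustration only: the helper name is hypothetical, and the 3-day value is an arbitrary example rather than a proposed setting.

```python
from datetime import timedelta

# Gap between consecutive runs for the two schedules discussed above.
SCHEDULE_INTERVALS = {
    "@daily": timedelta(days=1),
    "@weekly": timedelta(weeks=1),
}


def fits_schedule(dagrun_timeout, schedule):
    """Return True if a run finishes or times out before the next run is due."""
    return dagrun_timeout < SCHEDULE_INTERVALS[schedule]


current = timedelta(hours=23)  # current dagrun_timeout
longer = timedelta(days=3)     # arbitrary example of an increased timeout

print(fits_schedule(current, "@daily"))   # True
print(fits_schedule(longer, "@weekly"))   # True
print(fits_schedule(longer, "@daily"))    # False
```

In other words, the weekly schedule leaves room to raise the timeout well past 23 hours, while a move to @daily only works if each run reliably stays within the existing 23-hour limit.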