Open stacimc opened 1 year ago
While investigating this, I noticed that the Metropolitan reingestion flow is exceeding its dagrun_timeout of 23 hours and skipping many of its reingestion tasks. I was about to open a separate issue for this, but after looking a little closer I think it may actually be the cause of this issue.
The task timeout for the pull_data task for Met is currently 16 hours, but the logs for the most recent pull_data timeout failures show that the tasks timed out after only 12 hours (although 16 hours is correctly configured in the Task Instance Details). I think it's possible the tasks are timing out early when the entire DAG times out at 23 hours. We should be able to verify this, but I've removed the help wanted label because this will require investigating the logs in production.
Since the Metropolitan reingestion workflow runs @weekly, we can increase the dagrun timeout. Alternatively, we should consider reducing the number of reingestion tasks so that a run completes within the 23-hour timeframe, and then update the schedule to @daily.
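To make the hypothesis concrete, here is a minimal sketch of the arithmetic, assuming a running task is cut off as soon as either its own timeout or the DAG run's timeout is reached. Whether Airflow actually behaves this way is exactly what needs to be verified in the production logs. The function name `effective_runtime` and the 11-hour start offset are hypothetical; the offset is chosen only because it would reproduce the observed 12-hour figure.

```python
from datetime import timedelta


def effective_runtime(task_timeout, dagrun_timeout, start_offset):
    """Longest a task could run before either limit cuts it off.

    start_offset is how far into the DAG run the task started.
    Assumption (unverified): a DAG-run timeout also stops running tasks.
    """
    remaining_in_dagrun = dagrun_timeout - start_offset
    return max(timedelta(0), min(task_timeout, remaining_in_dagrun))


# Values taken from the discussion above.
DAGRUN_TIMEOUT = timedelta(hours=23)
PULL_DATA_TIMEOUT = timedelta(hours=16)

# Hypothetical: a pull_data task that starts 11 hours into the DAG run.
late_start = timedelta(hours=11)
print(effective_runtime(PULL_DATA_TIMEOUT, DAGRUN_TIMEOUT, late_start))  # 12:00:00

# A task that starts within the first 7 hours is limited by its own timeout.
early_start = timedelta(hours=2)
print(effective_runtime(PULL_DATA_TIMEOUT, DAGRUN_TIMEOUT, early_start))  # 16:00:00
```

If the production logs show the early-failing tasks starting roughly 11 hours into their DAG runs, that would be consistent with the DAG-level timeout being the real cause.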
Description
We have a number of failures in the metropolitan_reingestion_workflow caused by AirflowTaskTimeouts during the pull_data step. Per context in this comment, it's possible this is being caused by the entire DAG timing out, rather than the individual task. We should investigate to make sure.
Possible fixes:
- Increase the task timeout, if only the pull_data step is timing out.
- Increase the dagrun_timeout of the entire DAG.
- Reduce the _list_length fields in provider_reingestion_workflows.py (https://github.com/WordPress/openverse-catalog/blob/main/openverse_catalog/dags/providers/provider_reingestion_workflows.py#L80).