WordPress / openverse-catalog

Identifies and collects data on cc-licensed content across web crawl data and public apis.
https://openverse.org
MIT License

Log last query_params hit before AirflowTaskTimeout #1058

Closed stacimc closed 1 year ago

stacimc commented 1 year ago

Fixes

Description

The ProviderDataIngester includes error handling that, among other things, logs the last query_params reached by the DAG before an error is encountered during ingestion. This is helpful for debugging or resuming a failed DAG. AirflowExceptions are exempted from this custom handling, which means we also don't get the query_params logging when a task is stopped by Airflow itself.

This PR leaves the handling of those exceptions unchanged and only adds a log of the last query_params hit before re-raising. This is useful when a pull_data task times out, because we can see exactly how far the DAG got and resume it from that point.
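A minimal sketch of the pattern described above, not the actual ProviderDataIngester code. The exception classes are local stand-ins for Airflow's exceptions so the example runs without Airflow installed, and `ingest`, `fetch_batch`, and `param_sequence` are hypothetical names chosen for illustration.

```python
import json
import logging

logger = logging.getLogger(__name__)


class StandInAirflowException(Exception):
    """Local stand-in for Airflow's base exception (assumption, not the real class)."""


class StandInTaskTimeout(StandInAirflowException):
    """Local stand-in for the timeout Airflow raises when a task runs too long."""


def ingest(param_sequence, fetch_batch):
    """Call fetch_batch for each set of query params, logging the last one on failure."""
    query_params = None
    try:
        for query_params in param_sequence:
            fetch_batch(query_params)
    except StandInAirflowException:
        # Handling is unchanged: the exception still propagates to the scheduler.
        # The only addition is a log line recording how far ingestion got.
        logger.info("Last query_params used: %s", json.dumps(query_params))
        raise
    except Exception:
        # Other errors go through custom handling, which also records the params.
        logger.error("Ingestion failed at query_params: %s", json.dumps(query_params))
        raise
```

Logging inside the `except` branch and then re-raising with a bare `raise` keeps the original exception and traceback intact, so the task still fails exactly as it did before.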

Testing Instructions

Update the pull_timeout for a provider ingester to something small. You can do this in the provider_workflows.py file or via the Airflow variable as described in #976. I updated the Metropolitan Museum workflow to have a 1-minute pull timeout.

Then run the DAG locally and wait for the pull_data step to time out. The task should raise an AirflowTaskTimeout as normal, but when you view the task logs you should be able to scroll up and see a log like:

[2023-03-23, 18:18:41 UTC] {provider_data_ingester.py:219} INFO - Last query_params used: {"metadataDate": "2016-09-04"}
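To resume from that point, the params can be copied out of the log by hand. As a rough sketch, they could also be extracted programmatically, assuming the params are always logged as JSON in the format shown in the sample line above. `extract_last_query_params` is a hypothetical helper, not part of the catalog codebase.

```python
import json
import re

# Matches the message format shown in the sample log line and captures the JSON object.
LOG_PATTERN = re.compile(r"Last query_params used: (\{.*\})\s*$")


def extract_last_query_params(log_line):
    """Return the logged query params as a dict, or None if the line does not match."""
    match = LOG_PATTERN.search(log_line)
    if match is None:
        return None
    return json.loads(match.group(1))


sample = (
    "[2023-03-23, 18:18:41 UTC] {provider_data_ingester.py:219} INFO - "
    'Last query_params used: {"metadataDate": "2016-09-04"}'
)
params = extract_last_query_params(sample)
```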

