cc-archive / cccatalog

[PROJECT TRANSFERRED] Mapping the commons towards an open ledger and cc search.
https://github.com/WordPress/openverse-catalog
MIT License

[Feature] Apache Airflow DAG to run the new metropolitan_museum_of_art.py script. #358

Closed by mathemancer 4 years ago

mathemancer commented 4 years ago

Problem Description

To get the new metropolitan_museum_of_art.py script (see #278) into production, we need to implement a new Apache Airflow DAG that will run the script.

Solution Description

Implement such a DAG. For examples, see src/cc_catalog_airflow/dags/flickr_workflow.py and src/cc_catalog_airflow/dags/wikimedia_workflow.py. This DAG should be configured to run the main function from src/cc_catalog_airflow/dags/metropolitan_museum_of_art.py with the date parameter, once per day. It should have catchup=False. The concurrency and max_active_runs parameters should both be 1.
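As a rough illustration of the configuration described above, here is a minimal sketch, not the implementation that was merged. The DAG id, task id, and the `create_dag` helper are hypothetical names. The start date of 2020-01-01 is the one suggested later in this thread. The sketch also assumes that `main` accepts the date as its first positional argument in `YYYY-MM-DD` form, and that the installed Airflow version renders templates in `op_args`.

```python
from datetime import datetime

# Hypothetical identifiers: the issue does not name the DAG or its task.
DAG_ID = "metropolitan_museum_workflow"
TASK_ID = "pull_metropolitan_museum_data"

# Settings requested in the issue. The start date comes from the
# discussion in this thread, not from the issue body itself.
DAG_SETTINGS = {
    "schedule_interval": "@daily",       # run once per day
    "start_date": datetime(2020, 1, 1),
    "catchup": False,                    # do not backfill missed runs
    "concurrency": 1,                    # at most one task instance at a time
    "max_active_runs": 1,                # at most one DAG run at a time
}


def create_dag(main_function, dag_id=DAG_ID):
    """Build a DAG with a single task that calls main_function(date).

    Airflow is imported inside the function so that the settings above
    can be inspected without an Airflow installation.
    """
    from airflow import DAG
    try:
        from airflow.operators.python import PythonOperator  # Airflow 2.x
    except ImportError:
        from airflow.operators.python_operator import PythonOperator  # Airflow 1.10.x

    dag = DAG(dag_id=dag_id, **DAG_SETTINGS)
    PythonOperator(
        task_id=TASK_ID,
        python_callable=main_function,
        # "{{ ds }}" is rendered to the run's date as YYYY-MM-DD.
        # Assumption: main takes the date as its first positional argument.
        op_args=["{{ ds }}"],
        dag=dag,
    )
    return dag
```

In a DAG file, something like `dag = create_dag(metropolitan_museum_of_art.main)` at module level would let the scheduler discover it, assuming the script is importable from the dags folder. Later Airflow releases deprecate `concurrency` in favor of `max_active_tasks` and `schedule_interval` in favor of `schedule`, so the keyword names may need adjusting for the version in use.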

Alternatives

We may replace this DAG with some kind of DAG factory in the future, so it should be considered somewhat temporary.

kss682 commented 4 years ago

The start date for Flickr is 1970 because it has a few images dated that far back. Is there any such date for the Metropolitan Museum?

mathemancer commented 4 years ago

You can use 2020-01-01. We'll never need to run it back further than that date. The reason is that the 'date' parameter for this script actually pulls metadata for all images that have been updated since the given date, and the metadata for all images has been updated since the beginning of this year.