WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License
236 stars 187 forks source link

Move API key Variable access to provider ingester init function #1457

Closed AetherUnbound closed 1 year ago

AetherUnbound commented 2 years ago

Description

Presently the API keys are pulled from Airflow Variables, typically at the top of a provider script. Here's an example:

https://github.com/WordPress/openverse-catalog/blob/0072114ea479b378a94e1144340935023acbdf3d/openverse_catalog/dags/providers/provider_api_scripts/brooklyn_museum.py#L22

Unfortunately this means that the Variable is retrieved from Airflow (and thus queried from the database) on every DAG parse iteration. It isn't needed until actual DAG runtime, so we should consider deferring its instantiation until the ProviderDataIngester::__init__ function rather than at the top of the module.

Not all of these scripts are converted into the new provider ingester class, so we'll have to wait to complete this ticket until (or as part of) milestone v1.3.2.

Additional context

Below is an example of the current process. I added the following line to Airflow's variable.py file within the docker container to track usage.

$ diff -Naur /home/madison/programs/anaconda3/envs/openverse-catalog/lib/python3.10/site-packages/airflow/models/variable.py variable.py 
--- /home/madison/programs/anaconda3/envs/openverse-catalog/lib/python3.10/site-packages/airflow/models/variable.py 2022-07-13 15:23:29.445279243 -0700
+++ variable.py 2022-08-19 15:38:52.992392655 -0700
@@ -130,6 +130,7 @@
         :param default_var: Default value of the Variable if the Variable doesn't exist
         :param deserialize_json: Deserialize the value to a Python dict
         """
+        print(f"Running get for {key}!")
         var_val = Variable.get_variable_from_secrets(key=key)
         if var_val is None:
             if default_var is not cls.__NO_DEFAULT_SENTINEL:

I then ran airflow dags list within the container, here was the output:

airflow@604f79cb5c26:~$ airflow dags list
/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/utils/log/file_task_handler.py:52 DeprecationWarning: Passing filename_template to FileTaskHandler is deprecated and has no effect
Running get for API_KEY_EUROPEANA!
Running get for API_KEY_FLICKR!
Running get for API_KEY_BROOKLYN_MUSEUM!
Running get for API_KEY_FREESOUND!
Running get for API_KEY_JAMENDO!
Running get for API_KEY_NYPL!
Running get for API_KEY_DATA_GOV!
Running get for API_KEY_WALTERS_ART_MUSEUM!
Running get for ENVIRONMENT!
Running get for PR_REVIEW_REMINDER_DRY_RUN!
Running get for GITHUB_API_KEY!
Running get for GITHUB_API_KEY!
dag_id                                | filepath                                                   | owner          | paused
======================================+============================================================+================+=======
airflow_log_cleanup                   | maintenance/airflow_log_cleanup_workflow.py                | data-eng-admin | True  
audio_data_refresh                    | data_refresh/dag_factory.py                                | data-eng-admin | True  
brooklyn_museum_workflow              | providers/provider_workflow_dag_factory.py                 | data-eng-admin | True  
check_silenced_dags                   | maintenance/check_silenced_dags/check_silenced_dags_dag.py | data-eng-admin | True  
cleveland_museum_workflow             | providers/provider_workflow_dag_factory.py                 | data-eng-admin | True  
commoncrawl_etl_workflow              | commoncrawl/commoncrawl_etl.py                             | data-eng-admin | True  
europeana_ingestion_workflow          | providers/provider_ingestion_workflow_dag_factory.py       | data-eng-admin | True  
europeana_workflow                    | providers/provider_workflow_dag_factory.py                 | data-eng-admin | True  
finnish_museums_workflow              | providers/provider_workflow_dag_factory.py                 | data-eng-admin | True  
flickr_ingestion_workflow             | providers/provider_ingestion_workflow_dag_factory.py       | data-eng-admin | True  
flickr_workflow                       | providers/provider_workflow_dag_factory.py                 | data-eng-admin | True  
freesound_workflow                    | providers/provider_workflow_dag_factory.py                 | data-eng-admin | True  
image_data_refresh                    | data_refresh/dag_factory.py                                | data-eng-admin | True  
image_expiration_workflow             | database/image_expiration_workflow.py                      | data-eng-admin | True  
inaturalist_workflow                  | providers/provider_workflow_dag_factory.py                 | data-eng-admin | None  
jamendo_workflow                      | providers/provider_workflow_dag_factory.py                 | data-eng-admin | True  
metropolitan_museum_workflow          | providers/provider_workflow_dag_factory.py                 | data-eng-admin | True  
museum_victoria_workflow              | providers/provider_workflow_dag_factory.py                 | data-eng-admin | True  
nypl_workflow                         | providers/provider_workflow_dag_factory.py                 | data-eng-admin | True  
oauth2_authorization                  | oauth2/authorize_dag.py                                    | data-eng-admin | True  
oauth2_token_refresh                  | oauth2/token_refresh_dag.py                                | data-eng-admin | True  
phylopic_workflow                     | providers/provider_workflow_dag_factory.py                 | data-eng-admin | True  
pr_review_reminders                   | maintenance/pr_review_reminders/pr_review_reminders_dag.py | data-eng-admin | True  
rawpixel_workflow                     | providers/provider_workflow_dag_factory.py                 | data-eng-admin | True  
recreate_audio_popularity_calculation | database/recreate_popularity_calculation_dag_factory.py    | data-eng-admin | True  
recreate_image_popularity_calculation | database/recreate_popularity_calculation_dag_factory.py    | data-eng-admin | True  
report_pending_reported_media         | database/report_pending_reported_media.py                  | data-eng-admin | True  
science_museum_workflow               | providers/provider_workflow_dag_factory.py                 | data-eng-admin | True  
smithsonian_workflow                  | providers/provider_workflow_dag_factory.py                 | data-eng-admin | True  
smk_workflow                          | providers/provider_workflow_dag_factory.py                 | data-eng-admin | True  
stocksnap_workflow                    | providers/provider_workflow_dag_factory.py                 | data-eng-admin | True  
sync_commoncrawl_workflow             | commoncrawl/sync_commoncrawl_workflow.py                   | data-eng-admin | True  
tsv_to_postgres_loader                | database/loader_workflow.py                                | data-eng-admin | True  
walters_workflow                      | providers/provider_workflow_dag_factory.py                 | data-eng-admin | True  
wikimedia_commons_workflow            | providers/provider_workflow_dag_factory.py                 | data-eng-admin | True  
wikimedia_ingestion_workflow          | providers/provider_ingestion_workflow_dag_factory.py       | data-eng-admin | True  
wordpress_workflow                    | providers/provider_workflow_dag_factory.py                 | data-eng-admin | True  

Note that these queries run on every DAG parse cycle.

Implementation

AetherUnbound commented 1 year ago

For future reference this file is located within the container at /home/airflow/.local/lib/python3.10/site-packages/airflow/models/variable.py, and you can alter the file to see the same results with airflow dags list using the following command:

sed -i '136i\        print(f"Running get for {key}!")' /home/airflow/.local/lib/python3.10/site-packages/airflow/models/variable.py
AetherUnbound commented 1 year ago

I ran the airflow dags list command on our current main () and got the following:

airflow@15b47f4014cb:/opt/airflow$ airflow dags list
Running get for GITHUB_API_KEY!
Running get for ENVIRONMENT!
Running get for PR_REVIEW_REMINDER_DRY_RUN!
Running get for GITHUB_API_KEY!

Looks like this is only from the PR review reminder DAG (addressed in WordPress/openverse-catalog#937) and the check_silenced_dags DAG:

https://github.com/WordPress/openverse-catalog/blob/382192584f6004537985f754f991e4d51cdbfa85/openverse_catalog/dags/maintenance/check_silenced_dags.py#L37

I think the provider DAG refactors addressed this, so we can go ahead and close this issue!