Closed AetherUnbound closed 1 year ago
For future reference, this file is located within the container at /home/airflow/.local/lib/python3.10/site-packages/airflow/models/variable.py, and you can alter the file to see the same results with airflow dags list using the following command:
sed -i '136i\ print(f"Running get for {key}!")' /home/airflow/.local/lib/python3.10/site-packages/airflow/models/variable.py
I ran the airflow dags list command on our current main and got the following:
airflow@15b47f4014cb:/opt/airflow$ airflow dags list
Running get for GITHUB_API_KEY!
Running get for ENVIRONMENT!
Running get for PR_REVIEW_REMINDER_DRY_RUN!
Running get for GITHUB_API_KEY!
Looks like this is only from the PR review reminder DAG (addressed in WordPress/openverse-catalog#937) and the check_silenced_dags DAG.
I think the provider DAG refactors addressed this, so we can go ahead and close this issue!
Description
Presently the API keys are pulled from Airflow Variables, typically at the top of a provider script. Here's an example:
https://github.com/WordPress/openverse-catalog/blob/0072114ea479b378a94e1144340935023acbdf3d/openverse_catalog/dags/providers/provider_api_scripts/brooklyn_museum.py#L22
Unfortunately, this means that the Variable is retrieved from Airflow (and thus queried from the database) on every DAG parse iteration. It isn't needed until actual DAG runtime, so we should consider deferring its instantiation until the ProviderDataIngester::__init__ function rather than at the top of the module.
Not all of these scripts are converted into the new provider ingester class, so we'll have to wait to complete this ticket until (or as part of) milestone v1.3.2.
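As a rough illustration of the deferral described above, here is a minimal sketch. It assumes a stand-in Variable class (so it runs without Airflow installed), a hypothetical ExampleDataIngester class, and a made-up key name API_KEY_EXAMPLE; none of these come from the catalog code. In a real provider script the lookup would go through airflow.models.Variable instead.

```python
# Stand-in for airflow.models.Variable so the sketch runs without Airflow.
# In a real provider script this would be: from airflow.models import Variable
class Variable:
    lookups = 0  # counts simulated database queries
    _store = {"API_KEY_EXAMPLE": "secret"}  # hypothetical key and value

    @classmethod
    def get(cls, key, default_var=None):
        cls.lookups += 1
        return cls._store.get(key, default_var)


class ExampleDataIngester:
    """Hypothetical ingester; stands in for a ProviderDataIngester subclass."""

    def __init__(self):
        # Deferred: the lookup happens only when the ingester is instantiated
        # at task runtime, not when the scheduler imports the module.
        self.api_key = Variable.get("API_KEY_EXAMPLE", default_var=None)


# Parsing the module defines the class but performs no lookup.
lookups_after_parse = Variable.lookups

# Instantiation (as would happen inside a running task) triggers one lookup.
ingester = ExampleDataIngester()
lookups_after_init = Variable.lookups
```

With a module-level `API_KEY = Variable.get(...)` the lookup would instead run each time the file is parsed, which is the behavior this issue aims to avoid.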
Additional context
Below is an example of the current process. I added the following line to Airflow's variable.py file within the Docker container to track usage.
I then ran airflow dags list within the container; here was the output:
Note that these queries run on every DAG parse cycle.
Implementation