Metro-Records / la-metro-dashboard

An Airflow-based dashboard for LA Metro

Tag analytics DAG needs to decrypt files #123

Closed antidipyramid closed 9 months ago

antidipyramid commented 11 months ago

Connected to Metro-Records/la-metro-councilmatic#1043

With Councilmatic's switch to Heroku, we no longer use encrypted settings, so we modified the DAGs that spin up a Councilmatic container to use a new TaggedDockerOperator instead of the old BlackboxDockerOperator.

However, the analytics DAG still needs to decrypt Councilmatic's configs/lametro_service_acct_key.json.gpg in order to upload analytics to Google Drive.

I see a couple ways to address this:

  1. The scraper DAGs still need us to concatenate configs/connection_settings.py to pupa_settings.py. We could create separate Blackbox Docker operators for scraper and councilmatic jobs. The former would update the scrapers' pupa_settings.py as usual and the latter would just run blackbox_postdeploy.

  2. Update pupa_settings.py in the scrapers repo with the required DB configs in configs/connection_settings.py. That way, we don't add an extra operator and can get rid of configs/connection_settings.py and scripts/concat_settings.sh. BlackboxDockerOperator would only be responsible for running blackbox_postdeploy.

Option two seems simpler to me in the end. Any reason we shouldn't try it first? @hancush @fgregg

fgregg commented 11 months ago

i defer to hannah, but i would be interested in seeing what it could look like to use airflow’s secret facilities. https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/index.html
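For reference, the docs linked above show that pointing Airflow at an external secrets backend is just configuration. A sketch of what that could look like with AWS Secrets Manager, assuming the `apache-airflow-providers-amazon` package is installed (the prefixes are illustrative, not anything we've set up):

```ini
[secrets]
backend = airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend
backend_kwargs = {"connections_prefix": "airflow/connections", "variables_prefix": "airflow/variables"}
```

With a backend configured, `Variable.get()` and `{{ var.value.* }}` templating resolve against the external store before falling back to the metadata database.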


antidipyramid commented 11 months ago

@fgregg @hancush

It's possible to set variables in the Airflow UI (similar to what we currently do with Heroku env vars). These variables are encrypted and can be masked after assignment. We could store the service account key as a variable in Airflow and pass it to the DockerOperator as an env var.

That'd look like this:

from datetime import timedelta

from airflow import DAG

# TaggedDockerOperator lives in plugins/operators/blackbox_docker_operator.py;
# START_DATE and the LA_METRO_* values come from our existing dashboard settings
from operators.blackbox_docker_operator import TaggedDockerOperator

default_args = {
    "start_date": START_DATE,
    "execution_timeout": timedelta(minutes=10),
    "image": LA_METRO_IMAGE_URL,
    "environment": {
        **LA_METRO_CONFIGS,
        "DATABASE_URL": LA_METRO_DATABASE_URL,
        "SEARCH_URL": LA_METRO_SEARCH_URL,
        "SENTRY_ENVIRONMENT": ENVIRONMENT,
        "GOOGLE_SERVICE_ACCT_API_KEY": "{{ var.value.google_service_acct_api_key }}", # pull from airflow variables
    },
}

with DAG(
    "tag_analytics",
    catchup=False,
    default_args=default_args,
    schedule_interval="0 0 1 * *",
    description="Generates analytics for Metro agenda tags and "
    "uploads a CSV file to Google Drive",
) as dag:
    t1 = TaggedDockerOperator(
        task_id="generate_tag_analytics",
        command="python manage.py generate_tag_analytics",
    )

We'd just have to amend the Councilmatic app to authenticate to Google Drive with a JSON object parsed from the GOOGLE_SERVICE_ACCT_API_KEY env var instead of the current file. The upside is that we'd no longer need to use Blackbox in the Councilmatic app at all!
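A minimal sketch of the Councilmatic-side change (the helper name is illustrative, not existing code; the env var name matches the snippet above):

```python
import json
import os


def load_service_acct_info(env_var="GOOGLE_SERVICE_ACCT_API_KEY"):
    """Parse the Google service account key JSON out of the environment."""
    raw = os.environ.get(env_var)
    if raw is None:
        raise RuntimeError(f"{env_var} is not set")
    return json.loads(raw)
```

The resulting dict can then be handed to google-auth's `service_account.Credentials.from_service_account_info()` in place of reading the decrypted key file from disk.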

(Note that the docs also recommend using a custom or external secrets manager, e.g. AWS Secrets Manager, for sensitive vars, but I'm not sure that's necessary here.)

fgregg commented 11 months ago

i think that approach makes sense. if we do it, we should probably also move other variables out of environment-specific settings.py files so we have a consistent way of doing things.

however, this is a big departure from our current pattern, so @hancush should have the last word.

hancush commented 11 months ago

Seems fine to me, so long as whatever we're doing is documented in the README.

I also abdicate having the final word on decision making until I return from leave. Thanks, guys!

fgregg commented 11 months ago

okay, @antidipyramid, let's do a PR where you just handle this secret.

then later, we can do a code quality PR that migrates existing secrets into this pattern.