astronomer / dag-factory

Dynamically generate Apache Airflow DAGs from YAML configuration files
Apache License 2.0

[Feature] Shared defaults for load_yaml_dags #297

Open wearpants opened 4 days ago

wearpants commented 4 days ago

Description

It'd be nice to pass some shared default_args for a directory, either via a python object or a defaults.yml file in the directory.

Use case/motivation

One DAG per file is easier for users IMO, and as a system administrator I'd like to be able to give them a shared set of pre-baked defaults (env vars, etc.)

Related issues

#289, #290

Are you willing to submit a PR?

cmarteepants commented 4 days ago

Something like this?

./dags/default.yml

default:
  catchup: false
  default_args:
    start_date: "2024-01-01"
  schedule_interval: "0 0 * * *"
  tasks:
    extract:
      operator: airflow.operators.python.PythonOperator
      python_callable_file: /usr/local/airflow/include/etl_helpers.py
      python_callable_name: extract_helper
    load:
      dependencies:
      - transform
      operator: airflow.operators.python.PythonOperator
      python_callable_file: /usr/local/airflow/include/etl_helpers.py
      python_callable_name: load_helper
    transform:
      dependencies:
      - extract
      op_kwargs:
        ds_nodash: '{{ds_nodash}}'
      operator: airflow.operators.python.PythonOperator
      python_callable_file: /usr/local/airflow/include/etl_helpers.py
      python_callable_name: transform_helper

./dags/bi.yml

business_analytics:
  schedule_interval: "@daily"
  tasks:
    load:
      op_kwargs:
        database_name: BA
        table_name: inventory

./dags/ds.yml

data_science:
  tasks:
    load:
      op_kwargs:
        database_name: DS
        table_name: daily_sales

./dags/ml.yml

machine_learning:
  tasks:
    load:
      op_kwargs:
        database_name: ML
        table_name: training_data

...

jroach-astronomer commented 4 days ago

@cmarteepants, the only thing I think I'd add here is referencing the default values in ./dags/bi.yml, etc.

wearpants commented 3 days ago

@cmarteepants So basically bi.yml etc. are merged on top of default.yml? Could you clarify how that works: does the merge happen across the entire YAML object tree, key by key, with lists extended, etc.? How would you do overrides? (Take a look at ChainMap for a simple comparison.)

I had been mainly thinking of this only for default_args, and that defaults.yml wouldn't provide any tasks (could use cross-DAG dependencies for that)... but if defaults.yml is more like a template / base class that can be extended/overridden, that opens up some interesting possibilities, though it's not totally clear how that would work.

Docker Compose does something similar, but its merge rules are kind of ad hoc, yet sensible.
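To make the merge question concrete, here is a minimal sketch contrasting a shallow, ChainMap-style overlay with a recursive merge. The dicts are trimmed-down stand-ins for the parsed default.yml and bi.yml above, and `deep_merge` is a hypothetical helper written for illustration; it is not taken from dag-factory, and its rule of replacing lists rather than extending them is just one possible choice.

```python
from collections import ChainMap
from copy import deepcopy

# Trimmed-down stand-ins for the parsed default.yml and bi.yml from this thread.
default = {
    "schedule_interval": "0 0 * * *",
    "tasks": {
        "extract": {"python_callable_name": "extract_helper"},
        "load": {
            "python_callable_name": "load_helper",
            "dependencies": ["transform"],
        },
    },
}
bi = {
    "schedule_interval": "@daily",
    "tasks": {
        "load": {"op_kwargs": {"database_name": "BA", "table_name": "inventory"}},
    },
}


def deep_merge(base, override):
    """Hypothetical helper: merge nested dicts key by key.

    Non-dict values (including lists) in `override` replace those in `base`.
    Neither input is mutated.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# ChainMap only resolves top-level keys: bi's "tasks" hides default's entirely.
shallow = dict(ChainMap(bi, default))

# The recursive merge keeps default's tasks and layers bi's op_kwargs on top.
deep = deep_merge(default, bi)

print(sorted(shallow["tasks"]))
print(sorted(deep["tasks"]))
print(deep["tasks"]["load"])
```

With the shallow overlay, the `extract` task from the defaults disappears as soon as a DAG file defines any `tasks` key of its own, which is why the choice of merge rules matters for the template-style usage.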

cmarteepants commented 3 days ago

@wearpants If everything is contained in the same YAML file, as it is today, then yes: anything in default works like a template that can be extended AND overridden.

As for the how? I'll be honest: I haven't delved much into the source code to understand how this was implemented. It could be something we are getting "for free" from pyyaml, but I never looked into it, as the capability was around before Astronomer took over the project. I opened issue #295 so we can document this properly. The examples in that issue cover extending, overriding and even generating the exact same DAG structure with different task ids, and they all work.

I really like your idea about splitting up the definitions into different files though, and allowing for different defaults per folder. I'd even go so far as to push that as a best practice. We'd need to allow for an order of precedence, but assuming we can pull it off (and I don't see why not, but I'm the PM :D), I agree that it would be really powerful.
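One possible order of precedence, sketched purely for discussion: look for a defaults file in every folder from the DAGs root down to the folder holding the DAG file, and merge them in that order so the nearest folder wins. The function name `defaults_chain`, the file name `defaults.yml` and the example paths are all assumptions for illustration, not an existing dag-factory API.

```python
from pathlib import PurePosixPath


def defaults_chain(dag_file, root, name="defaults.yml"):
    """Hypothetical helper: list candidate defaults files for one DAG file.

    The list runs from `root` down to the DAG file's own folder. Entries
    later in the list are closer to the DAG file, so they would take
    precedence if the files were merged in list order.
    """
    root = PurePosixPath(root)
    folder = PurePosixPath(dag_file).parent
    # Raises ValueError if the DAG file does not live under `root`.
    relative = folder.relative_to(root)

    chain = [root / name]
    current = root
    for part in relative.parts:
        current = current / part
        chain.append(current / name)
    return chain


# Example layout (assumed): dags/analytics/bi/inventory.yml
for candidate in defaults_chain("dags/analytics/bi/inventory.yml", "dags"):
    print(candidate)
```

The sketch only computes candidate paths; checking which files actually exist, loading them and merging them would be separate steps, and the merge rules themselves are still the open question in this thread.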

I'll have someone on the engineering team start looking into this within the next few sprints. Do you want to be kept up to date as things progress?

wearpants commented 3 days ago

@cmarteepants Yes, please keep me in the loop. Happy to hop on a quick design brainstorming call as well, if that would be helpful.