# :blowfish: USF IMaRS Airflow DAGs
Don't hesitate to open an issue if you are confused or something isn't working for you. This project is under heavy development, so the documentation and code likely contain errors.
Please see details, including the short version of the simplest workflow, in the (private to IMaRS users) DAG Development Workflows document.
Additional documentation is in the `./docs` directory.
By convention, each operator's `task_id` string matches the operator's variable name. Data processing DAGs in general follow the ETL pattern by using the helpers in `dags/util/etl_tools/`.
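This naming convention can be illustrated with a minimal, self-contained sketch (a plain-Python stand-in, not a real Airflow operator and not code from this repo):

```python
# Minimal stand-in for an Airflow operator, used only to illustrate the
# convention that an operator's task_id matches its variable name.
class FakeOperator:
    def __init__(self, task_id):
        self.task_id = task_id

# Following the convention: variable name == task_id string.
extract_file = FakeOperator(task_id="extract_file")
process_file = FakeOperator(task_id="process_file")

def find_convention_violations(namespace):
    """Return variable names whose operator's task_id does not match."""
    return [
        name for name, obj in namespace.items()
        if isinstance(obj, FakeOperator) and obj.task_id != name
    ]

violations = find_convention_violations(
    {"extract_file": extract_file, "process_file": process_file}
)
print(violations)  # -> []
```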
A typical automated processing pipeline should look something like:
```
[external data src]---(ingest DAG)--->(imars-etl.load)<---------------------\
                                           |                                 \
      [imars_product_metadata db]<---------|--->[IMaRS object storage]        \
             |                                          |                      \
   (product FileTriggerDAG)                    (imars-etl.extract)             |
             |                                          |                      |
             |--->(product processing DAG)<--\          |                      |
             |--->(product processing DAG)<---[local copy of input data]<------|
             |--->(product processing DAG)<--/                                 |
             |---> ...               \\\                                       |
                                      \--->[local copy of output data]--------/
```
Within this there are 3 types of DAGs to be defined: ingest DAGs, product FileTriggerDAGs, and product processing DAGs.
A MySQL database of IMaRS data/image products is maintained independently of Airflow. This database (imars_product_metadata) contains information about the data products such as the datetime of each granule, the product "type", and coverage areas. This information can be searched by connecting to the database directly or through the imars-etl package. The database serves two functions: it triggers processing DAGs when new files arrive (via FileTriggerDAGs), and it locates files so they can be fetched for processing (via imars-etl).
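As a rough illustration of the kind of query the metadata database supports, here is a sketch using an in-memory SQLite stand-in. The real database is MySQL and its schema is not shown here; the table and column names below are assumptions for illustration only:

```python
import sqlite3

# In-memory stand-in for imars_product_metadata; schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file (filepath TEXT, product_type TEXT,"
    " date_time TEXT, area TEXT)"
)
conn.executemany(
    "INSERT INTO file VALUES (?, ?, ?, ?)",
    [
        ("/srv/data/granule_a.nc", "chlor_a", "2018-01-01T12:00:00", "florida"),
        ("/srv/data/granule_b.nc", "chlor_a", "2018-01-02T12:00:00", "cuba"),
    ],
)

# Find granules of a given product type covering a given area:
rows = conn.execute(
    "SELECT filepath FROM file WHERE product_type = ? AND area = ?",
    ("chlor_a", "florida"),
).fetchall()
print(rows)  # -> [('/srv/data/granule_a.nc',)]
```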
## Using imars-etl

Using imars-etl is critical for fetching and uploading IMaRS data products.
"ETL" is short for Extract-Transform-Load, and it describes data processing in general: input data is extracted from storage, transformed by processing, and the output is loaded back into storage.
The imars-etl package aims to simplify the "extract" and "load" steps by hiding the complexity of IMaRS' data systems behind a nice CLI.
To simplify things even further for Airflow DAGs, `./dags/util/etl_tools` includes some helper functions to set up imars-etl operators automatically.
The helper will add extract, load, and cleanup operators to your DAG to wrap around your processing operators like so:

```
(imars-etl extract)-->(your processing operators)-->(imars-etl load)
         \                                                \
          \-----------------------------------------------\-->(clean up local files)
```
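The wrapping pattern above can be sketched in plain Python. The function name `wrap_with_etl` and the list-based representation are illustrative assumptions, not the real etl_tools API:

```python
# Hypothetical sketch of the etl_tools wrapping pattern: given the user's
# processing steps, surround them with extract, load, and cleanup steps.
def wrap_with_etl(processing_steps):
    """Return the full task sequence: extract -> processing -> load -> cleanup."""
    return (
        ["imars-etl extract"]
        + list(processing_steps)
        + ["imars-etl load", "clean up local files"]
    )

pipeline = wrap_with_etl(["l2gen", "l3gen"])
print(pipeline)
# -> ['imars-etl extract', 'l2gen', 'l3gen', 'imars-etl load', 'clean up local files']
```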
## FileTriggerDAG

A FileTriggerDAG is a DAG which checks the IMaRS product metadata database for new files and starts up processing DAGs.
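The core FileTriggerDAG idea can be sketched as a simple polling step. The function and argument names here are illustrative, not the real implementation:

```python
# Hypothetical sketch of FileTriggerDAG logic: query the metadata db for
# unprocessed files, then trigger one processing DAG run per file.
def trigger_processing(query_new_files, trigger_dag_run):
    """Poll once: trigger a DAG run for each new file and return the list."""
    new_files = query_new_files()
    for filepath in new_files:
        trigger_dag_run(filepath)
    return new_files

triggered = []
result = trigger_processing(
    query_new_files=lambda: ["/srv/data/granule_a.nc"],
    trigger_dag_run=triggered.append,
)
print(result)     # -> ['/srv/data/granule_a.nc']
print(triggered)  # -> ['/srv/data/granule_a.nc']
```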
## Installation

Installation should be handled by the imars_airflow puppet module, but the general approach is an editable install using setup.py:

```
pip install -e .
```

## Testing

All tests included here can be run using pytest.
IMPORTANT: you must run pytest from the parent directory of this repo, and you must use the `python -m pytest ...` syntax. For example, if `imars_dags` is in `/home/dags/imars_dags`:

```
cd /home/dags
python3 -m pytest ./imars_dags/
```
If you do not run pytest in this way your tests will throw ImportErrors because of the unusual layout of this repo (`python -m pytest` adds the current directory to `sys.path`, which a plain `pytest` invocation does not).
Include this from a pip `requirements.txt` like:

```
-e git://github.com/USF-IMARS/imars_dags.git@master#egg=imars_airflow_config
```