AtlasOfLivingAustralia / pipelines-airflow


pipelines-airflow

Airflow DAGs and supporting files for running pipelines on Apache Airflow with Amazon EMR (Elastic MapReduce).

Installation

These scripts have been tested with Amazon Managed Workflows for Apache Airflow (MWAA) and Amazon EMR.


DAGs

This section describes some of the important DAGs in this project.

load_dataset_dag.py

Steps:

load_provider_dag.py

Steps:

load_provider

ingest_small_datasets_dag.py

A DAG used by the Ingest_all_datasets DAG to load large numbers of small datasets using a single-node cluster in EMR. It does not run SOLR indexing. Includes the following options:

ingest_large_datasets_dag.py

A DAG used by the Ingest_all_datasets DAG to load large numbers of large datasets using a multi-node cluster in EMR. It does not run SOLR indexing. Includes the following options:
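
The difference between the single-node and multi-node clusters used by the two ingest DAGs can be sketched as an EMR job flow configuration. This is a minimal illustration, not the project's actual configuration: the function names, the instance type, and the node counts are assumptions. The dictionary keys follow the request shape of the EMR `RunJobFlow` API, and the fragment omits other settings a real cluster would need (release label, roles, applications).

```python
from typing import Any, Dict, List


def emr_instance_groups(core_nodes: int,
                        instance_type: str = "m5.xlarge") -> List[Dict[str, Any]]:
    """Build the InstanceGroups list for an EMR job flow request.

    core_nodes == 0 gives a single-node cluster (master only);
    core_nodes > 0 adds a CORE group, giving a multi-node cluster.
    The default instance type is an arbitrary placeholder.
    """
    if core_nodes < 0:
        raise ValueError("core_nodes must be >= 0")

    groups: List[Dict[str, Any]] = [{
        "Name": "Master",
        "Market": "ON_DEMAND",
        "InstanceRole": "MASTER",
        "InstanceType": instance_type,
        "InstanceCount": 1,
    }]
    if core_nodes > 0:
        groups.append({
            "Name": "Core",
            "Market": "ON_DEMAND",
            "InstanceRole": "CORE",
            "InstanceType": instance_type,
            "InstanceCount": core_nodes,
        })
    return groups


def job_flow_overrides(name: str, core_nodes: int) -> Dict[str, Any]:
    """Assemble a partial job flow definition (hypothetical helper)."""
    return {
        "Name": name,
        "Instances": {
            "InstanceGroups": emr_instance_groups(core_nodes),
            # Terminate the cluster once its steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
    }


# Hypothetical cluster names and sizes, for illustration only.
small_cluster = job_flow_overrides("ingest-small-datasets", core_nodes=0)
large_cluster = job_flow_overrides("ingest-large-datasets", core_nodes=4)
```

A dictionary of this shape is what the `job_flow_overrides` argument of Airflow's `EmrCreateJobFlowOperator` accepts; whether the DAGs in this repository build their clusters this way is not stated here.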

ingest_all_datasets_dag.py

Steps:


full_index_to_solr.py

Steps:

solr_dataset_indexing

Runs SOLR indexing for a single dataset into the live index. This does not run the all-dataset processes (Jackknife etc.).
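
As a rough sketch of how a single-dataset indexing run might be requested, the snippet below builds the URL and JSON body for the Airflow 2 stable REST API endpoint `POST /api/v1/dags/{dag_id}/dagRuns`. Several things here are assumptions rather than facts from this project: the DAG id is taken from the heading above, the `datasetId` conf key and its value are hypothetical, and the base URL is a placeholder. The code only constructs the request and does not send it.

```python
import json
from typing import Any, Dict, Tuple


def build_dag_run_request(base_url: str, dag_id: str,
                          conf: Dict[str, Any]) -> Tuple[str, str]:
    """Return (url, json_body) for triggering a DAG run with a conf payload."""
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    body = json.dumps({"conf": conf})
    return url, body


url, body = build_dag_run_request(
    "http://localhost:8080",       # placeholder Airflow web server address
    "solr_dataset_indexing",       # assumed DAG id, taken from the heading
    {"datasetId": "dr0000"},       # hypothetical conf key and dataset id
)
```

The same conf payload could equally be supplied through the "Trigger DAG w/ config" form in the Airflow UI; the parameter names the DAG actually expects should be checked in its source.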