Airflow DAGs and supporting files for running pipelines on Apache Airflow with Elastic Map Reduce.
These scripts have been tested with Airflow (MWAA) and EMR.
This section describes some of the important DAGs in this project.
A DAG used by the Ingest_all_datasets DAG to load large numbers of small datasets using a single node cluster in EMR. This will not run SOLR indexing.

Includes the following options:

load_images
- whether to load images for archives
skip_dwca_to_verbatim
- skip the DWCA to Verbatim stage (which is expensive), and just reprocess
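These options are supplied at trigger time in the run configuration. Below is a minimal sketch of how a task can read them; the dag_id and task name are hypothetical, and only the option names come from this project:

```python
import pendulum
from airflow.decorators import dag, task
from airflow.operators.python import get_current_context

@dag(
    dag_id="ingest_small_datasets_example",  # hypothetical id
    start_date=pendulum.datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
)
def ingest_small_datasets_example():
    @task
    def read_options() -> dict:
        # Options arrive in dag_run.conf when the DAG is triggered.
        conf = get_current_context()["dag_run"].conf or {}
        return {
            "load_images": conf.get("load_images", False),
            "skip_dwca_to_verbatim": conf.get("skip_dwca_to_verbatim", False),
        }

    read_options()

ingest_small_datasets_example()
```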
A DAG used by the Ingest_all_datasets DAG to load large numbers of large datasets using a multi node cluster in EMR. This will not run SOLR indexing.

Includes the following options:

load_images
- whether to load images for archives
skip_dwca_to_verbatim
- skip the DWCA to Verbatim stage (which is expensive), and just reprocess
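The single node versus multi node distinction comes down to the EMR job flow the DAG requests. The sketch below shows a multi node request via the Amazon provider's EmrCreateJobFlowOperator; the instance types, counts, release label, and IAM role names are illustrative assumptions, not this project's actual configuration:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrCreateJobFlowOperator

# Illustrative multi node job flow of the kind a large-datasets ingest
# might request. All values here are assumptions, not this project's.
JOB_FLOW_OVERRIDES = {
    "Name": "ingest-large-datasets",
    "ReleaseLabel": "emr-6.10.0",
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

with DAG(
    dag_id="emr_multi_node_example",  # hypothetical id
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_emr_cluster",
        job_flow_overrides=JOB_FLOW_OVERRIDES,
    )
```

A single node cluster is the same request with only the MASTER instance group.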
The Ingest_all_datasets DAG includes the following options:

load_images
- whether to load images for archives
skip_dwca_to_verbatim
- skip the DWCA to Verbatim stage (which is expensive), and just reprocess
run_index
- whether to run a complete reindex on completion of ingestion
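A boolean option such as run_index is commonly wired up as a branch after the ingest tasks. A minimal sketch, with hypothetical dag and task ids:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator

def choose_index_path(**context):
    # Follow the reindex branch only when run_index was set in the
    # trigger configuration; otherwise skip it.
    conf = context["dag_run"].conf or {}
    return "run_complete_reindex" if conf.get("run_index", False) else "skip_reindex"

with DAG(
    dag_id="run_index_branch_example",  # hypothetical id
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    branch = BranchPythonOperator(
        task_id="check_run_index",
        python_callable=choose_index_path,
    )
    reindex = EmptyOperator(task_id="run_complete_reindex")
    skip = EmptyOperator(task_id="skip_reindex")
    branch >> [reindex, skip]
```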
Runs SOLR indexing for a single dataset into the live index. This does not run the all-datasets processes (Jackknife etc).
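Since the loading DAGs skip SOLR indexing, one plausible wiring is to hand a single dataset off to this indexing DAG with a TriggerDagRunOperator. In the sketch below the dag_ids and the datasetId conf key are assumptions for illustration, not identifiers taken from this project:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="trigger_index_example",  # hypothetical id
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    TriggerDagRunOperator(
        task_id="trigger_solr_index",
        trigger_dag_id="index_dataset_dag",  # assumed name for the indexing DAG
        conf={"datasetId": "dr123"},  # assumed conf key and example id
    )
```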