
# veda-data-airflow

This repo houses function code and deployment code for producing cloud-optimized data products and STAC metadata for interfaces such as https://github.com/NASA-IMPACT/delta-ui.

## Project layout

### Fetching Submodules

First time setting up the repo: `git submodule update --init --recursive`

Afterwards: `git submodule update --recursive --remote`

## Requirements

### Docker

See get-docker

### Terraform

See terraform-getting-started

### AWS CLI

See getting-started-install
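
As a quick sanity check (versions will differ by machine), you can confirm all three tools are installed and on your `PATH`:

```bash
# Print tool versions to confirm the requirements are installed.
docker --version
terraform -version
aws --version
```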

## Setup a local SM2A development environment

1. Build services: `make sm2a-local-build`

2. Initialize the metadata db: `make sm2a-local-init`

   🚨 NOTE: This command is typically required only once, when first setting up the environment. You generally do not need to run it again unless you run `make clean`, after which SM2A must be reinitialized with `make sm2a-local-init`.

   This creates an Airflow user with username `airflow` and password `airflow`.

3. Start all services: `make sm2a-local-run`

   This starts the SM2A services at http://localhost:8080 (a quick health check is shown after this list).

4. Stop all services: `make sm2a-local-stop`
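
Once the services are running, one way to confirm the local deployment is healthy (assuming the default port above) is Airflow's health endpoint; you can also log in at http://localhost:8080 with the `airflow` / `airflow` credentials created during initialization.

```bash
# Expect a JSON response reporting a "healthy" metadatabase and scheduler.
curl http://localhost:8080/health
```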

## Deployment

This project uses Terraform modules to deploy Apache Airflow and related AWS resources using Amazon's managed Airflow service (MWAA).

### Configure AWS Profile

Ensure that you have an AWS profile configured with the permissions necessary to deploy resources. The profile should be defined in `~/.aws/credentials` and named `veda` to match the existing `.env` files.
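
A minimal way to set this up, assuming you already have an access key pair for the target account:

```bash
# Create or update the "veda" profile (prompts for access key, secret key,
# default region, and output format).
aws configure --profile veda

# Confirm the profile resolves to the expected account.
aws sts get-caller-identity --profile veda
```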


### Make sure that environment variables are set

[`.env.example`](.env.example) contains the environment variables that are necessary to deploy. Copy this file and update its contents with actual values. The deploy script will `source` and use this file during deployment when it is provided on the command line:

```bash
# Copy .env.example to a new file
cp .env.example .env
# Fill in values for the environment variables

# Install the deploy requirements
pip install -r deploy_requirements.txt

# Init terraform modules
bash ./scripts/deploy.sh .env <<< init

# Deploy
bash ./scripts/deploy.sh .env <<< deploy
```

### Fetch environment variables using AWS CLI

To retrieve the variables for a stage that has been previously deployed, AWS Secrets Manager can be used to quickly populate an `.env` file with `scripts/sync-env-local.sh`:

`./scripts/sync-env-local.sh <app-secret-name>`
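
If you are unsure of the secret name, you can list the secrets visible to the `veda` profile and pick the one for your stage (this assumes the stage's settings are stored in Secrets Manager, as described above):

```bash
# List secret names available to the veda profile.
aws secretsmanager list-secrets --profile veda \
  --query 'SecretList[].Name' --output text
```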

> [!IMPORTANT]
> Be careful not to check in `.env` (or whatever you called your env file) when committing work.

Currently, the client ID and domain of an existing Cognito user pool programmatic client must be supplied in configuration as `VEDA_CLIENT_ID` and `VEDA_COGNITO_DOMAIN` (the veda-auth project can be used to deploy a Cognito user pool and client). To dispense auth tokens via the workflows API swagger docs, an administrator must add the ingest API lambda URL to the allowed callbacks of the Cognito client.

## Gitflow Model

*Figure: VEDA pipeline gitflow.*

## Ingestion Pipeline Overview

This pipeline is designed to handle ingestion of both vector and raster data. Ingestion is performed using the `veda-discover` DAG. Below are example configurations for both vector and raster data.

### Ingestion Configuration

#### Vector Data Ingestion

```json
{
  "collection": "",
  "bucket": "",
  "prefix": "",
  "filename_regex": ".*.csv$",
  "id_template": "-{}",
  "datetime_range": "",
  "vector": true,
  "x_possible": "longitude",
  "y_possible": "latitude",
  "source_projection": "EPSG:4326",
  "target_projection": "EPSG:4326",
  "extra_flags": ["-overwrite", "-lco", "OVERWRITE=YES"]
}
```

#### Raster Data Ingestion

```json
{
  "collection": "",
  "bucket": "",
  "prefix": "",
  "filename_regex": ".*.tif$",
  "datetime_range": "",
  "assets": {
    "co2": {
      "title": "",
      "description": ".",
      "regex": ".*.tif$"
    }
  },
  "id_regex": ".*_(.*).tif$",
  "id_template": "-{}"
}
```
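
Either configuration is passed to the DAG as its run configuration. As a sketch of how a run might be triggered against the local SM2A environment described above (assuming the default `airflow`/`airflow` credentials and that the Airflow REST API's basic-auth backend is enabled; the bucket and prefix values are placeholders):

```bash
# Trigger a veda-discover run with an inline vector-ingest configuration.
curl -X POST "http://localhost:8080/api/v1/dags/veda-discover/dagRuns" \
  --user "airflow:airflow" \
  -H "Content-Type: application/json" \
  -d '{
        "conf": {
          "collection": "example-vector-collection",
          "bucket": "example-bucket",
          "prefix": "vector/",
          "filename_regex": ".*.csv$",
          "vector": true,
          "x_possible": "longitude",
          "y_possible": "latitude",
          "source_projection": "EPSG:4326",
          "target_projection": "EPSG:4326",
          "extra_flags": ["-append"]
        }
      }'
```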

### Configuration Fields Description

### Pipeline Behaviour

Since this pipeline can ingest both raster and vector data, the configuration is adjusted accordingly. Setting `"vector": true` triggers the `generic_ingest_vector` DAG. If a collection is provided, the collection name is used as the table name for ingestion (in that case, using the `append` extra flag is recommended). When no collection is provided, a table name is generated by appending the name of the ingested file to the `id_template` (in that case, using the `overwrite` extra flag is recommended).

Setting "vector_eis": true will trigger the EIS Fire specific ingest_vector dag. If neither of these flags is set, the raster ingestion will be triggered, with the configuration typically looking like the raster ingestion example above.

## License

This project is licensed under Apache 2.0; see the LICENSE file for more details.