
# veda-data-airflow

This repo houses function code and deployment code for producing cloud-optimized data products and STAC metadata for interfaces such as https://github.com/NASA-IMPACT/delta-ui.

## Project layout

### Fetching Submodules

First time setting up the repo: `git submodule update --init --recursive`

Afterwards: `git submodule update --recursive --remote`

## Requirements

### Docker

See get-docker

### Terraform

See terraform-getting-started

### AWS CLI

See getting-started-install
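
As a quick sanity check (versions will differ by machine), you can confirm all three tools are installed and on your `PATH`:

```bash
# Print tool versions to confirm the requirements are installed.
docker --version
terraform -version
aws --version
```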

## Setup a local SM2A development environment

1. Build services: `make sm2a-local-build`

2. Initialize the metadata db: `make sm2a-local-init`

   🚨 NOTE: This command is typically required only once, when first setting up the environment. You generally do not need to run it again unless you run `make clean`, after which SM2A must be reinitialized with `make sm2a-local-init`.

   This creates an Airflow user with username `airflow` and password `airflow`.

3. Start all services: `make sm2a-local-run`

   This starts the SM2A services at http://localhost:8080 (a quick health check is shown after this list).

4. Stop all services: `make sm2a-local-stop`
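
Once the services are running, one way to confirm the local deployment is healthy (assuming the default port above) is Airflow's health endpoint; you can also log in at http://localhost:8080 with the `airflow` / `airflow` credentials created during initialization.

```bash
# Expect a JSON response reporting a "healthy" metadatabase and scheduler.
curl http://localhost:8080/health
```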

## Deployment

This project uses Terraform modules to deploy Apache Airflow and related AWS resources using Amazon's managed Airflow service (MWAA).

### Configure AWS Profile

Ensure that you have an AWS profile configured with the permissions necessary to deploy resources. The profile should be defined in `~/.aws/credentials` and named `veda` to match the existing `.env` files.
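
A minimal way to set this up, assuming you already have an access key pair for the target account:

```bash
# Create or update the "veda" profile (prompts for access key, secret key,
# default region, and output format).
aws configure --profile veda

# Confirm the profile resolves to the expected account.
aws sts get-caller-identity --profile veda
```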


### Make sure that environment variables are set

[`.env.example`](.env.example) contains the environment variables that are necessary to deploy. Copy this file and update its contents with actual values. The deploy script will `source` and use this file during deployment when it is provided on the command line:

```bash
# Copy .env.example to a new file
cp .env.example .env
# Fill in values for the environment variables

# Install the deploy requirements
pip install -r deploy_requirements.txt

# Init terraform modules
bash ./scripts/deploy.sh .env <<< init

# Deploy
bash ./scripts/deploy.sh .env <<< deploy
```

### Fetch environment variables using AWS CLI

To retrieve the variables for a stage that has been previously deployed, AWS Secrets Manager can be used to quickly populate an `.env` file with `scripts/sync-env-local.sh`:

`./scripts/sync-env-local.sh <app-secret-name>`
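
If you are unsure of the secret name, you can list the secrets visible to the `veda` profile and pick the one for your stage (this assumes the stage's settings are stored in Secrets Manager, as described above):

```bash
# List secret names available to the veda profile.
aws secretsmanager list-secrets --profile veda \
  --query 'SecretList[].Name' --output text
```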

> [!IMPORTANT]
> Be careful not to check in `.env` (or whatever you called your env file) when committing work.

Currently, the client ID and domain of an existing Cognito user pool programmatic client must be supplied in configuration as `VEDA_CLIENT_ID` and `VEDA_COGNITO_DOMAIN` (the veda-auth project can be used to deploy a Cognito user pool and client). To dispense auth tokens via the workflows API swagger docs, an administrator must add the ingest API lambda URL to the allowed callbacks of the Cognito client.

## Gitflow Model

*Figure: VEDA pipeline gitflow.*

## Ingestion Pipeline Overview

This pipeline is designed to handle ingestion of both vector and raster data. Ingestion is performed using the `veda-discover` DAG. Below are example configurations for both vector and raster data.

### Ingestion Configuration

#### Vector Data Ingestion

```json
{
  "collection": "",
  "bucket": "",
  "prefix": "",
  "filename_regex": ".*.csv$",
  "id_template": "-{}",
  "datetime_range": "",
  "vector": true,
  "x_possible": "longitude",
  "y_possible": "latitude",
  "source_projection": "EPSG:4326",
  "target_projection": "EPSG:4326",
  "extra_flags": ["-overwrite", "-lco", "OVERWRITE=YES"]
}
```

#### Raster Data Ingestion

```json
{
  "collection": "",
  "bucket": "",
  "prefix": "",
  "filename_regex": ".*.tif$",
  "datetime_range": "",
  "assets": {
    "co2": {
      "title": "",
      "description": ".",
      "regex": ".*.tif$"
    }
  },
  "id_regex": ".*_(.*).tif$",
  "id_template": "-{}"
}
```
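
Either configuration is passed to the DAG as its run configuration. As a sketch of how a run might be triggered against the local SM2A environment described above (assuming the default `airflow`/`airflow` credentials and that the Airflow REST API's basic-auth backend is enabled; the bucket and prefix values are placeholders):

```bash
# Trigger a veda-discover run with an inline vector-ingest configuration.
curl -X POST "http://localhost:8080/api/v1/dags/veda-discover/dagRuns" \
  --user "airflow:airflow" \
  -H "Content-Type: application/json" \
  -d '{
        "conf": {
          "collection": "example-vector-collection",
          "bucket": "example-bucket",
          "prefix": "vector/",
          "filename_regex": ".*.csv$",
          "vector": true,
          "x_possible": "longitude",
          "y_possible": "latitude",
          "source_projection": "EPSG:4326",
          "target_projection": "EPSG:4326",
          "extra_flags": ["-append"]
        }
      }'
```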

### Configuration Fields Description

### Pipeline Behaviour

Since this pipeline can ingest both raster and vector data, the configuration is adjusted accordingly. Setting `"vector": true` triggers the `generic_ingest_vector` DAG. If a collection is provided, the collection name is used as the table name for ingestion (in that case, using the `append` extra flag is recommended). When no collection is provided, a table name is generated by appending the name of the ingested file to the `id_template` (in that case, using the `overwrite` extra flag is recommended).

Setting "vector_eis": true will trigger the EIS Fire specific ingest_vector dag. If neither of these flags is set, the raster ingestion will be triggered, with the configuration typically looking like the raster ingestion example above.

## License

This project is licensed under Apache 2.0; see the LICENSE file for more details.