NASA-IMPACT / veda-data


Automate Geoglam & NO2 dataset ingestion #155

Open · smohiudd opened 3 months ago

smohiudd commented 3 months ago

Description

The NO2 (#89) and Geoglam (#167, #173) datasets require monthly ingestion as new assets are created. This is currently a manual process but should be automated. veda-data-airflow has a feature that allows scheduled ingestion by creating dataset-specific DAGs. The new asset files must still be transferred to the collection S3 bucket, and a JSON config file must be uploaded to the Airflow event bucket. Here is an example JSON:

{
    "collection": "emit-ch4plume-v1",
    "bucket": "lp-prod-protected",
    "prefix": "EMITL2BCH4PLM.001/",
    "filename_regex": ".*.tif$",
    "schedule": "00 05 * * *",
    "assets": {
        "ch4-plume-emissions": {
            "title": "EMIT Methane Point Source Plume Complexes",
            "description": "Methane plume complexes from point source emitters.",
            "regex": ".*.tif$"
        }
    }
}
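As a rough sketch of what an analogous scheduled config for one of these datasets could look like, the Python snippet below builds a monthly NO2 config mirroring the fields in the example above. The collection id, bucket, prefix, and asset key used here are placeholders for illustration, not the actual VEDA values.

import json

# Hypothetical scheduled-ingestion config for a monthly NO2 dataset,
# mirroring the fields of the emit-ch4plume-v1 example above.
# Collection id, bucket, prefix, and asset key are placeholders.
no2_config = {
    "collection": "no2-monthly",          # placeholder collection id
    "bucket": "veda-data-store",          # placeholder collection bucket
    "prefix": "no2-monthly/",             # placeholder key prefix
    "filename_regex": ".*.tif$",
    "schedule": "0 0 1 * *",              # run on the 1st of every month
    "assets": {
        "cog_default": {
            "title": "NO2 Monthly",
            "description": "Monthly nitrogen dioxide concentrations.",
            "regex": ".*.tif$"
        }
    }
}

with open("no2-monthly-items.json", "w") as f:
    json.dump(no2_config, f, indent=4)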

Acceptance Criteria

slesaad commented 3 months ago

Putting the discovery-items config within s3://<EVENT_BUCKET>/collections/, in the following format (https://github.com/US-GHG-Center/ghgc-data/blob/add/lpdaac-dataset-scheduled-config/ingestion-data/discovery-items/scheduled/emit-ch4plume-v1-items.json), will trigger discovery and subsequent ingestion of the collection items based on the schedule attribute. See the sketch below.
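A minimal sketch of dropping such a config into place with boto3, assuming hypothetical bucket and file names (the real event bucket name comes from the deployment):

import boto3

# Assumed values: replace with the deployment's actual event bucket and config file.
EVENT_BUCKET = "veda-airflow-event-bucket"    # placeholder for <EVENT_BUCKET>
CONFIG_FILE = "no2-monthly-items.json"        # config built in the earlier sketch

s3 = boto3.client("s3")
# Uploading the file under the collections/ prefix is what triggers the scheduled discovery.
s3.upload_file(CONFIG_FILE, EVENT_BUCKET, f"collections/{CONFIG_FILE}")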

smohiudd commented 2 months ago

mcp-prod will need a new release of Airflow that includes the automated ingestion feature.