NASA-IMPACT / veda-data-airflow

Airflow implementation of ingest pipeline for VEDA STAC data

Basic ingest testing #184

Open anayeaye opened 1 month ago

anayeaye commented 1 month ago

Purpose

Describe how to test our most basic VEDA ingest use cases against two representative VEDA production collections whenever changes are made to airflow and/or the workflows endpoints.

Representative collections

  1. A single-asset cog_default collection
  2. A multi-asset collection that causes the discovery step to group S3 objects sharing the same datetime instant into a single item record

Representative properties for most VEDA collections

Most/all VEDA collections have these properties

Missing from this issue (a very incomplete list)

Overview

  1. Choose a production collection in veda-data/ingestion-data/production/collections and create a local copy with a new ID and title for testing. Append something obvious, such as -temp-test-copy, to the collection id and update the title so the copy is easy to pick out in the browser.
  2. Find the associated config in veda-data/ingestion-data/production/discovery-items, create a local copy, and update its collection id to match the test collection you just prepared. For many collections there may also be a staging dataset-config associated with the collection you are copying. In that case:
    • create a local copy
    • update the collection id to match your test copy collection
    • you may need to change the discovery bucket from veda-data-store to veda-data-store-staging, depending on which MWAA, veda-data-airflow, and veda-backend environment you are testing.
  3. Document what is under test so you can keep track of the URLs of both the backend catalog and the ingest systems under test.
    • STAC_API_URL = ___________
    • INGEST_API_URL = ___________
    • WORKFLOWS_API_URL = ___________
    • AIRFLOW UI URL = ___________
  4. Test the most commonly used ingest patterns, deleting your test collection between tests. Before recreating the test collection for the next test, make sure the delete operation has finished updating the items partitions table: if items show up in the recreated collection too quickly, the previous delete had not completed.
    • Pattern 1: manually create the collection via the ingest-api/collections endpoint, then trigger a discovery workflow via the workflows-api/discovery endpoint.
    • Pattern 2: manually create the collection via the ingest-api/collections endpoint, then trigger the veda_discover DAG via the veda-data-airflow UI.
    • Pattern 3: generate a composite dataset config with s3 discovery + pseudo-STAC collection object and submit it to the workflows-api/dataset endpoint, which manages the collection creation and then triggers a veda_discover DAG. This pattern cannot yet handle multi-asset collections or additional properties.
    • Pattern 4: generate a composite dataset config with s3 discovery + pseudo-STAC collection object and manually trigger veda_dataset_pipeline via the airflow UI, which handles the collection and then discovery as a subdag operation.
  5. Clean up: delete the copied test collections.
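Steps 1 and 2 of the overview can be scripted so the test copies are always named the same way. This is only a minimal sketch: the helper names `make_test_collection` and `make_test_discovery_config` are hypothetical, the `DELETE ME` title prefix simply mirrors the examples in this issue, and the production id `OMI_trno2-COG` in the usage lines is an illustrative guess.

```python
import copy
from typing import Optional

# Suffix suggested in step 1; any obvious marker would work.
TEST_SUFFIX = "-temp-test-copy"


def make_test_collection(collection: dict, suffix: str = TEST_SUFFIX) -> dict:
    """Return a copy of a collection with an obviously temporary id and title."""
    test_copy = copy.deepcopy(collection)
    test_copy["id"] = f"{collection['id']}{suffix}"
    # Fall back to the id when the source collection has no title.
    test_copy["title"] = f"DELETE ME {collection.get('title', collection['id'])}"
    return test_copy


def make_test_discovery_config(
    config: dict, test_collection_id: str, bucket: Optional[str] = None
) -> dict:
    """Return a copy of a discovery config pointed at the test collection.

    Pass `bucket` only when the environment under test needs a different
    discovery bucket (for example veda-data-store-staging).
    """
    test_copy = copy.deepcopy(config)
    test_copy["collection"] = test_collection_id
    if bucket is not None:
        test_copy["bucket"] = bucket
    return test_copy


# Illustrative input only; load the real JSON files from veda-data instead.
production_collection = {"id": "OMI_trno2-COG", "title": "OMI_trno2"}
production_discovery = {"collection": "OMI_trno2-COG", "bucket": "veda-data-store"}

test_collection = make_test_collection(production_collection)
test_discovery = make_test_discovery_config(
    production_discovery, test_collection["id"], bucket="veda-data-store-staging"
)
```

Both helpers deep-copy their input, so the dictionaries loaded from the production files are never modified by accident.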

Examples

Single-asset collection

collection.json
```json
{
  "id": "OMI_trno2-COG-deleteme",
  "type": "Collection",
  "links": [],
  "title": "DELETE ME OMI_trno2",
  "extent": {
    "spatial": { "bbox": [[-180, -90, 180, 90]] },
    "temporal": { "interval": [[null, null]] }
  },
  "license": "MIT",
  "description": "OMI_trno2 - 0.10 x 0.10 Annual as Cloud-Optimized GeoTIFFs (COGs)",
  "stac_version": "1.0.0",
  "renders": {
    "dashboard": {
      "colormap_name": "reds",
      "rescale": [[0, 3000000000000000.0]],
      "assets": ["cog_default"],
      "title": "VEDA Dashboard Render Parameters"
    }
  },
  "providers": [
    {
      "name": "NASA VEDA",
      "url": "https://www.earthdata.nasa.gov/dashboard/",
      "roles": ["host"]
    }
  ],
  "item_assets": {
    "test_asset": {
      "title": "An item asset description for test",
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "roles": ["test"]
    },
    "cog_default": {
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "roles": ["data", "layer"],
      "title": "Default COG Layer",
      "description": "Cloud optimized default layer to display on map"
    }
  },
  "assets": {
    "thumbnail": {
      "title": "Thumbnail",
      "description": "Photo by [Mick Truyts](https://unsplash.com/photos/x6WQeNYJC1w) (Power plant shooting steam at the sky)",
      "href": "https://thumbnails.openveda.cloud/no2--dataset-cover.jpg",
      "type": "image/jpeg",
      "roles": ["thumbnail"]
    }
  }
}
```
discovery-config.json
```json
{
  "collection": "OMI_trno2-COG-deleteme",
  "bucket": "veda-data-store-staging",
  "datetime_range": "year",
  "discovery": "s3",
  "filename_regex": "^(.*).tif$",
  "prefix": "OMI_trno2-COG/"
}
```
dataset-config.json
```json
{
  "assets": {
    "thumbnail": {
      "description": "Photo by [Mick Truyts](https://unsplash.com/photos/x6WQeNYJC1w) (Power plant shooting steam at the sky)",
      "href": "https://thumbnails.openveda.cloud/no2--dataset-cover.jpg",
      "roles": ["thumbnail"],
      "title": "Thumbnail",
      "type": "image/jpeg"
    }
  },
  "collection": "OMI_trno2-COG-deleteme",
  "data_type": "cog",
  "description": "OMI_trno2 - 0.10 x 0.10 Annual as Cloud-Optimized GeoTIFFs (COGs)",
  "discovery_items": [
    {
      "bucket": "veda-data-store-staging",
      "datetime_range": "year",
      "discovery": "s3",
      "filename_regex": "^(.*).tif$",
      "prefix": "OMI_trno2-COG/"
    }
  ],
  "is_periodic": true,
  "license": "MIT",
  "providers": [
    {
      "name": "NASA VEDA",
      "roles": ["host"],
      "url": "https://www.earthdata.nasa.gov/dashboard/"
    }
  ],
  "renders": {
    "dashboard": {
      "assets": ["cog_default"],
      "colormap_name": "reds",
      "rescale": [[0, 3000000000000000]],
      "title": "VEDA Dashboard Render Parameters"
    }
  },
  "time_density": "year",
  "title": "DELETE ME OMI_trno2"
}
```
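Because the three files above have to agree on the collection id, a quick pre-flight check before submitting anything can save a failed run. The sketch below is an assumption about which fields should line up, based only on the examples in this issue; `check_test_configs` is a hypothetical helper and not part of veda-data-airflow.

```python
def check_test_configs(collection: dict, discovery_config: dict, dataset_config: dict) -> list:
    """Return a list of human-readable problems; an empty list means no mismatch found."""
    problems = []
    collection_id = collection["id"]

    # The discovery config and dataset config should target the test collection.
    if discovery_config.get("collection") != collection_id:
        problems.append(
            f"discovery-config targets {discovery_config.get('collection')!r}, "
            f"expected {collection_id!r}"
        )
    if dataset_config.get("collection") != collection_id:
        problems.append(
            f"dataset-config targets {dataset_config.get('collection')!r}, "
            f"expected {collection_id!r}"
        )

    # Every asset named in a render block should be declared in item_assets.
    item_assets = collection.get("item_assets", {})
    for render_name, render in collection.get("renders", {}).items():
        for asset_name in render.get("assets", []):
            if asset_name not in item_assets:
                problems.append(
                    f"render {render_name!r} uses undeclared asset {asset_name!r}"
                )
    return problems


# Trimmed-down versions of the single-asset examples above.
collection = {
    "id": "OMI_trno2-COG-deleteme",
    "item_assets": {"cog_default": {}, "test_asset": {}},
    "renders": {"dashboard": {"assets": ["cog_default"]}},
}
discovery_config = {"collection": "OMI_trno2-COG-deleteme"}
dataset_config = {"collection": "OMI_trno2-COG-deleteme"}

problems = check_test_configs(collection, discovery_config, dataset_config)
```

In practice the three dictionaries would come from `json.load` on the local test copies rather than from inline literals.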

multi-asset collection

collection.json
```json
{
  "id": "climdex-tmaxxf-access-cm2-ssp126-deleteme",
  "type": "Collection",
  "links": [],
  "title": "DELETE THIS TEST CLIMDEX ACCESS CM2 SSP125 tmaxXF",
  "extent": {
    "spatial": { "bbox": [[-180, -90, 180, 90]] },
    "temporal": {
      "interval": [["2015-01-01T00:00:00+00:00", "2101-12-31T23:59:59+00:00"]]
    }
  },
  "license": "CC-BY-SA-4.0",
  "description": "CLIMDEX ACCESS CM2 SSP125 - variable tmaxXF",
  "item_assets": {
    "cog_default": {
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "roles": ["data", "layer"],
      "title": "Default COG Layer",
      "description": "Cloud optimized default layer to display on map"
    },
    "tmax_above_86": {
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "roles": ["data", "layer"],
      "title": "Tmax Above 86",
      "description": "Tmax Above 86"
    },
    "tmax_above_90": {
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "roles": ["data", "layer"],
      "title": "Tmax Above 90",
      "description": "Tmax Above 90"
    },
    "tmax_above_100": {
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "roles": ["data", "layer"],
      "title": "Tmax Above 100",
      "description": "Tmax Above 100"
    },
    "tmax_above_110": {
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "roles": ["data", "layer"],
      "title": "Tmax Above 110",
      "description": "Tmax Above 110"
    },
    "tmax_above_115": {
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "roles": ["data", "layer"],
      "title": "Tmax Above 115",
      "description": "Tmax Above 115"
    }
  },
  "stac_version": "1.0.0",
  "dashboard:is_periodic": true,
  "dashboard:time_density": "year",
  "providers": [
    {
      "name": "NASA VEDA",
      "url": "https://www.earthdata.nasa.gov/dashboard/",
      "roles": ["host"]
    }
  ],
  "assets": {
    "thumbnail": {
      "title": "Thumbnail",
      "description": "Photo by NASA (CMIP6 Climdex TmaxXF Screenshot)",
      "href": "https://thumbnails.openveda.cloud/cmip6-climdex-tmaxxf-access-cm2.png",
      "type": "image/png",
      "roles": ["thumbnail"]
    }
  }
}
```
discovery-config.json
```json
{
  "collection": "climdex-tmaxxf-access-cm2-ssp126-deleteme",
  "bucket": "veda-data-store-staging",
  "prefix": "climdex-tmaxxf-access-cm2-ssp126/",
  "filename_regex": ".*-ssp126_209(.*)_tmax.*.tif$",
  "id_regex": ".*-ssp126_(.*)_tmax.*.tif$",
  "id_template": "climdex-tmaxxf-access-cm2-ssp126-{}",
  "datetime_range": "year",
  "assets": {
    "tmax_above_86": {
      "title": "Tmax Above 86",
      "description": "Tmax Above 86",
      "regex": ".*-ssp126_(.*)_tmax_above_86.tif"
    },
    "tmax_above_90": {
      "title": "Tmax Above 90",
      "description": "Tmax Above 90",
      "regex": ".*-ssp126_(.*)_tmax_above_90.tif"
    },
    "tmax_above_100": {
      "title": "Tmax Above 100",
      "description": "Tmax Above 100",
      "regex": ".*-ssp126_(.*)_tmax_above_100.tif"
    },
    "tmax_above_110": {
      "title": "Tmax Above 110",
      "description": "Tmax Above 110",
      "regex": ".*-ssp126_(.*)_tmax_above_110.tif"
    },
    "tmax_above_115": {
      "title": "Tmax Above 115",
      "description": "Tmax Above 115",
      "regex": ".*-ssp126_(.*)_tmax_above_115.tif"
    }
  },
  "discovery": "s3",
  "upload": false
}
```
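To make the intent of `filename_regex`, `id_regex`, `id_template`, and the per-asset `regex` fields in this discovery config concrete, here is a rough sketch of how S3 keys could be grouped into one item per year. It is not the actual veda-data-airflow discovery code, which may differ in matching rules and also derives datetimes; `group_keys_into_items` is a hypothetical helper, and apart from the first key (taken from `sample_files` in the dataset config) the keys are made up.

```python
import re
from collections import defaultdict


def group_keys_into_items(keys, config):
    """Group S3 keys into {item_id: {asset_name: key}} using the discovery config regexes."""
    items = defaultdict(dict)
    for key in keys:
        # Only keys matching filename_regex are considered at all.
        if not re.search(config["filename_regex"], key):
            continue
        id_match = re.search(config["id_regex"], key)
        if id_match is None:
            continue
        # The first capture group of id_regex fills the id_template placeholder.
        item_id = config["id_template"].format(id_match.group(1))
        # Keys sharing an item id become separate assets of the same item.
        for asset_name, asset in config["assets"].items():
            if re.search(asset["regex"], key):
                items[item_id][asset_name] = key
    return dict(items)


config = {
    "filename_regex": ".*-ssp126_209(.*)_tmax.*.tif$",
    "id_regex": ".*-ssp126_(.*)_tmax.*.tif$",
    "id_template": "climdex-tmaxxf-access-cm2-ssp126-{}",
    "assets": {
        "tmax_above_86": {"regex": ".*-ssp126_(.*)_tmax_above_86.tif"},
        "tmax_above_90": {"regex": ".*-ssp126_(.*)_tmax_above_90.tif"},
    },
}

prefix = "climdex-tmaxxf-access-cm2-ssp126/tmaxXF-ACCESS-CM2-ssp126_"
keys = [
    prefix + "2099_tmax_above_86.tif",
    prefix + "2099_tmax_above_90.tif",  # hypothetical key
    prefix + "2098_tmax_above_86.tif",  # hypothetical key
    prefix + "2015_tmax_above_86.tif",  # hypothetical key, excluded by filename_regex
]

items = group_keys_into_items(keys, config)
```

With these inputs the two 2099 files land in a single item with two assets, the 2098 file forms its own item, and the 2015 file is skipped because `filename_regex` only admits years starting with 209.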
dataset-config.json
```json
{
  "collection": "climdex-tmaxxf-access-cm2-ssp126-multi-asset",
  "data_type": "cog",
  "spatial_extent": { "xmin": -180, "ymin": -90, "xmax": 180, "ymax": 90 },
  "temporal_extent": {
    "startdate": "2015-01-01T00:00:00Z",
    "enddate": "2101-12-31T23:59:59Z"
  },
  "description": "CLIMDEX ACCESS CM2 SSP125 - variable tmaxXF",
  "is_periodic": true,
  "license": "MIT",
  "item_assets": {
    "cog_default": {
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "roles": ["data", "layer"],
      "title": "Default COG Layer",
      "description": "Cloud optimized default layer to display on map"
    },
    "tmax_above_86": {
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "roles": ["data", "layer"],
      "title": "Tmax Above 86",
      "description": "Tmax Above 86"
    },
    "tmax_above_90": {
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "roles": ["data", "layer"],
      "title": "Tmax Above 90",
      "description": "Tmax Above 90"
    },
    "tmax_above_100": {
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "roles": ["data", "layer"],
      "title": "Tmax Above 100",
      "description": "Tmax Above 100"
    },
    "tmax_above_110": {
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "roles": ["data", "layer"],
      "title": "Tmax Above 110",
      "description": "Tmax Above 110"
    },
    "tmax_above_115": {
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "roles": ["data", "layer"],
      "title": "Tmax Above 115",
      "description": "Tmax Above 115"
    }
  },
  "sample_files": [
    "s3://veda-data-store-staging/climdex-tmaxxf-access-cm2-ssp126/tmaxXF-ACCESS-CM2-ssp126_2099_tmax_above_86.tif"
  ],
  "providers": [
    {
      "name": "NASA VEDA",
      "url": "https://www.earthdata.nasa.gov/dashboard/",
      "roles": ["host"]
    }
  ],
  "renders": {
    "dashboard": {
      "assets": ["cog_default"],
      "colormap_name": "reds",
      "rescale": [[0, 3000000000000000]],
      "title": "VEDA Dashboard Render Parameters"
    }
  },
  "assets": {
    "thumbnail": {
      "title": "Thumbnail",
      "description": "Photo by NASA (CMIP6 Climdex TmaxXF Screenshot)",
      "href": "https://thumbnails.openveda.cloud/cmip6-climdex-tmaxxf-access-cm2.png",
      "type": "image/png",
      "roles": ["thumbnail"]
    }
  },
  "time_density": "year",
  "title": "DELETE ME CLIMDEX",
  "discovery_items": [
    {
      "collection": "climdex-tmaxxf-access-cm2-ssp126-deleteme",
      "bucket": "veda-data-store-staging",
      "prefix": "climdex-tmaxxf-access-cm2-ssp126/",
      "filename_regex": ".*-ssp126_209(.*)_tmax.*.tif$",
      "id_regex": ".*-ssp126_(.*)_tmax.*.tif$",
      "id_template": "climdex-tmaxxf-access-cm2-ssp126-{}",
      "datetime_range": "year",
      "assets": {
        "tmax_above_86": {
          "title": "Tmax Above 86",
          "description": "Tmax Above 86",
          "regex": ".*-ssp126_(.*)_tmax_above_86.tif"
        },
        "tmax_above_90": {
          "title": "Tmax Above 90",
          "description": "Tmax Above 90",
          "regex": ".*-ssp126_(.*)_tmax_above_90.tif"
        },
        "tmax_above_100": {
          "title": "Tmax Above 100",
          "description": "Tmax Above 100",
          "regex": ".*-ssp126_(.*)_tmax_above_100.tif"
        },
        "tmax_above_110": {
          "title": "Tmax Above 110",
          "description": "Tmax Above 110",
          "regex": ".*-ssp126_(.*)_tmax_above_110.tif"
        },
        "tmax_above_115": {
          "title": "Tmax Above 115",
          "description": "Tmax Above 115",
          "regex": ".*-ssp126_(.*)_tmax_above_115.tif"
        }
      },
      "discovery": "s3",
      "upload": false
    }
  ]
}
```
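The composite dataset config used by patterns 3 and 4 is mostly a reshuffle of the collection JSON plus the discovery config, so it could be generated instead of hand-written. The field mapping below is inferred purely from the example files in this issue and not from the workflows-api schema, so treat it as an assumption; `build_dataset_config` is a hypothetical helper.

```python
def build_dataset_config(collection: dict, discovery_items: list, data_type: str = "cog") -> dict:
    """Assemble a composite dataset config from a STAC-like collection and discovery configs."""
    xmin, ymin, xmax, ymax = collection["extent"]["spatial"]["bbox"][0]
    startdate, enddate = collection["extent"]["temporal"]["interval"][0]

    config = {
        "collection": collection["id"],
        "title": collection["title"],
        "description": collection["description"],
        "license": collection["license"],
        "data_type": data_type,
        "spatial_extent": {"xmin": xmin, "ymin": ymin, "xmax": xmax, "ymax": ymax},
        "temporal_extent": {"startdate": startdate, "enddate": enddate},
        "discovery_items": discovery_items,
    }

    # The examples store these as dashboard:* on the collection
    # and as bare keys on the dataset config.
    for stac_key, dataset_key in (
        ("dashboard:is_periodic", "is_periodic"),
        ("dashboard:time_density", "time_density"),
    ):
        if stac_key in collection:
            config[dataset_key] = collection[stac_key]

    # Copied through unchanged when present.
    for key in ("providers", "renders", "assets", "item_assets"):
        if key in collection:
            config[key] = collection[key]
    return config


# Trimmed-down version of the multi-asset collection above.
collection = {
    "id": "climdex-tmaxxf-access-cm2-ssp126-deleteme",
    "title": "DELETE THIS TEST CLIMDEX ACCESS CM2 SSP125 tmaxXF",
    "description": "CLIMDEX ACCESS CM2 SSP125 - variable tmaxXF",
    "license": "CC-BY-SA-4.0",
    "extent": {
        "spatial": {"bbox": [[-180, -90, 180, 90]]},
        "temporal": {
            "interval": [["2015-01-01T00:00:00+00:00", "2101-12-31T23:59:59+00:00"]]
        },
    },
    "dashboard:is_periodic": True,
    "dashboard:time_density": "year",
}
discovery_items = [{"collection": collection["id"], "discovery": "s3"}]

dataset_config = build_dataset_config(collection, discovery_items)
```

Generating the config this way also keeps the top-level `collection` and the `discovery_items` collection ids in sync, since both come from the same test collection.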
anayeaye commented 1 month ago

Results

veda-data-airflow veda-pipeline-sit + veda-backend-dev (stac-api and ingest-api) testing for PRs #183 and #159

single-asset collection

Pattern 1: manually create the collection via the ingest-api/collections endpoint, then trigger a discovery workflow via the workflows-api/discovery endpoint.

Pattern 2: manually create the collection via the ingest-api/collections endpoint, then trigger the veda_discover DAG via the veda-data-airflow UI.

Pattern 3: generate a composite dataset config with s3 discovery+pseudo-STAC collection object and submit workflows-api/dataset endpoint.

Pattern 4: generate a composite dataset config with s3 discovery+pseudo-STAC collection object manually trigger veda_dataset_pipeline via the airflow UI.

multi-asset collection

Generated test metadata for future testing and then confirmed that patterns 1 and 2 fail with the same error as the single-asset collection. I haven't had time yet to generate the composite dataset config input JSON for this multi-asset test collection.

anayeaye commented 1 month ago

More test cases! I'm going to link examples here until we set up a home base for workflow test data. https://github.com/NASA-IMPACT/veda-data-airflow/pull/179#issuecomment-2231680684