NASA-IMPACT / veda-data-airflow

Airflow implementation of ingest pipeline for VEDA STAC data

Comparison of data ingest using Airflow vs Legacy (Step Functions) #7

Open slesaad opened 1 year ago

slesaad commented 1 year ago

Description

To validate ingests performed with the new Airflow-based pipeline [1], this issue runs the same ingestion through both [1] and the legacy Step Functions-based pipeline [2].

The ingest is initiated via the veda-stac-ingestor API endpoint /dataset/publish, with the same inputs for both pipelines except the collection id, as shown below:

For [2], the input was:

{
  "collection": "lis-global-da-tws-trend",
  "title": "Terrestrial Water Storage Trend - LIS 10km Global DA",
  "description": "Gridded trend in terrestrial water storage (theil-sen slope estimation in mm yr-1) from 10km global LIS with assimilation",
  "license": "CC0-1.0",
  "is_periodic": false,
  "time_density": null,
  "spatial_extent": {
    "xmin": -179.95,
    "ymin": -59.45,
    "xmax": 179.95,
    "ymax": 83.55
  },
  "temporal_extent": {
    "startdate": "2003-01-01T00:00:00Z",
    "enddate": "2021-12-31T23:59:59Z"
  },
  "sample_files": [
    "s3://veda-data-store-staging/EIS/COG/LIS_GLOBAL_DA/DA_Trends/DATWS_STL_based_trend.cog.tif"
  ],
  "discovery_items": [
    {
      "collection": "lis-global-da-tws-trend-airflow",
      "discovery": "s3",
      "cogify": false,
      "upload": false,
      "dry_run": false,
      "prefix": "EIS/COG/LIS_GLOBAL_DA/DA_Trends/",
      "bucket": "veda-data-store-staging",
      "filename_regex": "(.*)DATWS_STL_based_trend.cog.tif$",
      "start_datetime": "2003-01-01T00:00:00Z",
      "end_datetime": "2021-12-31T23:59:59Z"
    }
  ]
}

Similarly, for [1], the input was:

{
  "collection": "lis-global-da-tws-trend-airflow",
  "title": "Terrestrial Water Storage Trend - LIS 10km Global DA",
  "description": "Gridded trend in terrestrial water storage (theil-sen slope estimation in mm yr-1) from 10km global LIS with assimilation",
  "license": "CC0-1.0",
  "is_periodic": false,
  "time_density": null,
  "spatial_extent": {
    "xmin": -179.95,
    "ymin": -59.45,
    "xmax": 179.95,
    "ymax": 83.55
  },
  "temporal_extent": {
    "startdate": "2003-01-01T00:00:00Z",
    "enddate": "2021-12-31T23:59:59Z"
  },
  "sample_files": [
    "s3://veda-data-store-staging/EIS/COG/LIS_GLOBAL_DA/DA_Trends/DATWS_STL_based_trend.cog.tif"
  ],
  "discovery_items": [
    {
      "collection": "lis-global-da-tws-trend-airflow",
      "discovery": "s3",
      "cogify": false,
      "upload": false,
      "dry_run": false,
      "prefix": "EIS/COG/LIS_GLOBAL_DA/DA_Trends/",
      "bucket": "veda-data-store-staging",
      "filename_regex": "(.*)DATWS_STL_based_trend.cog.tif$",
      "start_datetime": "2003-01-01T00:00:00Z",
      "end_datetime": "2021-12-31T23:59:59Z"
    }
  ]
}
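
Since the two request bodies are meant to differ only in the collection id, one way to avoid copy-paste drift is to derive both from a single base payload. The following is a minimal sketch, not the ingestor's own tooling: `BASE_PAYLOAD` is abbreviated from the inputs above, `with_collection_id` and `differing_keys` are hypothetical helper names, and it assumes the `collection` inside each discovery item should follow the top-level collection id.

```python
import copy

# Abbreviated copy of the request body shown above; only the fields
# needed for this illustration are kept.
BASE_PAYLOAD = {
    "collection": "lis-global-da-tws-trend",
    "title": "Terrestrial Water Storage Trend - LIS 10km Global DA",
    "license": "CC0-1.0",
    "is_periodic": False,
    "time_density": None,
    "discovery_items": [
        {
            "collection": "lis-global-da-tws-trend",
            "discovery": "s3",
            "prefix": "EIS/COG/LIS_GLOBAL_DA/DA_Trends/",
            "bucket": "veda-data-store-staging",
            "filename_regex": "(.*)DATWS_STL_based_trend.cog.tif$",
        }
    ],
}


def with_collection_id(payload: dict, collection_id: str) -> dict:
    """Return a deep copy of payload that targets collection_id.

    Assumption: every discovery item should carry the same collection id
    as the top-level payload.
    """
    result = copy.deepcopy(payload)
    result["collection"] = collection_id
    for item in result.get("discovery_items", []):
        item["collection"] = collection_id
    return result


def differing_keys(a: dict, b: dict) -> list:
    """List the top-level keys whose values differ between a and b."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


legacy_payload = with_collection_id(BASE_PAYLOAD, "lis-global-da-tws-trend")
airflow_payload = with_collection_id(BASE_PAYLOAD, "lis-global-da-tws-trend-airflow")

# Only the collection id (and the discovery items that embed it) differ.
print(differing_keys(legacy_payload, airflow_payload))
```

Either payload could then be serialized with `json.dumps` and sent to /dataset/publish; the request itself is left out here because the ingestor's base URL and authentication are not given in this issue.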

After the ingestion runs completed, the STAC records from both pipelines were compared. They look like the following:

Collection

[1]

{
  "id": "lis-global-da-tws-trend-airflow",
  "type": "Collection",
  "links": [
    {
      "rel": "items",
      "type": "application/geo+json",
      "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend-airflow/items"
    },
    {
      "rel": "parent",
      "type": "application/json",
      "href": "https://dev-stac.delta-backend.com/"
    },
    {
      "rel": "root",
      "type": "application/json",
      "href": "https://dev-stac.delta-backend.com/"
    },
    {
      "rel": "self",
      "type": "application/json",
      "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend-airflow"
    }
  ],
  "title": "Terrestrial Water Storage Trend - LIS 10km Global DA",
  "assets": null,
  "extent": {
    "spatial": {
      "bbox": [
        [
          -179.9500000157243, -59.98224871364589, 179.9973980503783,
          89.9999999874719
        ]
      ]
    },
    "temporal": {
      "interval": [["2003-01-01 00:00:00+00", "2003-01-01 00:00:00+00"]]
    }
  },
  "license": "CC0-1.0",
  "keywords": null,
  "providers": null,
  "summaries": {
    "datetime": ["2003-01-01T00:00:00Z"],
    "cog_default": {
      "max": 101.29833221435547,
      "min": -555
    }
  },
  "description": "Gridded trend in terrestrial water storage (theil-sen slope estimation in mm yr-1) from 10km global LIS with assimilation",
  "item_assets": {
    "cog_default": {
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "roles": ["data", "layer"],
      "title": "Default COG Layer",
      "description": "Cloud optimized default layer to display on map"
    }
  },
  "stac_version": "1.0.0",
  "stac_extensions": null,
  "dashboard:is_periodic": false,
  "dashboard:time_density": null
}

[2]

{
  "id": "lis-global-da-tws-trend",
  "type": "Collection",
  "links": [
    {
      "rel": "items",
      "type": "application/geo+json",
      "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend/items"
    },
    {
      "rel": "parent",
      "type": "application/json",
      "href": "https://dev-stac.delta-backend.com/"
    },
    {
      "rel": "root",
      "type": "application/json",
      "href": "https://dev-stac.delta-backend.com/"
    },
    {
      "rel": "self",
      "type": "application/json",
      "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend"
    }
  ],
  "title": "Terrestrial Water Storage Trend - LIS 10km Global DA",
  "assets": null,
  "extent": {
    "spatial": {
      "bbox": [
        [
          -179.9500000157243, -59.98224871364589, 179.9973980503783,
          89.9999999874719
        ]
      ]
    },
    "temporal": {
      "interval": [["2003-01-01 00:00:00+00", "2003-01-01 00:00:00+00"]]
    }
  },
  "license": "CC0-1.0",
  "keywords": null,
  "providers": null,
  "summaries": {
    "datetime": ["2003-01-01T00:00:00Z"],
    "cog_default": {
      "max": 101.29833221435547,
      "min": -555
    }
  },
  "description": "Gridded trend in terrestrial water storage (theil-sen slope estimation in mm yr-1) from 10km global LIS with assimilation",
  "item_assets": {
    "cog_default": {
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "roles": ["data", "layer"],
      "title": "Default COG Layer",
      "description": "Cloud optimized default layer to display on map"
    }
  },
  "stac_version": "1.0.0",
  "stac_extensions": null,
  "dashboard:is_periodic": false,
  "dashboard:time_density": null
}

Items

[1]

{
  "type": "FeatureCollection",
  "context": {
    "limit": 10,
    "matched": 0,
    "returned": 1
  },
  "features": [
    {
      "id": "DATWS_STL_based_trend.cog",
      "bbox": [
        -179.9500000157243, -59.98224871364589, 179.9973980503783,
        89.9999999874719
      ],
      "type": "Feature",
      "links": [
        {
          "rel": "collection",
          "type": "application/json",
          "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend-airflow"
        },
        {
          "rel": "parent",
          "type": "application/json",
          "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend-airflow"
        },
        {
          "rel": "root",
          "type": "application/json",
          "href": "https://dev-stac.delta-backend.com/"
        },
        {
          "rel": "self",
          "type": "application/geo+json",
          "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend-airflow/items/DATWS_STL_based_trend.cog"
        }
      ],
      "assets": {
        "cog_default": {
          "href": "s3://veda-data-store-staging/EIS/COG/LIS_GLOBAL_DA/DA_Trends/DATWS_STL_based_trend.cog.tif",
          "type": "image/tiff; application=geotiff; profile=cloud-optimized",
          "roles": ["data", "layer"],
          "title": "Default COG Layer",
          "description": "Cloud optimized default layer to display on map",
          "raster:bands": [
            {
              "scale": 1.0,
              "nodata": 0.0,
              "offset": 0.0,
              "sampling": "area",
              "data_type": "float64",
              "histogram": {
                "max": 101.29833221435547,
                "min": -555.0,
                "count": 11.0,
                "buckets": [
                  7843.0, 0.0, 2.0, 13.0, 24.0, 77.0, 353.0, 1228.0, 118651.0,
                  9.0
                ]
              },
              "statistics": {
                "mean": -36.01088186359726,
                "stddev": 133.02156258224915,
                "maximum": 101.29833221435547,
                "minimum": -555.0,
                "valid_percent": 29.319745316159253
              }
            }
          ]
        }
      },
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [
            [-179.9500000157243, -59.98224871364589],
            [179.9973980503783, -59.98224871364589],
            [179.9973980503783, 89.9999999874719],
            [-179.9500000157243, 89.9999999874719],
            [-179.9500000157243, -59.98224871364589]
          ]
        ]
      },
      "collection": "lis-global-da-tws-trend-airflow",
      "properties": {
        "proj:bbox": [
          -179.9500000157243, -59.98224871364589, 179.9973980503783,
          89.9999999874719
        ],
        "proj:epsg": 4326.0,
        "proj:shape": [1500.0, 3600.0],
        "end_datetime": "2021-12-31T23:59:59+00:00",
        "proj:geometry": {
          "type": "Polygon",
          "coordinates": [
            [
              [-179.9500000157243, -59.98224871364589],
              [179.9973980503783, -59.98224871364589],
              [179.9973980503783, 89.9999999874719],
              [-179.9500000157243, 89.9999999874719],
              [-179.9500000157243, -59.98224871364589]
            ]
          ]
        },
        "proj:transform": [
          0.09998538835169517, 0.0, -179.9500000157243, 0.0,
          -0.09998816580074518, 89.9999999874719, 0.0, 0.0, 1.0
        ],
        "start_datetime": "2003-01-01T00:00:00+00:00"
      },
      "stac_version": "1.0.0",
      "stac_extensions": [
        "https://stac-extensions.github.io/projection/v1.0.0/schema.json",
        "https://stac-extensions.github.io/raster/v1.1.0/schema.json"
      ]
    }
  ],
  "links": [
    {
      "rel": "items",
      "type": "application/geo+json",
      "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend-airflow/items"
    },
    {
      "rel": "parent",
      "type": "application/json",
      "href": "https://dev-stac.delta-backend.com/"
    },
    {
      "rel": "root",
      "type": "application/json",
      "href": "https://dev-stac.delta-backend.com/"
    },
    {
      "rel": "self",
      "type": "application/json",
      "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend-airflow"
    }
  ]
}

[2]

{
  "type": "FeatureCollection",
  "context": {
    "limit": 10,
    "matched": 1,
    "returned": 1
  },
  "features": [
    {
      "id": "DATWS_STL_based_trend.cog",
      "bbox": [
        -179.9500000157243, -59.98224871364589, 179.9973980503783,
        89.9999999874719
      ],
      "type": "Feature",
      "links": [
        {
          "rel": "collection",
          "type": "application/json",
          "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend"
        },
        {
          "rel": "parent",
          "type": "application/json",
          "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend"
        },
        {
          "rel": "root",
          "type": "application/json",
          "href": "https://dev-stac.delta-backend.com/"
        },
        {
          "rel": "self",
          "type": "application/geo+json",
          "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend/items/DATWS_STL_based_trend.cog"
        }
      ],
      "assets": {
        "cog_default": {
          "href": "s3://veda-data-store-staging/EIS/COG/LIS_GLOBAL_DA/DA_Trends/DATWS_STL_based_trend.cog.tif",
          "type": "image/tiff; application=geotiff; profile=cloud-optimized",
          "roles": ["data", "layer"],
          "title": "Default COG Layer",
          "description": "Cloud optimized default layer to display on map",
          "raster:bands": [
            {
              "scale": 1.0,
              "nodata": 0.0,
              "offset": 0.0,
              "sampling": "area",
              "data_type": "float64",
              "histogram": {
                "max": 101.29833221435547,
                "min": -555.0,
                "count": 11.0,
                "buckets": [
                  7843.0, 0.0, 2.0, 13.0, 24.0, 77.0, 353.0, 1228.0, 118651.0,
                  9.0
                ]
              },
              "statistics": {
                "mean": -36.01088186359726,
                "stddev": 133.02156258224915,
                "maximum": 101.29833221435547,
                "minimum": -555.0,
                "valid_percent": 29.319745316159253
              }
            }
          ]
        }
      },
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [
            [-179.9500000157243, -59.98224871364589],
            [179.9973980503783, -59.98224871364589],
            [179.9973980503783, 89.9999999874719],
            [-179.9500000157243, 89.9999999874719],
            [-179.9500000157243, -59.98224871364589]
          ]
        ]
      },
      "collection": "lis-global-da-tws-trend",
      "properties": {
        "proj:bbox": [
          -179.9500000157243, -59.98224871364589, 179.9973980503783,
          89.9999999874719
        ],
        "proj:epsg": 4326.0,
        "proj:shape": [1500.0, 3600.0],
        "end_datetime": "2021-12-31T23:59:59+00:00",
        "proj:geometry": {
          "type": "Polygon",
          "coordinates": [
            [
              [-179.9500000157243, -59.98224871364589],
              [179.9973980503783, -59.98224871364589],
              [179.9973980503783, 89.9999999874719],
              [-179.9500000157243, 89.9999999874719],
              [-179.9500000157243, -59.98224871364589]
            ]
          ]
        },
        "proj:transform": [
          0.09998538835169517, 0.0, -179.9500000157243, 0.0,
          -0.09998816580074518, 89.9999999874719, 0.0, 0.0, 1.0
        ],
        "start_datetime": "2003-01-01T00:00:00+00:00"
      },
      "stac_version": "1.0.0",
      "stac_extensions": [
        "https://stac-extensions.github.io/projection/v1.0.0/schema.json",
        "https://stac-extensions.github.io/raster/v1.1.0/schema.json"
      ]
    }
  ],
  "links": [
    {
      "rel": "items",
      "type": "application/geo+json",
      "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend/items"
    },
    {
      "rel": "parent",
      "type": "application/json",
      "href": "https://dev-stac.delta-backend.com/"
    },
    {
      "rel": "root",
      "type": "application/json",
      "href": "https://dev-stac.delta-backend.com/"
    },
    {
      "rel": "self",
      "type": "application/json",
      "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend"
    }
  ]
}

Comparison

On comparison, the STAC records from both systems [1] and [2] are identical apart from the collection id and the link URLs derived from it.
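
A comparison like this can be made repeatable by replacing each record's collection id with a placeholder before checking equality. The sketch below is one possible approach, not the method used for this issue: `normalize` and `records_match` are hypothetical helpers, the two sample records are trimmed from the collection records above, and dropping the `context` key is an assumption based on it being generated by the API rather than by the ingest.

```python
def normalize(record, collection_id, placeholder="<collection-id>"):
    """Recursively replace collection_id in all strings and drop 'context'.

    Assumption: 'context' is produced by the STAC API at query time, so it
    is excluded from the comparison.
    """
    if isinstance(record, dict):
        return {
            key: normalize(value, collection_id, placeholder)
            for key, value in record.items()
            if key != "context"
        }
    if isinstance(record, list):
        return [normalize(value, collection_id, placeholder) for value in record]
    if isinstance(record, str):
        return record.replace(collection_id, placeholder)
    return record


def records_match(a, a_id, b, b_id):
    """True if both records are equal once their own ids are normalized."""
    return normalize(a, a_id) == normalize(b, b_id)


# Trimmed versions of the two collection records shown above.
airflow_record = {
    "id": "lis-global-da-tws-trend-airflow",
    "links": [
        {
            "rel": "self",
            "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend-airflow",
        }
    ],
    "summaries": {"cog_default": {"max": 101.29833221435547, "min": -555}},
}
legacy_record = {
    "id": "lis-global-da-tws-trend",
    "links": [
        {
            "rel": "self",
            "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend",
        }
    ],
    "summaries": {"cog_default": {"max": 101.29833221435547, "min": -555}},
}

print(
    records_match(
        airflow_record,
        "lis-global-da-tws-trend-airflow",
        legacy_record,
        "lis-global-da-tws-trend",
    )
)
```

Each record is normalized with its own id, so the longer `-airflow` id is replaced as a whole rather than leaving a suffix behind.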

Note: I did notice one discrepancy: the context["matched"] value is wrong for the Airflow ingestion (0, even though one item is returned). That's an auto-generated value and not the result of any ingestion fault, right @anayeaye?

PI Objective

https://github.com/NASA-IMPACT/veda-architecture/issues/164

anayeaye commented 1 year ago

@slesaad Thanks for posting this comparison. I'm actually not seeing the difference in the number matched for each collection. What endpoint returned that matched 0 result? I tried /collections/<collection-id>/items and the /search endpoints below and see one match for each collection (hopefully I didn't use the wrong collection ids but can't see it yet 🙃 ).

lis-global-da-tws-trend-airflow

curl -X 'GET' \
  'https://dev-stac.delta-backend.com/search?collections=lis-global-da-tws-trend-airflow&limit=10' \
  -H 'accept: application/geo+json' | jq '.context'

{
  "limit": 10,
  "matched": 1,
  "returned": 1
}

lis-global-da-tws-trend

curl -X 'GET' \
  'https://dev-stac.delta-backend.com/search?collections=lis-global-da-tws-trend&limit=10' \
  -H 'accept: application/geo+json' | jq '.context'

{
  "limit": 10,
  "matched": 1,
  "returned": 1
}
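
To repeat this check for both collections and both endpoints mentioned above, the URLs can be built programmatically and each returned `context` object tested for internal consistency. This is only a sketch: `items_url`, `search_url`, and `context_is_consistent` are hypothetical helpers, and the rule that `matched` should never be smaller than `returned` is an assumption about how the context object is meant to behave. No request is sent here; the context values are copied from this issue.

```python
from urllib.parse import urlencode

STAC_API = "https://dev-stac.delta-backend.com"


def items_url(collection_id: str) -> str:
    """URL of the /collections/<collection-id>/items endpoint."""
    return f"{STAC_API}/collections/{collection_id}/items"


def search_url(collection_id: str, limit: int = 10) -> str:
    """URL of the /search endpoint filtered to one collection."""
    query = urlencode({"collections": collection_id, "limit": limit})
    return f"{STAC_API}/search?{query}"


def context_is_consistent(context: dict) -> bool:
    """Assumed rule: the total matched count cannot be below the page size returned."""
    return context["matched"] >= context["returned"]


for collection_id in ("lis-global-da-tws-trend-airflow", "lis-global-da-tws-trend"):
    print(items_url(collection_id))
    print(search_url(collection_id))

# Context reported in the issue description for the Airflow items response.
print(context_is_consistent({"limit": 10, "matched": 0, "returned": 1}))
# Context returned by the /search calls above.
print(context_is_consistent({"limit": 10, "matched": 1, "returned": 1}))
```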