API design: Moving away from spotlights

This issue discusses the API design needed to implement a "site-less" dashboard. It first covers the context of the feature, including breaking and external-facing changes to the API, and then goes over internal design and architecture options. Eventually this issue will be used to communicate breaking changes to partners as well as to generate an ADR detailing the API design decisions.

Context:

The covid dashboard displays COGs for various datasets, some specific to certain sites (spotlights) and some covering the globe, for various dates (generally between 2015 and now). The dashboard depends on the /datasets endpoint of the API to provide, in a single response, all of the dates available for a given dataset, filterable by site (spotlight) (or by global, for non-site-specific datasets).

Originally, the idea behind spotlights was to provide a select few sites for which the user could find a comprehensive set of datasets to illustrate the impact of Covid on the world economy. The sites concept does not scale well for a variety of reasons: it forces science teams to reformat (and restrict) their data to a small(er) geographic extent, and it forces end users to browse an ever-growing list of sites that don't always contain the same datasets (eg: the Great Lakes site only contains 3 water quality datasets and the Suez Canal spotlight only contains ship detections, whereas most other sites have a half dozen datasets).

Going forward we will explore the idea of abandoning individual sites. This means that for non-global (site-specific) datasets we should have some way of indicating to the user where they should zoom in to find the data (since the site-specific datasets sometimes have a very small geographic extent).

Some facts about the data:

For any given date, a non-global dataset will have between 1 and ~100 data files, or COGs (eg: Slowdown and Recovery Proxy maps have about 120 cities)
Each individual COG can cover an extent between ~10km (Water quality in SF Bay) and ~1000km (Nightlights VIIRS covering most of the main islands of Japan).
The sites available for one date might not be the same sites available for a different date, for the same dataset
The COGs for the same site might not have the same bounds from one date to the next, for the same dataset (eg: Water quality for NY might cover the Upper Bay for one date and the Long Island Sound for a different date)

Breaking changes:

Current behaviour:

Querying the /datasets endpoint without specifying a site returns a list of dataset metadata objects that look like:

{
      "id": "water-chlorophyll",
      "name": "Chlorophyll-a Anomaly",
      "type": "raster-timeseries",
      "isPeriodic": false,
      "timeUnit": "day",
      "domain": [
        "2018-01-03T00:00:00Z",
        "2018-01-06T00:00:00Z",
        "2018-01-10T00:00:00Z",
        "2018-01-13T00:00:00Z",
        [TRUNCATED]
        "2021-08-09T00:00:00Z",
        "2021-08-11T00:00:00Z",
        "2021-08-18T00:00:00Z",
        "2021-08-25T00:00:00Z",
        "2021-09-01T00:00:00Z"
      ],
      "source": {
        "type": "raster",
        "tiles": [
          "http://localhost:8000/v1/{z}/{x}/{y}@1x?url=s3://covid-eo-data/oc3_chla_anomaly/anomaly-chl-{spotlightId}-{date}.tif&resampling_method=bilinear&bidx=1&rescale=-100%2C100&color_map=rdbu_r"
        ]
      },
      "backgroundSource": null,
      "exclusiveWith": [
        "agriculture",
        "no2",
        [TRUNCATED]
        "detections-vehicles"
      ],
      "swatch": {
        "color": "#154F8D",
        "name": "Deep blue"
      },
      "compare": null,
      "legend": {
        "type": "gradient",
        "min": "less",
        "max": "more",
        "stops": [
          "#3A88BD",
          "#C9E0ED",
          "#E4EEF3",
          "#FDDCC9",
          "#DE725B",
          "#67001F"
        ]
      },
      "paint": null,
      "info": "Chlorophyll-a is an indicator of algae growth. Redder colors indicate increases in chlorophyll-a and worse water quality. Bluer colors indicate decreases in chlorophyll-a and improved water quality. White areas indicate no change."
    },

where the domain key contains all of the available dates across all of the sites.

If a spotlight id is appended to the query (eg: /datasets/sf) then the domain key will only contain the dates for which that dataset has data for that spotlight.

*Note: if isPeriodic is true then the domain key will only contain 2 dates, the start date and the end date, and it should be assumed that data will exist at each interval specified by the timeUnit key (either day or month) between those two dates.
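As an illustration of the note above, here is a minimal sketch of how a client could expand a periodic domain into the full list of dates. The function name `expand_domain` is hypothetical, and the sketch assumes that monthly domains start on a day that exists in every month (for example the 1st).

```python
from datetime import datetime, timedelta

DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def expand_domain(domain, is_periodic, time_unit):
    """Return every date (as a datetime) implied by a dataset's domain list."""
    dates = [datetime.strptime(d, DATE_FORMAT) for d in domain]
    if not is_periodic:
        # Non-periodic datasets already list every available date.
        return dates
    if time_unit not in ("day", "month"):
        raise ValueError(f"unsupported timeUnit: {time_unit}")

    start, end = dates[0], dates[-1]
    expanded = []
    current = start
    while current <= end:
        expanded.append(current)
        if time_unit == "day":
            current += timedelta(days=1)
        else:
            # Move to the same day of the following month.
            # Assumption: that day exists in every month (eg: the 1st).
            year = current.year + current.month // 12
            month = current.month % 12 + 1
            current = current.replace(year=year, month=month)
    return expanded
```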

The returned data also contains a pre-formatted url (tile.url, the entry under source.tiles in the example above), along with other information about how to display that dataset, what legend colour to use, etc. The frontend can request the data for the desired date by replacing the {date} and {spotlightId} variables in that url with the date and the spotlight id (if specified), respectively, and performing a GET request on the result.

*Note: if the value of the timeUnit field is month, then the {date} field must be formatted as: MMYYYY otherwise if the field's value is day then the date must be formatted as YYYY_MM_DD.
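A rough sketch of the substitution described above, written in Python for illustration only (the dashboard frontend is not Python). The helper name `build_tile_url` is hypothetical; the placeholder names are the ones that appear in the example response, and the date formats follow the note above.

```python
from datetime import datetime

# Date formats as described in the note above: MMYYYY for monthly
# datasets and YYYY_MM_DD for daily datasets.
DATE_FORMATS = {"month": "%m%Y", "day": "%Y_%m_%d"}


def build_tile_url(template, date, time_unit, spotlight_id=None):
    """Fill the {date} and {spotlightId} placeholders of a tile url template.

    str.replace is used instead of str.format so that the {z}/{x}/{y}
    placeholders are left untouched for the map client to fill in.
    """
    url = template.replace("{date}", date.strftime(DATE_FORMATS[time_unit]))
    if spotlight_id is not None:
        url = url.replace("{spotlightId}", spotlight_id)
    return url


template = (
    "http://localhost:8000/v1/{z}/{x}/{y}@1x"
    "?url=s3://covid-eo-data/oc3_chla_anomaly/anomaly-chl-{spotlightId}-{date}.tif"
)
url = build_tile_url(template, datetime(2021, 9, 1), "day", "sf")
```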

Proposed behaviour:

With spotlights, a user would open the dashboard, select a spotlight area to zoom into, and then would be presented with a set of datasets available for that spotlight. As the user toggles different datasets on and off the available dates for each dataset are displayed along the bottom of the dashboard.

Without spotlights, this process will be flipped: the user will connect to the dashboard (which will be presented at the furthest zoom level) and will select a dataset to explore. If this is a non-global dataset, the dashboard will, instead of loading any COGs onto the map, display to the user where they should zoom in to find a COG. We have imagined this as a type of "cluster" point (eg: https://developers.arcgis.com/javascript/latest/sample-code/featurereduction-cluster/) that would disappear once an appropriate zoom level for displaying the COG has been reached.

In order to make this possible the /datasets endpoint will have to return, for a given dataset, all of the available dates, and for each date, some sort of geographic extent, representing the underlying COG, at a high zoom level. The COG's bounding box would be a good place to start - since it's easy to extract from the COG.

The /datasets response would look something like:

{
    "id": "water-chlorophyll",
    "name": "Chlorophyll-a Anomaly",
    "type": "raster-timeseries",
    "isPeriodic": false,
    "timeUnit": "day",
    "domain": [
        {"2018-01-03T00:00:00Z": [[-122.70, 45.51, -122.64, 45.53]]},
        {"2018-01-06T00:00:00Z": [[-122.70, 45.51, -122.64, 45.53], [-122.70, 45.51, -122.64, 45.53]]},
        {"2018-01-10T00:00:00Z": [[-122.70, 45.51, -122.64, 45.53]]},
        {"2018-01-13T00:00:00Z": [[-122.70, 45.51, -122.64, 45.53], [-122.70, 45.51, -122.64, 45.53], [-122.70, 45.51, -122.64, 45.53]]},
        [TRUNCATED]
        {"2021-08-09T00:00:00Z": [[-122.70, 45.51, -122.64, 45.53]]},
        {"2021-08-11T00:00:00Z": [[-122.70, 45.51, -122.64, 45.53]]}
    ],
    "source": {
        "type": "raster",
        "tiles": [
            "http://localhost:8000/v1/{z}/{x}/{y}@1x?url=s3://covid-eo-data/oc3_chla_anomaly/anomaly-chl-{spotlightId}-{date}.tif&resampling_method=bilinear&bidx=1&rescale=-100%2C100&color_map=rdbu_r"
        ]
    },
    "backgroundSource": null,
    "exclusiveWith": [
        "agriculture",
        "no2",
        [TRUNCATED]
        "detections-vehicles"
    ],
    "swatch": {
        "color": "#154F8D",
        "name": "Deep blue"
    },
    "compare": null,
    "legend": {
        "type": "gradient",
        "min": "less",
        "max": "more",
        "stops": [
            "#3A88BD",
            "#C9E0ED",
            "#E4EEF3",
            "#FDDCC9",
            "#DE725B",
            "#67001F"
        ]
    },
    "paint": null,
    "info": "Chlorophyll-a is an indicator of algae growth. Redder colors indicate increases in chlorophyll-a and worse water quality. Bluer colors indicate decreases in chlorophyll-a and improved water quality. White areas indicate no change."
},

Where each of the dates in the domain list will no longer be a datestring, but rather an object with a single key-value pair: the datestring and a list of 1 or more bounding boxes, corresponding to the bounding boxes of the COGs available for that date.

*Note: the domain field can be a list of objects (ie: domain: [{"date_1": [[bbox1], [bbox2]]}, {"date_2": [[bbox1]]}]) or a dictionary with a key for each date (ie: domain: {"date_1": [[bbox1], [bbox2]], "date_2": [[bbox1]]})
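To make the two candidate shapes concrete, here is a small sketch that converts the list-of-objects form into the dictionary form and derives one "cluster" point per COG for a given date. The function names are hypothetical, and the sketch assumes each bounding box is ordered [min_lon, min_lat, max_lon, max_lat] and does not cross the antimeridian.

```python
def domain_list_to_dict(domain):
    """Convert [{date: [bbox, ...]}, ...] into {date: [bbox, ...]}."""
    merged = {}
    for entry in domain:
        for date, bboxes in entry.items():
            merged.setdefault(date, []).extend(bboxes)
    return merged


def bbox_center(bbox):
    """Return the (lon, lat) center of a [min_lon, min_lat, max_lon, max_lat] box."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return ((min_lon + max_lon) / 2, (min_lat + max_lat) / 2)


def cluster_points(domain, date):
    """Return one point per COG available on the given date."""
    by_date = domain_list_to_dict(domain)
    return [bbox_center(bbox) for bbox in by_date.get(date, [])]
```

The dictionary form makes the per-date lookup a single key access, while the list form preserves ordering in clients that do not keep key order.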

The metadata's tile.url object will still contain a url that the frontend can request and which will return a displayable image (however the spotlight id field will no longer be present in the url).

Implementation Options:

There are two main aspects of this implementation: how the API will "tile" the disparate data sources, and how the API will store and serve the metadata describing which dates are available for a given dataset and the geographic extent of each of the files available for each of those dates.

Tiling disparate data sources:

There are 2 options: .vrt files and mosaicJSON.

`.vrt` files:

A .vrt file is an XML file containing the geographic bounds and locations of several COGs. The API, which uses rasterio, can directly "open" a .vrt file as if it were a COG, and the engine will know which of the COGs in the .vrt to pull from, given the requested row, column and zoom level.
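The selection step can be pictured with the conceptual sketch below: given a tile's column, row and zoom level, compute the tile's geographic bounds and keep only the COGs whose bounds intersect it. This is not how GDAL or rasterio implement .vrt reads internally; the function names and the `cog_bounds` mapping are hypothetical, and bounds are assumed to be (west, south, east, north) in degrees.

```python
import math


def tile_bounds(x, y, z):
    """Return (west, south, east, north) in degrees for an XYZ web mercator tile."""
    n = 2 ** z

    def lon(col):
        return col / n * 360.0 - 180.0

    def lat(row):
        return math.degrees(math.atan(math.sinh(math.pi * (1 - 2 * row / n))))

    # Row numbers grow southwards, so row y is the northern edge of the tile.
    return (lon(x), lat(y + 1), lon(x + 1), lat(y))


def intersects(a, b):
    """True if two (west, south, east, north) boxes overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def cogs_for_tile(cog_bounds, x, y, z):
    """Return the paths of the COGs whose bounds overlap the requested tile."""
    tile = tile_bounds(x, y, z)
    return [path for path, bounds in cog_bounds.items() if intersects(bounds, tile)]
```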

MosaicJSON:

MosaicJSON is a more advanced specification and likely more performant. @vincentsarago Do you have any insights as to the advantages of MosaicJSON over a .vrt file for this use case? Also, is it possible to dynamically create mosaics on the fly? Or do those have to be pre-generated?

Metadata:

How to store dataset metadata with the available dates and the geographic extent for those dates is a much more open-ended question. Some possible options include: using a STAC API, implementing a queryable metadata store, or incorporating bounding boxes into the current dataset domain generation logic.

STAC API:

The STAC API option entails creating a datastore-backed STAC API instance where we store a STAC record for each COG in the S3 bucket. Each STAC record has a geographic extent and a timestamp, and the COGs are searchable by date. This makes for a very elegant solution to our needs, and further helps make the Covid API a "plug-and-play", generalizable dashboard, as STAC is a widely accepted metadata standard and many data providers are already using STAC to catalog their data (STAC APIs can also be used to collect datasets stored in different locations into a single interface).
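For reference, a minimal sketch of what a STAC record for a single COG could look like, built as a plain Python dictionary. The function name, the "data" asset key and the choice of STAC version 1.0.0 are assumptions made for illustration; a real deployment would more likely use a STAC library and add collection and link information.

```python
def cog_to_stac_item(item_id, href, bbox, timestamp):
    """Build a minimal STAC Item (as a dict) describing one COG.

    bbox is assumed to be [west, south, east, north] in degrees and
    timestamp an ISO 8601 string such as "2018-01-03T00:00:00Z".
    """
    west, south, east, north = bbox
    return {
        "type": "Feature",
        "stac_version": "1.0.0",  # assumption: version chosen for illustration
        "id": item_id,
        "bbox": [west, south, east, north],
        "geometry": {
            "type": "Polygon",
            # Closed ring: the first and last positions are identical.
            "coordinates": [[
                [west, south],
                [east, south],
                [east, north],
                [west, north],
                [west, south],
            ]],
        },
        "properties": {"datetime": timestamp},
        "assets": {
            "data": {  # hypothetical asset key
                "href": href,
                "type": "image/tiff; application=geotiff; profile=cloud-optimized",
            }
        },
        "links": [],
    }
```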

The cons of implementing a STAC API are in the complexity of such an implementation. A STAC API requires a whole new set of infrastructure to maintain and deploy, including a persistent datastore. This is especially burdensome as the specific features we need at this moment are (somewhat) easily implemented otherwise.

Queryable metadata store:

This is a mid-complexity option where an S3 trigger would keep a persistent datastore (DynamoDB is often used for such tasks) updated with the metadata of files in S3, which would include timestamps and geographic extent. The pros of this solution are that less new infrastructure is needed (an S3 trigger, a processing lambda and a datastore) and that the /datasets endpoint will be able to easily search for available data. The con of this approach is that scientists wanting to deploy their own instances of the dashboard would be forced to re-ingest, into a new datastore, data that they may already have cataloged in STAC.
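A sketch of what the processing lambda could do with an S3 event notification is shown below. The datastore write and the bounding box extraction are passed in as the hypothetical callables `put_item` and `read_bbox`, so the sketch makes no assumption about which datastore is chosen; treating the first path segment of the key as the dataset folder is also an assumption.

```python
from urllib.parse import unquote_plus


def handle_s3_event(event, read_bbox, put_item):
    """Store one metadata record per .tif object in an S3 event notification.

    read_bbox(bucket, key) and put_item(record) are supplied by the caller.
    Returns the number of records written.
    """
    written = 0
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if not key.endswith(".tif"):
            continue
        put_item({
            "bucket": bucket,
            "key": key,
            "dataset": key.split("/")[0],  # assumption: folder name identifies the dataset
            "bbox": read_bbox(bucket, key),
        })
        written += 1
    return written
```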

Incorporate into the current dataset metadata generation:

This option requires no additional infrastructure to be deployed. Currently, the /datasets endpoint collects the available dates for each dataset + site combo from a JSON file stored in the S3 bucket, alongside the rest of the data. This file is regenerated once every 24 hours by a CloudWatch event which triggers a lambda function. The lambda function lists all of the files in the S3 bucket, extracts the date and site (if it exists) fields from the filenames for each dataset and then writes them to a JSON file. The new implementation would perform the same steps, in addition to extracting a geographic extent from each file to store alongside the date (skipping the "site" field, since it's no longer needed in the frontend). The bounding box of a COG is easily extracted from a file using rasterio.

One change required to the approach would be to somehow "flag" which files have been cataloged in the dataset domain file, in order to avoid re-extracting the bounding boxes from files that have already been processed. This can easily be done with S3 file headers.
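The approach described above could be sketched roughly as follows for a single daily dataset. The function name `update_domain`, the filename pattern and the `read_bbox` callable are hypothetical; in a real deployment `read_bbox` might open the file with rasterio and read the dataset bounds, and the set of already cataloged keys might be derived from S3 object metadata rather than passed in.

```python
import re
from datetime import datetime

# Assumed filename pattern for daily datasets: "...-YYYY_MM_DD.tif"
DAILY_DATE = re.compile(r"(?P<date>\d{4}_\d{2}_\d{2})\.tif$")


def update_domain(keys, read_bbox, existing=None, cataloged=()):
    """Add bounding boxes for files that have not been cataloged yet.

    keys: object keys listed from the bucket for one dataset.
    read_bbox(key): returns the [west, south, east, north] bounds of a file.
    existing: the previously generated {date: [bbox, ...]} domain, if any.
    cataloged: keys that were already processed in a previous run.
    Returns the updated domain and the updated set of cataloged keys.
    """
    domain = {date: list(bboxes) for date, bboxes in (existing or {}).items()}
    done = set(cataloged)
    for key in keys:
        if key in done:
            continue  # skip files whose bounds were already extracted
        match = DAILY_DATE.search(key)
        if match is None:
            continue  # not a dated data file
        date = datetime.strptime(match["date"], "%Y_%m_%d")
        domain.setdefault(date.strftime("%Y-%m-%dT00:00:00Z"), []).append(
            list(read_bbox(key))
        )
        done.add(key)
    return dict(sorted(domain.items())), done
```

Only files that are new since the previous run are opened, which keeps the daily regeneration cheap as the bucket grows.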

A note on environments:

I have experienced some frustrations with the current system of domain metadata files stored in S3 when working on datasets first in the dev and staging environments, and then deploying to the production environment. Since the total quantity of data in the bucket is quite large we have not created prod, staging and dev buckets with data duplicated in each. This means that in order to test a new dataset, before making it available in production, I have to first upload the dataset to the production bucket, add a new dataset metadata file for the dataset being tested to my local repository, run the domain metadata generation code locally and upload the metadata file to S3. This pollutes the S3 bucket both with data that might not be deployed to production, and with dataset metadata files for dev branches that have been closed. Whichever solution we choose, I'd like to make it a priority to have a simple and well-defined process for developing and testing new datasets, both locally and in the dev/staging AWS environments, before deploying them to production.

NASA-IMPACT / covid-api