chbrandt / gpt

Geo-Planetary Tools

Merge NPT into GPT; Release GPT. #5

Open chbrandt opened 1 year ago

chbrandt commented 1 year ago

NPT was developed in parallel and has diverged from GPT. It has a cleaner interface for ODE and is stable in the data reduction pipeline. The specific code blocks to be merged are not clear yet.

Tasks:

chbrandt commented 1 year ago

Refactor API

The idea is to move to a more object-oriented (OO) interface focused on data stores, managing data products and datasets in and out of those stores.

The primary functionality of a data store is a "search" function. Ultimately we want to get data products so we can analyse them however necessary. To get products "X, Y, Z", we first need to know about their existence; hence, the "search" function. Data stores organize their products in datasets; datasets don't have to be searched, but can be listed directly.

Back in the day, this library was developed from the store down to the (data) products. This time, let's start from the product and move up to the store. The reason is to give more attention to the products' functionalities.

Data Store

A (spatial) data product is composed of at least one data file, besides the metadata. Among the metadata attributes, geometry is always present; the geometry may be a simple "Point", a "Multi-Polygon", or anything in between.

The structure of the data product -- i.e., the type and quantity of files, and the metadata schema -- varies from dataset to dataset. That structure, including the links to ancillary data, is defined by the metadata fields. Data products in the same dataset are expected to share the same structure.
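For illustration, the metadata of a single product might look like the following. The field names here are hypothetical, not the actual schema of any dataset:

# Hypothetical metadata record for one data product; field names are
# illustrative only, not an actual PDS schema.
product_metadata = {
    "id": "XYZ",
    "dataset": "mro/ctx/edr",
    "geometry": {                      # always present, per the text above
        "type": "Polygon",
        "coordinates": [[[23.1, -4.2], [23.5, -4.2], [23.5, -3.8],
                         [23.1, -3.8], [23.1, -4.2]]],
    },
    "files": {                         # type/quantity varies per dataset
        "image": "XYZ.IMG",
        "browse": "XYZ.JPG",
    },
}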

Handling the metadata/data set is the task of the data store, which defines methods/actions to manage the product(s).

In terms of implementation, we have a Data Store, with one or more Datasets, each with one or more Data Products.

Methods data stores should implement:

A write method is implicit in any data move, for instance when downloading or transforming data. We can call those methods "actions"; a sketch follows below.
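A minimal sketch of that hierarchy and its actions, with names and signatures that are illustrative rather than the final API:

from dataclasses import dataclass, field
from typing import List

@dataclass
class DataProduct:
    """One product: metadata (including geometry) plus one or more files."""
    metadata: dict
    files: List[str] = field(default_factory=list)

@dataclass
class Dataset:
    """A collection of products sharing the same structure."""
    name: str
    products: List[DataProduct] = field(default_factory=list)

    def list(self) -> List[DataProduct]:
        # datasets are listed directly, not searched
        return self.products

    def search(self, **filters) -> List[DataProduct]:
        # the primary action; the matching logic is store-specific
        raise NotImplementedError

@dataclass
class DataStore:
    """Holds one or more datasets and defines the actions (download,
    transform, ...) that move data in and out of the store."""
    name: str
    datasets: List[Dataset] = field(default_factory=list)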

Let's go through a typical workflow when handling images from NASA planetary remote-sensing missions.

Let's consider the Mars Reconnaissance Orbiter (MRO) Context Camera (CTX) Experiment Data Record (EDR) dataset. NASA's Planetary Data System (PDS) provides the Orbital Data Explorer (ODE) interface to access data from different planets and satellites (e.g., Mars, the Moon). ODE provides a REST interface for programmatic access, which is an example of a read-only data store.

ODE data products are usually composed of multiple ancillary files: images, shapefiles, and other/additional metadata.

  1. Search ODE REST for Mars images in the MRO/CTX/EDR dataset. If successful, we receive a JSON payload of results with metadata about the dataset and about each data product found (see the sketch after the next paragraph).
  2. Download the browse (.JPG) and data (.IMG) files associated with each data product.
    • Save the metadata of each data product next to the related files (as a JSON document).
  3. (For each product) Transform the IMG image into a light/space calibrated GeoTIFF image.
    • Save the new (.cog) image in a new directory, part of a new "Science-Ready" dataset.
    • Save a new preview (browse) image from the new data image.
    • Save the new metadata set -- an updated version of "EDR" -- next to the related files.

In this workflow we work with two data stores, "ODE" and "Local", and two datasets, "EDR" and "Science-Ready". The communication with the data stores -- and the actions taken upon the data products -- is done through handlers.
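As a concrete sketch of step 1, a raw query against the ODE REST service could look like the following. The endpoint and parameter names (query, target, ihid, iid, pt, output) follow ODE's public REST interface, but treat them as assumptions to verify against the ODE documentation:

import requests

# Search ODE REST for MRO/CTX/EDR products on Mars.
params = {
    "query": "product",
    "target": "mars",
    "ihid": "MRO",       # instrument host ID
    "iid": "CTX",        # instrument ID
    "pt": "EDR",         # product type
    "output": "JSON",
}
response = requests.get("https://oderest.rsl.wustl.edu/live2/", params=params)
results = response.json()   # JSON payload: dataset metadata plus one entry per product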

Data store interface

import api

List the available data stores:

stores = api.datastores.list()    # list of available data stores
print(stores)
['ode']

Connect to a data store:

ode_ds = api.datastores.connect('ode')
ode_ds.info()
# (information about ODE)

List datasets:

datasets = ode_ds.datasets.list('mars')    # list of available datasets. ODE demands a target body
print(datasets)
[...,
 'mro/ctx/edr',
 'mro/crism/trdr',
 ...]

Create a handler for CTX (EDR):

ctx_edr = ode_ds.dataset('mars', 'mro/ctx/edr')
ctx_edr.info()
# (information about mro/ctx/edr)

Search CTX data products:

products = ctx_edr.search(*args, **kwargs)    # returns a table (GeoDataFrame) with matching results/products
print(products)
# print products metadata

Download CTX products:

# Create a local data store
local_ds = api.datastores.connect('./data')

ctx_images = products.ds.download(local_ds, assets='image')

At this point, "images" from the CTX/EDR dataset have been downloaded to the local data store at ./data. The local data store writes the downloaded images, together with any mandatory asset/ancillary file, under ode/mro/ctx/edr (under ./data). The metadata associated with each product -- updated to point to the local filesystem instead of the remote (ode) data store -- is written next to the image. The same metadata set, containing all data products of the respective dataset (i.e., MRO/CTX/EDR), is merged into the dataset's global table, which serves as the dataset index.
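A sketch of that index update, assuming the index is a plain CSV with one row per product and an "id" column (both assumptions for illustration):

import json
import pandas as pd

def update_index(index_path: str, metadata_path: str) -> None:
    """Merge one product's local metadata into the dataset's index table."""
    with open(metadata_path) as f:
        record = pd.json_normalize(json.load(f))    # flatten nested metadata
    try:
        index = pd.read_csv(index_path)
        index = pd.concat([index, record], ignore_index=True)
        index = index.drop_duplicates(subset="id", keep="last")
    except FileNotFoundError:
        index = record      # first product of the dataset
    index.to_csv(index_path, index=False)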

Suppose that our search returned one product, "XYZ", with an associated image "XYZ.IMG". The structure of the local data store at this point is:

./data/
  `- ode/
    `- mro/
      `- ctx/
        `- edr/
          |- index.csv
          `- products/
            |- XYZ.IMG
            `- XYZ.json
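From there, the local store can be used offline; for example, reading the dataset index back (the column name is an assumption, matching the sketches above):

import pandas as pd

index = pd.read_csv('./data/ode/mro/ctx/edr/index.csv')
print(index['id'])    # the products currently held by the local store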