NERC-CEH / dri_gridded_data

Develop/find some end-user stories #30

Open mattjbr123 opened 3 weeks ago

mattjbr123 commented 3 weeks ago

How will they want to use the available data/interact with it?

This will inform discussions about the API and version control layers

mattjbr123 commented 2 weeks ago

Some helpful comments from @mattfry-ceh on the product description document, to kick-start things here:

Some key use cases include:

  • Modellers can use existing code to consume this gridded time series data
  • Analytical scripts can readily batch extract time series data for a point or averaged over an area
  • DataLabs users can run these types of data extraction scripts to make it easier to do
  • Users can access this type of data over the web - API for extracting data for a grid cell or averaging over a given area.

and

Could we frame this as saying that we are supporting more use cases (web access and API access, better cataloguing, amongst other things) for datasets that are currently largely produced and accessed as netCDF? We imagine a gradual transition to wider usage of these newer cloud approaches, and want to provide the datasets, tools, worked examples, documentation, etc. to help people make that move.

mattjbr123 commented 2 weeks ago

Some more ideas, from the product description document:

Exploratory data analysis
Currently, users wishing to access data stored on the EIDC or CEDA (for example) for analysis have to download the data somewhere to perform their analysis on it, creating multiple copies of the data with unclear provenance. We want to provide the ability to access the ‘one immutable copy’ of the data from users’ code, which we could do by providing code snippets that point to the data, and example notebooks in DataLabs, which allow users to easily access the data in their existing and new analyses.
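
As an illustration of the sort of "code snippet that points to the data", here is a minimal sketch assuming a Zarr copy on S3-compatible object storage read with xarray; the bucket, endpoint and access settings are all placeholders, not the real locations:

```python
# Minimal sketch only: the bucket, endpoint and access settings are placeholders.
# Requires s3fs to be installed for the "s3://" protocol.
import xarray as xr

store_url = "s3://hydro-data/gear-1hrly.zarr"  # hypothetical location of the single copy

ds = xr.open_zarr(
    store_url,
    storage_options={
        "anon": True,  # public data; credentials would go here otherwise
        "client_kwargs": {"endpoint_url": "https://object-store.example.org"},
    },
)

# From here the analysis looks like any other xarray workflow
print(ds)
```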

Fortran modellers
A typical workflow is to read in NetCDF files using Fortran’s NetCDF libraries, do some modelling, and write gridded outputs to NetCDF. These users won’t be interested in data conversion to ARCO or in accessing such data in DataLabs. They will want NetCDF files that can be downloaded to disk so that their models won’t have to be significantly rewritten to accommodate new data storage practices.
How we balance this need against having a “single immutable copy of the data” is difficult, as downloading the data to an external system breaks this somewhat and makes the provenance of datasets created from their models more difficult to track. Would it be possible to run Fortran models on DataLabs as a sort of halfway house? Data could be copied from the central repository to a read-only disk storage mount, which could then be read into the models without needing to change anything more than the file paths, with outputs written to a read-write disk store? Just brainstorming here, really.

Data archival requirement
Many projects now require datasets produced as part of the project to be archived in an ‘official’ data centre. For example, the EIDC takes in a lot of datasets from NERC-funded projects this way, as does the CEDA Archive (run by the JASMIN team). These are the places we ultimately want to store the ‘single immutable copy’ of datasets. Providing a tool that converts data to ARCO and uploads the dataset to the object storage and the EIDC would be fantastic for these users. I’ve spent too much of my time manually converting data to CF compliance, for example! This would involve adding a step to the conversion/ingestion/upload tool that captures the metadata the EIDC/cataloguing software needs. It would make data on the EIDC much more accessible to the “Exploratory Data Analysis” users described above, whilst still supporting the Fortran Modellers user story too.
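
A hedged sketch of what the conversion/ingestion/upload step could look like, assuming xarray and a Zarr target on object storage; the chunk sizes, metadata fields, dimension names and bucket path are placeholder assumptions, and the real tool would also need to push the captured metadata to the EIDC catalogue:

```python
# Placeholder sketch: chunk sizes, attributes and paths are assumptions, not the real tool.
import xarray as xr

# Original NetCDF deposit as supplied by the data provider
ds = xr.open_mfdataset("raw_netcdf/*.nc")

# Re-chunk into an analysis-ready, cloud-optimised (ARCO) layout
ds = ds.chunk({"time": 744, "y": 100, "x": 100})  # placeholder dimension names/sizes

# Capture the catalogue metadata the EIDC/cataloguing software needs
ds.attrs.update(
    {
        "title": "Example gridded rainfall dataset",  # placeholder
        "licence": "OGL",                             # placeholder
        "Conventions": "CF-1.8",
    }
)

# Write straight to object storage as Zarr
ds.to_zarr(
    "s3://archive-bucket/example-dataset.zarr",  # hypothetical target bucket
    mode="w",
    consolidated=True,
    storage_options={"anon": False},
)
```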

mattjbr123 commented 2 weeks ago

Next question to think about: how do the above use cases affect the need for, and the form of, an API?

mattjbr123 commented 2 weeks ago

Fortran modellers
A typical workflow is to read in NetCDF files using Fortran’s NetCDF libraries, do some modelling, and write gridded outputs to NetCDF. These users won’t be interested in data conversion to ARCO or in accessing such data in DataLabs. They will want NetCDF files that can be downloaded to disk so that their models won’t have to be significantly rewritten to accommodate new data storage practices. How we balance this need against having a “single immutable copy of the data” is difficult, as downloading the data to an external system breaks this somewhat and makes the provenance of datasets created from their models more difficult to track. Would it be possible to run Fortran models on DataLabs as a sort of halfway house? Data could be copied from the central repository to a read-only disk storage mount, which could then be read into the models without needing to change anything more than the file paths, with outputs written to a read-write disk store? Just brainstorming here, really.

  • Modellers can use existing code to consume this gridded time series data

The three dominant languages in use in the hydrological community are Python, R and Fortran (note this is just my hunch from experience, I haven't verified it anywhere...)

Modellers using Fortran code will struggle to make use of this product. They will want to be able to download the driving data and have it available locally on disk (which we don't really want). An alternative is for us to provide some bespoke Fortran code, or wrap some Python in Fortran-callable code, that these modellers could use to access the data from the object store instead. This could be a lot of work if Fortran libraries to do this don't already exist. Are there Fortran libraries that integrate with an API?
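
One possible compromise, sketched below, is a small Python helper that pulls a subset of the ARCO copy back down to plain NetCDF on a disk mount, so a Fortran model only needs a changed file path; the store URL, variable name, time slice and output path are all hypothetical:

```python
# Hypothetical "halfway house" helper: pull a subset of the ARCO copy down to
# plain NetCDF on a disk mount so a Fortran model only needs a new file path.
# Store URL, variable name, time slice and output path are all placeholders.
import xarray as xr

ds = xr.open_zarr("s3://hydro-data/gear-1hrly.zarr", storage_options={"anon": True})

subset = ds["rainfall_amount"].sel(time=slice("2010-01-01", "2010-12-31"))

# Write to the disk mount the Fortran model will read from
subset.to_netcdf("/mnt/model_driving_data/gear_2010.nc")
```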

Python or R modellers will hopefully take this up more readily, given we are aiming to make the necessary changes to scripts as minimal as possible. My hunch is that most Python modellers use xarray to work with NetCDF; those that don't might need to change to using it, but we can already point them at a UKCEH training course I developed for xarray (or the vast "array" of training courses for xarray that already exist). I'm not so sure about R, but I would hope we can do similar things to what we can do in Python: provide an intermediate library (possibly with an API) between the object store and the netcdf4 R package that essentially makes the object store appear as a disk to the user.
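
To illustrate how small the change could be for xarray users, a sketch comparing the two open calls; both paths are illustrative assumptions:

```python
# Illustrative comparison only: both paths are placeholders.
import xarray as xr

# Existing workflow: NetCDF file downloaded to local disk
ds_nc = xr.open_dataset("downloads/gear_1hrly_2010.nc")

# Hoped-for new workflow: the same dataset read directly from the object store
ds_zarr = xr.open_zarr(
    "s3://hydro-data/gear-1hrly.zarr",
    storage_options={"anon": True},
)

# Everything downstream (sel, mean, plotting, model coupling) stays the same
```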


Exploratory data analysis Currently users wishing to access data stored on EIDC or CEDA (For example) for analysis have to download the data somewhere to perform their analysis on it, creating multiple copies of the data with unclear provenance. We want to provide the ability to access the ‘one immutable copy’ of the data in users codes, which we could do by providing code snippets that point to the data, and the example notebooks in datalabs, that allow users to easily access the data in their existing and new analyses.

  • Analytical Python or R scripts can readily batch-extract time series data for a point or averaged over an area

People are used to running scripts like this on local and local-ish machines (such as UKCEH private cloud VMs or JASMIN sci servers). For small (fits-in-memory) requests of data, I guess this remains fine. I don't think there's a specific case for an API here from the users' perspective, but we might want one as developers for other development/monitoring reasons. Either way, we would need to supply the template/boilerplate code needed to access the data (which may or may not involve an API). A separate issue is consistent (Python/R) environment setups, but we can provide basic instructions with the example code and wash our hands of the rest ("not our problem"), especially if we're providing other environment-controlled infrastructure on which the code can be run. Such as...:

  • DataLabs. Users can run these types of data extraction scripts to make it [and their data analyses] easier to do...

...if the example code on the EIDC catalogue page for the data links to a notebook on DataLabs (or the JASMIN notebook service, or Jupyter labs on AWS, etc.) with the right environment already installed. Previously, users (myself included) doing this sort of thing have accessed object-storage data using boilerplate "run once and get out of the way" code/code libraries like fsspec or Intake, where all the necessary config can be prefilled/pregenerated for a given dataset and then ignored. All the user then needs to do is input their secrets/credentials if necessary (usually when the dataset is not public) and essentially run code as normal via an xarray open_zarr command. Any API and associated config would have to play nicely with these libraries and be similarly "get out of the way" code.
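
A sketch of that kind of prefilled boilerplate, using fsspec directly; the endpoint, bucket and credential values are assumptions:

```python
# Sketch of prefilled, "run once and get out of the way" boilerplate using fsspec;
# the endpoint, bucket and credential values are assumptions.
import fsspec
import xarray as xr

fs = fsspec.filesystem(
    "s3",
    key="YOUR_ACCESS_KEY",     # only needed when the dataset is not public
    secret="YOUR_SECRET_KEY",
    client_kwargs={"endpoint_url": "https://object-store.example.org"},
)

store = fs.get_mapper("hydro-data/gear-1hrly.zarr")

# After the boilerplate, it is business as usual
ds = xr.open_zarr(store, consolidated=True)
```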


  • Users can access this type of data over the web - API for extracting data for a grid cell or averaging over a given area.

What does "access this type of data over the web" mean? I'm going to assume it means accessing via a website/portal with a clicky GUI and maps. This will definitely need an API to handle the requests coming from the website. Requests such as:

are the simple ones

More complicated ones that would need the aggregation to be processed somewhere are:

Exactly which will be the dominant use case will depend on the dataset.
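
Purely as a sketch of what a grid-cell extraction request to such an API might look like (no such endpoint exists yet, so the URL and every parameter below are hypothetical):

```python
# Hypothetical only: this endpoint and its parameters do not exist (yet).
import requests

resp = requests.get(
    "https://api.example.org/v1/timeseries",
    params={
        "dataset": "gear-1hrly",
        "lat": 51.60,
        "lon": -1.33,
        "start": "2010-01-01",
        "end": "2010-12-31",
        "aggregation": "none",  # e.g. "catchment-mean" for the harder, server-side case
    },
    timeout=60,
)
resp.raise_for_status()
timeseries = resp.json()
```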

For our trial dataset - GEAR 1hrly - extracting time series for spatial points/catchment areas for analysis is probably the dominant use case?
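
For concreteness, a sketch of point and catchment-style extraction with xarray, assuming a Zarr copy of GEAR 1hrly on object storage with projected x/y coordinates; the store URL, variable name, coordinates and the bounding box standing in for a catchment are illustrative only:

```python
# Illustrative sketch: store URL, variable name, coordinates and the bounding
# box standing in for a catchment are all assumptions.
import xarray as xr

ds = xr.open_zarr("s3://hydro-data/gear-1hrly.zarr", storage_options={"anon": True})
rain = ds["rainfall_amount"]

# Time series at (approximately) a single grid cell
point_ts = rain.sel(x=450000, y=180000, method="nearest")

# Time series averaged over a box standing in for a catchment
# (slice direction depends on how the y coordinate is ordered)
box = rain.sel(x=slice(440000, 470000), y=slice(170000, 200000))
catchment_ts = box.mean(dim=["x", "y"])
```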


Data archival requirement
Many projects now require datasets produced as part of the project to be archived in an ‘official’ data centre. For example, the EIDC takes in a lot of datasets from NERC-funded projects this way, as does the CEDA Archive (run by the JASMIN team). These are the places we ultimately want to store the ‘single immutable copy’ of datasets. Providing a tool that converts data to ARCO and uploads the dataset to the object storage and the EIDC would be fantastic for these users. I’ve spent too much of my time manually converting data to CF compliance, for example! This would involve adding a step to the conversion/ingestion/upload tool that captures the metadata the EIDC/cataloguing software needs. It would make data on the EIDC much more accessible to the “Exploratory Data Analysis” users described above, whilst still supporting the Fortran Modellers user story too.

This is more about the rechunking tool stage of the product, so less relevant to the API discussion.

mattjbr123 commented 2 weeks ago

@mjbr, @fsamreen, @dolegi: some user stories for you to have a look at and comment on.