ecmwf / climetlab

Python package for easy access to weather and climate data
Apache License 2.0

Collaborate with Pangeo and others in the scientific python community? #35

Closed rabernat closed 2 years ago

rabernat commented 2 years ago

Greetings and thanks for taking the time and effort to create open source software for the meteorology / climate communities. I sincerely commend this effort! πŸ‘ I am particularly inspired by the stated goal of the climetlab package:

reduce boilerplate code by providing high-level unified access to meteorological and climate datasets, allowing scientists to focus on their research instead of solving technical issues.

I think it's safe to say that very many people in the software community share this goal. It has been a particular focus of the Pangeo project for many years. However, it is also a difficult goal, given the vast diversity of different data providers, catalogs, and data formats that we encounter in the wild. Therefore, I believe that collaboration is essential for achieving this goal.

In this spirit, I would like to invite the climetlab developers to collaborate with the Pangeo project and related python packages. There may be some ways we can combine efforts to deliver more effective software with a lower overall maintenance burden.

A primary possible area of collaboration would be to reduce duplication in functionality across the ecosystem. Reducing duplication is good because it lowers the overall maintenance burden and concentrates community effort on a shared set of well-tested tools.

In that spirit, here are some existing packages that offer functionality similar to climetlab.

Intake

https://github.com/intake/intake


Intake is a lightweight set of tools for loading and sharing data in data science projects. Intake helps you:

  • Load data from a variety of formats (see the current list of known plugins) into containers you already know, like Pandas dataframes, Python lists, NumPy arrays, and more.
  • Convert boilerplate data loading code into reusable Intake plugins
  • Describe data sets in catalog files for easy reuse and sharing between projects and with others.
  • Share catalog information (and data sets) over the network with the Intake server

Documentation is available at Read the Docs.

Status of intake and related packages is available at Status Dashboard

Weekly news about this repo and other related projects can be found on the wiki

Intake is the main tool we currently use in Pangeo to provide "convenient" data access (usually via the intake-xarray plugin). Intake has goals similar to, but an architecture different from, climetlab's data source feature. With intake, one creates a catalog YAML file (example), which specifies the data sources and options for loading the data.

For example, to load the grib example file from the climetlab docs, I would write a YAML file like this:

catalog.yaml

```yaml
plugins:
  source:
    - module: intake_xarray
sources:
  sample_grib_data:
    description: Sample GRIB file
    driver: netcdf
    args:
      urlpath: 'simplecache::https://github.com/ecmwf/climetlab/raw/main/docs/examples/test.grib'
      xarray_kwargs:
        engine: cfgrib
```

and then open it as an xarray dataset

```python
import intake

cat = intake.open_catalog("catalog.yaml")
cat.sample_grib_data.to_dask()
```

Some other intake features that may be useful to this project:

In terms of architecture:

My unsolicited opinion is that the climetlab approach--writing Python code for each new dataset--is not scalable to the volume and diversity of meteorology / climate datasets that exist in the world. Going down that path effectively means writing code to describe the structure / layout of every dataset in the world. Leveraging established community standards for data catalogs, or allowing users to very easily create their own catalogs, seems to me the only viable path forward.
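To make that contrast concrete, here is a minimal, purely illustrative sketch of the catalog-driven approach, with a plain dict standing in for a YAML catalog; the function and entry names are hypothetical (only the test GRIB URL comes from the climetlab docs):

```python
# Illustrative sketch only: datasets described as data (a catalog) rather than
# as one Python module per dataset. Names here are hypothetical.
CATALOG = {
    "sample_grib_data": {
        "urlpath": "https://github.com/ecmwf/climetlab/raw/main/docs/examples/test.grib",
        "engine": "cfgrib",
    },
}

def describe(name: str) -> dict:
    """Look up a dataset entry; adding a dataset means adding a catalog
    entry, not writing new loader code."""
    try:
        return CATALOG[name]
    except KeyError:
        raise KeyError(f"unknown dataset {name!r}; add an entry to the catalog")
```

Scaling this to thousands of datasets then becomes a data-curation problem (maintaining catalog entries) rather than a software-maintenance problem.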

So a specific suggestion would be to refactor climetlab to use intake internally, rather than duplicating much of intake's functionality for data downloading, caching, loading, templating, etc. This would allow you to delete lots of code πŸŽ‰ and lower your maintenance burden. New functionality in terms of data loaders could be pursued upstream as intake plugins.

Pooch

https://github.com/fatiando/pooch


Does your Python package include sample datasets? Are you shipping them with the code? Are they getting too big?

Pooch is here to help! It will manage a data registry by downloading your data files from a server only when needed and storing them locally in a data cache (a folder on your computer).

Here are Pooch's main features:

  • Pure Python and minimal dependencies.
  • Download a file only if necessary (it's not in the data cache or needs to be updated).
  • Verify download integrity through SHA256 hashes (also used to check if a file needs to be updated).
  • Designed to be extended: plug in custom download (FTP, scp, etc) and post-processing (unzip, decompress, rename) functions.
  • Includes utilities to unzip/decompress the data upon download to save loading time.
  • Can handle basic HTTP authentication (for servers that require a login) and printing download progress bars.
  • Easily set up an environment variable to overwrite the data cache location.

Are you a scientist or researcher? Pooch can help you too!

  • Automatically download your data files so you don't have to keep them in your GitHub repository.
  • Make sure everyone running the code has the same version of the data files (enforced through the SHA256 hashes).

Pooch has a much narrower scope than intake. It is extremely stable and solid if what you want to do is download remote files to a local computer. It supports many of the same protocols as climetlab, and some other ones, such as Zenodo-based DOI downloads.

Here is how pooch would be used to download the climetlab test grib data

```python
import pooch
import xarray as xr

catalog = pooch.create(
    path=pooch.os_cache("climetlab"),
    base_url="https://github.com/ecmwf/climetlab/raw/main/docs/examples/",
    registry={
        "test.grib": "md5:6395ffca06c42b8287d4d3f0e6d14d5f"
    }
)

local_file = catalog.fetch("test.grib")
xr.open_dataset(local_file, engine="cfgrib")
```

Here the opportunity for climetlab is to leverage Pooch's downloading / caching capabilities, rather than duplicating similar capabilities internally.
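The download-only-if-needed check at the heart of that capability can be sketched with the standard library alone; this is an illustration of the idea, not pooch's actual implementation:

```python
# Sketch of pooch-style "download only if necessary": re-download when the
# cached file is missing or its hash no longer matches the registry.
import hashlib
from pathlib import Path

def needs_download(path: Path, expected_sha256: str) -> bool:
    """Return True if the cached file is absent or fails hash verification."""
    if not path.exists():
        return True
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return digest != expected_sha256
```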


It may be possible that you looked at these packages and decided that they had feature gaps or bugs that made them unusable for your project. If so, an alternate path could be to work with the upstream libraries to resolve these gaps and bugs, instead of duplicating their functionality. Part of my goal in opening this issue is to state clearly that the broader scientific python community welcomes your involvement in and contributions to upstream packages. We would benefit greatly from your expertise.

Thank you for taking the time to read my long issue. I reiterate my commendation of your efforts to provide open-source software to the community and my alignment with your vision regarding the goals of this package. I welcome a discussion on these topics or any other ways you think we could be collaborating. πŸ™

floriankrb commented 2 years ago

Thanks for the very detailed review of these packages (intake and pooch), and thank you for your interest in CliMetLab.

Regarding intake, we do plan to have interoperability between climetlab and intake

We have in mind a CliMetLab plugin integrating an intake dataset into climetlab.load_dataset(), and also an intake data source exposing any CliMetLab dataset in an intake catalogue (e.g. intake.open_catalog("climetlab")). The final API for the end user could be somewhat similar in both cases. A discussion on how to write these plugins would be a good idea. Additionally, discussing the differences between publishing a dataset with climetlab versus intake could clarify both approaches.

Regarding using pooch for caching

Replacing the caching mechanism of CliMetLab with another package (such as pooch) is indeed always possible. We currently use our own code for caching because we are still exploring different caching mechanisms: invalidation policies, various ways to share the cache between different users, different backends such as S3, partial caching of the data (I am not sure how pooch handles partial downloads using HTTP range requests), and different formats. One starting point would be to integrate pooch into CliMetLab as a source plugin, allowing comparisons and benchmarks on real use cases. This would be a nice way to assess whether switching to pooch is the way to go, and to let users choose.
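For reference, the partial-download idea mentioned above can be sketched with the standard library using an HTTP Range header; the helper names are illustrative and not part of CliMetLab or pooch:

```python
# Illustrative sketch of partial caching: fetch only a byte range of a remote
# file via an HTTP Range header (RFC 7233). Helper names are hypothetical.
import urllib.request

def range_header(start: int, end: int) -> dict:
    """Build a Range header requesting bytes [start, end] inclusive."""
    return {"Range": f"bytes={start}-{end}"}

def fetch_range(url: str, start: int, end: int) -> bytes:
    """Download only the requested byte range; servers that honour the
    header answer with 206 Partial Content."""
    req = urllib.request.Request(url, headers=range_header(start, end))
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```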

CliMetLab is domain-specific

Generally speaking, CliMetLab is a domain-specific package (climate and meteorology) relying heavily on more general-purpose packages such as those in the Pangeo stack. We envision CliMetLab as a thin layer around these, providing Python functionality for scientists, engineers, and data scientists in these domains. While we need to develop some code ourselves, we are always happy to mutualise the effort as much as possible.

floriankrb commented 2 years ago

Closing for now, anybody should feel free to reopen (or open another specific issue) if there are relevant packages that should be considered.