fatiando / rockhound

NOTICE: This library is no longer being developed. Use Ensaio instead (https://www.fatiando.org/ensaio).

Download geophysical models/datasets and load them in Python
BSD 3-Clause "New" or "Revised" License

Integration with Intake #43

Closed. martindurant closed this issue 2 years ago

martindurant commented 5 years ago

Intake is a general-purpose Python package for accessing and cataloguing data. It is in use, amongst other places, by the Pangeo geo-atmospheric collaboration to reference several datasets (see a nice rendering of the nested catalogue structure at https://pangeo-data.github.io/pangeo-datastore/) and has integrations with scientific data collection descriptions such as STAC, THREDDS, and CMIP.

It seems that we have some things in common! In particular, Intake cares about careful description of datasets, a unified user API for inspecting and loading those data (into memory or Dask, various containers), and caching for load-on-first-access.

Let's start a conversation here about a possible integration, e.g., Intake showing the RockHound collections as one of its catalogues.

cc @rabernat

welcome[bot] commented 5 years ago

👋 Thanks for opening your first issue here! Please make sure you filled out the template with as much detail as possible.

You might also want to take a look at our Contributing Guide and Code of Conduct.

leouieda commented 5 years ago

Hi @martindurant thanks for reaching out! I've had my eye on intake for a while but haven't really dug in to see what it can do. It would be great to have some collaboration and not have to reinvent things here.

Right now, we're using Pooch to manage the downloads because that's what we're using for other things and the configuration is minimal. I'm not very invested in having RockHound rely on Pooch if Intake would be a better solution for what we need.
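For context, the Pooch setup is roughly along these lines (just a sketch of the pattern, with a placeholder URL and file name, not RockHound's actual registry):

```python
import pooch

# Sketch of the Pooch pattern (placeholder base_url and file name, not
# RockHound's actual registry).
REGISTRY = pooch.create(
    path=pooch.os_cache("rockhound"),          # local cache directory
    base_url="https://example.org/datasets/",  # placeholder remote location
    registry={
        # None skips hash checking; a real registry pins a SHA256 hash per file
        "example-grid.nc": None,
    },
)

# Download on first access and return the path to the cached local copy.
fname = REGISTRY.fetch("example-grid.nc")
```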

What would be the key selling point for Intake, in your opinion?

leouieda commented 5 years ago

@santisoler it would be great to know your thoughts on this as well

martindurant commented 5 years ago

I can only encourage you to read the documentation: Intake aims to be a small and simple layer over the massive amount of Python code which exists to access various data types in various storage solutions. Intake brings a common API to it all, and a cataloging solution so that you can search through all the known data sources, local and remote, and introspect their details (both metadata and details derived from the data itself) to quickly get what you need, and then it gets out of your way so you can get on with useful analysis.
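To make that concrete, here is a rough sketch of the user-facing side (the catalog URL and entry name are made up, not a real catalog):

```python
import intake

# Open a catalog published somewhere public (placeholder URL).
cat = intake.open_catalog("https://example.org/catalog.yaml")

list(cat)                      # names of the data sources in the catalog
source = cat["some_dataset"]   # hypothetical entry name
print(source.describe())      # metadata, container type, driver, etc.
data = source.read()           # load into memory (pandas, xarray, ...)
```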

Creating a catalog and sharing it is as simple as figuring out the right arguments to your loader (something you need to do anyway) and encoding them into a YAML file, which can then be put in any public place. No server infrastructure involved. There are many other possibilities for dynamic catalog services, though, and we have integrations with THREDDS, CMIP6, STAC...
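For illustration, a minimal catalog entry might look something like this (the entry name, URL, and driver are placeholders; the netcdf driver comes from the intake-xarray plugin):

```python
# Write out a minimal example catalog; normally you would just author the YAML
# file directly and host it anywhere (GitHub, a web server, object storage).
catalog_yaml = """
sources:
  example_grid:
    description: A placeholder gridded dataset
    driver: netcdf
    args:
      urlpath: https://example.org/datasets/example-grid.nc
"""

with open("catalog.yaml", "w") as f:
    f.write(catalog_yaml)

# Anyone with the file (or its URL) can then open it:
import intake
cat = intake.open_catalog("catalog.yaml")
```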

Intake does indeed include data download services, either local copies of remote files ("caching", which is what Pooch does) or local optimised-format versions of the data ("persistence"); but Intake is also designed with cloud-native data in mind, where you access the data in place on a service like S3/GCS/Azure. The ability to do that is format-dependent, e.g., Zarr is a nice cloud-optimised format for netCDF-like array data, whereas HDF5 is less friendly (but newly possible, at least on S3). Similarly, Parquet has become the de facto standard for tabular data, because it allows you to load only what you need.
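As a rough illustration of what in-place, cloud-native access looks like for Zarr (the bucket and store name are made up; reading from S3 needs s3fs installed):

```python
import fsspec
import xarray as xr

# Map a Zarr store on S3 as a key/value store (placeholder bucket/path).
store = fsspec.get_mapper("s3://example-bucket/example.zarr", anon=True)

# Lazy open: only the chunks you actually touch get fetched over the network.
ds = xr.open_zarr(store)
```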

Intake also interfaces nicely with Dask: if you are doing your analysis at scale on a cluster of workers, downloading the whole dataset to each node or mounting ephemeral shared directories is not realistic; instead, you want each worker to grab its own chunks of the data and work on those alone.
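Sketching that workflow (again with a placeholder catalog and entry name):

```python
import intake

cat = intake.open_catalog("https://example.org/catalog.yaml")

# Get a lazy, Dask-backed object instead of reading everything into memory.
darr = cat["some_dataset"].to_dask()

# The computation is scheduled across the cluster; each worker reads only
# the chunks it has been assigned.
result = darr.mean().compute()
```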

I'm not sure what else to say here; experience so far has shown that people have little trouble writing an Intake driver for a number of data providers, and the feedback from those who then access the data through Intake has been pretty positive. Of course, I might not be hearing from those who were dissatisfied, but Intake is young and relatively simple, and we try to be responsive to users' needs.

martindurant commented 5 years ago

For reference, here is the Pangeo master Intake catalog, rendered as static HTML: https://pangeo-data.github.io/pangeo-datastore/master.html, containing quite a few entries in the hierarchical tree. There may even be some overlap with RockHound's collections.