bluesky / tiled

API to structured data
https://blueskyproject.io/tiled
BSD 3-Clause "New" or "Revised" License

Learning from xpublish #523

Open jakirkham opened 1 year ago

jakirkham commented 1 year ago

Thought folks here might find xpublish an interesting source of inspiration for tiled, so I wanted to mention it here.

danielballan commented 1 year ago

Oh yes, thanks! We did look at xpublish early on, but both projects have moved a lot since, so it is a great time to look again.

abkfenris commented 1 year ago

👋 from Xpublish-land.

I just heard from @rsignell-usgs about what you are up to with Tiled, and was thinking it might be worth chatting, but hadn't gotten a chance to post, so good timing.

Xpublish has had a bit of a revitalization over the last 6 months-ish, so there may be some notable changes since the last time you looked. Some of the changes, or the intents behind them, may not be the most clear right now, due to some politics in the communities that we're coming from...

We're still working on the new sales pitch, but it goes something like this:

Xpublish is a pluggable core of an n-dimensional data server designed to take advantage of the Python data ecosystem.

There may be a reason I don't do sales.

While we still support the quick 'I have an Xarray dataset and want to serve Zarr' usage, we've pivoted into taking advantage of a plugin system to allow new serving methods to evolve separately from how data gets loaded into Xpublish. Once we made that change, we very quickly had plugins for OpenDAP, OGC EDR, WMS, and other ways of serving data. This also allows institutions to use their existing data storage infrastructure, instead of needing to build out a whole new one for a specific data server.

Currently we're still focused around xarray.Dataset as our internal interchange format, but we're scheming on tabular and DataTree support once we figure out what works and what doesn't for datasets.

One of the decisions that we've made is that the Xpublish library on its own isn't going to be a turn-key data server designed to be shipped to a data admin who may not know Python. Instead we are taking a bit of a Linux-like approach and having different packaged distributions that include opinionated collections of data serving and loading plugins that a community might use.

I'd love to hear more about what the goals for Tiled are, along with the direction that you're planning on heading.

danielballan commented 1 year ago

Thanks @rsignell-usgs and @jakirkham for the connection.

It may be worth hopping on a call, @abkfenris, with (at least) @dylanmcreynolds, @tacaswell, and @jmaruland in the next couple weeks. We have gotten a lot of value out of conversations with groups working on related projects, to better understand how our requirements relate to others'. It would be great to chat.

In the meantime, I will attempt to summarize here where we are coming from.

We support arrays, tables, sparse arrays, and nested structures of these. In our data model, xarray.Dataset is just a nested structure of arrays that is marked up so that clients that know what "xarray" is (i.e. the Python client) can use xarray to represent it. We are working on adding support for awkward arrays as another first-class structure.
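As a rough illustration of that data model, here is a minimal sketch of a nested structure of arrays carrying optional markup that a client may use or ignore. The class names and the `specs` field are assumptions made for this example, not Tiled's actual classes or wire format.

```python
from dataclasses import dataclass, field

# Hypothetical structure descriptions. ArrayNode, Container, and the
# "specs" field are illustrative assumptions, not Tiled's real classes.


@dataclass
class ArrayNode:
    shape: tuple
    dtype: str


@dataclass
class Container:
    children: dict = field(default_factory=dict)
    # Optional markup that a client may recognize, e.g. ["xarray_dataset"].
    specs: list = field(default_factory=list)


def describe(node, client_knows=()):
    """Choose a client-side representation from the markup the client understands."""
    if isinstance(node, ArrayNode):
        return "array"
    for spec in node.specs:
        if spec in client_knows:
            return spec
    return "container"  # generic fallback: just a nested structure of arrays


dataset = Container(
    children={
        "temperature": ArrayNode(shape=(10, 20), dtype="float64"),
        "time": ArrayNode(shape=(10,), dtype="int64"),
    },
    specs=["xarray_dataset"],
)

# A client that knows "xarray" can build an xarray.Dataset; any other
# client still sees a plain container of arrays.
python_view = describe(dataset, client_knows={"xarray_dataset"})
generic_view = describe(dataset)
```

The point of the sketch is that the markup is advisory: a client that does not recognize it loses convenience but not access to the underlying arrays.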

To summarize, our key aims are search, transcoding, structure-aware partial access, fine-grained access control, and a language-agnostic HTTP API.

abkfenris commented 1 year ago

Quickly, as I'm being summoned for Duplo with my nephew, I'll tag @jhamman @mpiannucci @xaviernogueira who may all be interested in a chat.

We could use our monthly community call if it works. Our next one will be on August 4th at 1 PM Eastern.

mpiannucci commented 1 year ago

I would be interested in chatting for sure

danielballan commented 1 year ago

That sounds good. Let me see if we can make that time work.

abkfenris commented 1 year ago

It sounds like a lot of Xpublish folks aren't going to be able to make our next meeting, so maybe aim for our one after that, September 1st at 1 Eastern?

danielballan commented 1 year ago

@abkfenris If you would send a video link and/or calendar invite to dallan@bnl.gov I can circulate it to people who may be interested on the Tiled side.

abkfenris commented 1 year ago

I'll ask Joe to add you.

danielballan commented 1 year ago

I'm looking forward to connecting tomorrow. Below are my notes on my understanding so far. Perhaps on the call we can walk through this, and Xpublish folks can fill in gaps.


Motivating use case

Both Xpublish and Tiled started from very similar use cases: "I have a dataset, or collection of datasets, and I want to provide chunked, sliceable access over HTTP."

Tiled started by building around the data structures: array, table, and nested structures of these (including xarray.Dataset). Xpublish started by building around xarray.Dataset. Both Tiled and Xpublish are on track to add more types of structures. Tiled added sparse arrays and is working on AwkwardArrays. Xpublish is looking at DataTree and tables too.

HTTP API

Xpublish presents an HTTP API that is recognized as a valid "Zarr", readable by fsspec+zarr and presumably by HTTP+Zarr analogues in other languages.
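To make "chunked, sliceable access over HTTP" concrete, here is a minimal sketch of how a client could work out which chunk objects to request for a slice. It assumes the Zarr v2 default chunk key layout (chunk indices joined with "."). The URL prefix is a made-up example, not a documented Xpublish path.

```python
import itertools


def chunk_keys(shape, chunks, selection, separator="."):
    """Return the Zarr v2-style chunk keys that overlap a slice selection.

    shape, chunks: tuples of ints, one entry per dimension.
    selection: tuple of slice objects (only step 1 is handled in this sketch).
    """
    ranges = []
    for size, chunk, sel in zip(shape, chunks, selection):
        start, stop, step = sel.indices(size)
        if step != 1:
            raise ValueError("this sketch only handles step == 1")
        if stop <= start:
            return []  # an empty selection touches no chunks
        first = start // chunk
        last = (stop - 1) // chunk
        ranges.append(range(first, last + 1))
    return [separator.join(map(str, idx)) for idx in itertools.product(*ranges)]


# A 100x100 array stored in 25x50 chunks; read rows 20..59 and columns 0..49.
keys = chunk_keys(
    shape=(100, 100),
    chunks=(25, 50),
    selection=(slice(20, 60), slice(0, 50)),
)

# Hypothetical URL prefix: each key becomes one HTTP GET.
urls = [f"/datasets/example/zarr/temperature/{key}" for key in keys]
```

Only three of the eight chunks are fetched here, which is the benefit of pairing chunked storage with a key-per-chunk HTTP layout.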

A core use case for Tiled is holding ~1M datasets of heterogeneous structure, size, and shape, such as those produced by an X-ray synchrotron operating for a year or more. Tiled has a custom HTTP API that is designed to support pagination and search/filtering. Tiled could add a /zarr/v3 route, but there are some details to work through on how to make this work for very wide groups or for structure types that do not (AFAICT) map cleanly into Zarr.
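A minimal sketch of what paginated, filtered listing can look like, assuming an in-memory catalog. The response fields and the `offset`/`limit` query parameter names are hypothetical and are not Tiled's actual API.

```python
def search(catalog, predicate=None, offset=0, limit=100):
    """Return one page of matching keys plus a link to the next page, if any."""
    matches = [
        key for key, meta in catalog.items()
        if predicate is None or predicate(meta)
    ]
    page = matches[offset:offset + limit]
    next_offset = offset + limit
    return {
        "data": page,
        "count": len(matches),
        # Hypothetical link format; None signals the last page.
        "next": (
            f"?offset={next_offset}&limit={limit}"
            if next_offset < len(matches)
            else None
        ),
    }


# Ten fake entries: odd scan numbers are "Cu" samples, even ones are "Ni".
catalog = {
    f"scan_{i:04d}": {"sample": "Cu" if i % 2 else "Ni", "scan_id": i}
    for i in range(10)
}


def is_cu(meta):
    return meta["sample"] == "Cu"


first = search(catalog, is_cu, offset=0, limit=2)
last = search(catalog, is_cu, offset=4, limit=2)
```

A client never needs to hold the full listing: it follows `next` until it is `None`, which matters once a single group holds very many entries.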

I should add: Tiled also cares about "scaling down" to easily serve a handful of datasets. Not all Tiled servers are "synchrotron scale". :-)

HTTP Client

Tiled has a custom Python client library, built on httpx, to match its custom HTTP API. It provides users with scipy structures (numpy, pandas, xarray, sparse) or, optionally, their dask analogues. Tiled also ships a prototype-grade React app, in part to validate that the server does not bake in Python idioms which would make it awkward to use from another language.

Xpublish users can just use fsspec+Zarr in Python, and presumably some HTTP+Zarr in other languages.

Database/Scaling

Xpublish holds the contents of its server as a Python dictionary. Tiled began with this method, and still supports it, but lately grew a SQL database (SQLite or PostgreSQL, as appropriate) to enable indexing and serving larger numbers of datasets. Xpublish does not have a database (AFAICT).
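As a sketch of why a database helps once the number of datasets grows, the example below indexes dataset metadata in SQLite so that a filtered, paged lookup does not require scanning a Python dictionary. The table layout and column names are invented for illustration and do not reflect Tiled's schema.

```python
import json
import sqlite3

# Hypothetical schema: one row per dataset, with one searchable column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodes (key TEXT PRIMARY KEY, sample TEXT, metadata TEXT)"
)
conn.execute("CREATE INDEX idx_nodes_sample ON nodes (sample)")

rows = [
    (f"scan_{i:04d}", "Cu" if i % 2 else "Ni", json.dumps({"scan_id": i}))
    for i in range(1000)
]
conn.executemany("INSERT INTO nodes VALUES (?, ?, ?)", rows)


def find_by_sample(sample, limit=5, offset=0):
    """Return one page of dataset keys for a sample, using the index."""
    cursor = conn.execute(
        "SELECT key FROM nodes WHERE sample = ? ORDER BY key LIMIT ? OFFSET ?",
        (sample, limit, offset),
    )
    return [row[0] for row in cursor]


cu_page = find_by_sample("Cu", limit=3)
```

The same query shape works against PostgreSQL with a different driver and placeholder style, which is one reason to keep the filtering in SQL rather than in Python.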

Extensibility

Xpublish advertises pluggable API routes for "serving data in new ways"---e.g. doing data reduction server-side. Tiled also supports pluggable routes, but as more of an "escape hatch" and a way to experiment out of Tiled core.

Tiled has more restricted extension points: registries of serializers, search queries, etc.

Transcoding

Tiled emphasizes support for HTTP clients that may not know or care about the SciPy/Zarr/Arrow ecosystem. Via content negotiation, a client can request data in an extensible range of formats, appropriate for the given data's structure. This sort of thing could be supported in Xpublish via custom routes, but I think it is not core to what Xpublish is doing.
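A minimal sketch of transcoding via content negotiation, assuming a registry keyed by structure family and media type. The registry, decorator, and function names are hypothetical, and the `Accept` parsing ignores q-values for brevity.

```python
import csv
import io
import json

# Hypothetical registry: (structure_family, media_type) -> serializer function.
SERIALIZERS = {}


def register(structure_family, media_type):
    """Decorator that adds a serializer to the registry."""
    def decorator(func):
        SERIALIZERS[(structure_family, media_type)] = func
        return func
    return decorator


@register("table", "application/json")
def table_to_json(rows):
    return json.dumps(rows).encode()


@register("table", "text/csv")
def table_to_csv(rows):
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue().encode()


def negotiate(structure_family, accept_header):
    """Return (media_type, serializer) for the first listed type we can produce."""
    for item in accept_header.split(","):
        media_type = item.split(";")[0].strip()
        serializer = SERIALIZERS.get((structure_family, media_type))
        if serializer is not None:
            return media_type, serializer
    raise LookupError(
        f"no serializer for {structure_family!r} matching {accept_header!r}"
    )


rows = [{"x": 1, "y": 2.5}, {"x": 2, "y": 3.5}]
media_type, serialize = negotiate("table", "image/png, text/csv;q=0.9")
body = serialize(rows)
```

Because the registry is keyed by structure, an image format can be offered for arrays without ever being offered for tables, and new formats can be registered without touching the routing code.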

Security and observability

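Tiled stacks on some general HTTP service features that may or may not be in scope for Xpublish:

  • Prometheus metrics for monitoring server usage and performance
  • for single-user servers, a single-user API key auth (a la jupyter notebook)
  • for multi-user servers, pluggable auth with support for ORCID, Google, GitHub... (a la jupyterhub)
  • access control policies for controlling who can read/write which data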

Some or all of the data sets in Tiled may also be public.

I think Xpublish is focused on public data sets ("publish").

Upload

Tiled recently added the ability to write data from the client over HTTP. I do not know whether this is in scope for Xpublish.

Cloud

Xpublish naturally supports cloud-based data sources well. Tiled has laid track for supporting cloud (e.g. s3) data storage, and it works at "demo" level, but at the moment local file:// access is much further fleshed out.

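Initial thoughts and questions

  • With the database, transcoding features, and security features, the Tiled server is substantially heavier than the Xpublish server. (Compare the server deps.) But its scope and weight match our requirements well.
  • This does reignite my interest in adding a /zarr/v3 route to Tiled and sorting out how to make that work well.
  • Do Xpublish users ever need search, or are they generally focused on a manageable number of data sets?
  • What would it take to support DataTree in Tiled? How do the semantics compare to other nested structures?
  • It may be interesting to dig into the topic "an ecosystem of servers" and understand the vision in more detail.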

abkfenris commented 1 year ago

I'm looking forward to connecting tomorrow.

Me too. I think there are a lot of ways that we can learn from each other and maybe combine forces to solve some shared issues.

Let's see if I can clarify some of these, or if I'll just make things more messy, but maybe the first thing to understand is the communities that are working on Xpublish.

Xpublish is the brainchild of @jhamman (who can probably tell the earlier history better) and really grew out of the Pangeo community. At the time many of us were experimenting with moving our data to Zarr to free it from some of the access limitations of NetCDF (and on days that we've been especially bad, GRIB).

There was lots of experimentation, and some seriously cool work done, but at some point it feels like there was a collective feeling of 'oh we may have run too far ahead and left everyone else behind' and 'we're not gonna get everyone to the Zarr promised land so how do we work with existing data in our awesome new way'. Kerchunk, Pangeo-forge, and Xpublish are some of the efforts that spawned out of that feeling, and they tackle different aspects of those issues.

The current wave of Xpublish has a lot of folks from various IOOS regional data management teams (personally NERACOOS) and other similarly geospatially focused data managers. Many of us feel the tension between existing servers that we may have been mandated to run (ERDDAP, THREDDS for IOOS), supporting stakeholders who need different APIs, the wish to migrate new data to the cloud and more performant storage, and the gravity of existing data.


It may help to jump down to my response to security and observability where I try to relate our 'ecosystem of servers' vision.

HTTP API

Xpublish presents an HTTP API that is recognized as a valid "Zarr", readable by fsspec+zarr and presumably by HTTP+Zarr analogues in other languages.

Xpublish only includes a Zarr v2 router in the core server (and that is up for discussion with Zarr v3 now out), but other serving methods are supported by plugins, such as OpenDAP, EDR API, and WMS.

HTTP Client

Xpublish supports existing clients in many languages for the plugin supported endpoints, but we aren't looking to develop any clients of our own.

Database/Scaling

Xpublish holds the contents of its server as a Python dictionary. Tiled began with this method, and still supports it, but lately grew a SQL database (SQLite or PostgreSQL, as appropriate) to enable indexing and serving larger numbers of datasets. Xpublish does not have a database (AFAICT).

While Xpublish serves the datasets provided to it in a dictionary out of the box, it's possible to reasonably quickly adapt it to a variety of dataset sources via dataset provider plugins. For example, we're currently exploring how to build dataset providers that work with existing ERDDAP or THREDDS catalogs, allowing Xpublish to sit alongside and augment those servers.
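A minimal sketch of the dataset provider idea: the server asks each provider for dataset IDs and for a dataset by ID, so a dictionary and an external catalog can sit side by side. The class and method names below are assumptions for illustration, not Xpublish's plugin API, and the THREDDS-style URL is a placeholder.

```python
class DictProvider:
    """Serves datasets handed to the server in a plain dictionary."""

    def __init__(self, datasets):
        self._datasets = dict(datasets)

    def get_datasets(self):
        return sorted(self._datasets)

    def get_dataset(self, dataset_id):
        return self._datasets.get(dataset_id)


class LazyCatalogProvider:
    """Stands in for a provider backed by an external catalog; loads on first use."""

    def __init__(self, catalog, loader):
        self._catalog = dict(catalog)  # dataset_id -> location
        self._loader = loader          # callable that opens a location
        self._cache = {}

    def get_datasets(self):
        return sorted(self._catalog)

    def get_dataset(self, dataset_id):
        if dataset_id not in self._catalog:
            return None
        if dataset_id not in self._cache:
            self._cache[dataset_id] = self._loader(self._catalog[dataset_id])
        return self._cache[dataset_id]


class Server:
    """Asks each provider in turn; the first one that knows the ID wins."""

    def __init__(self, providers):
        self.providers = list(providers)

    def list_datasets(self):
        return [d for p in self.providers for d in p.get_datasets()]

    def get_dataset(self, dataset_id):
        for provider in self.providers:
            dataset = provider.get_dataset(dataset_id)
            if dataset is not None:
                return dataset
        raise KeyError(dataset_id)


server = Server([
    DictProvider({"air": {"source": "memory"}}),
    LazyCatalogProvider(
        {"sst": "https://example.org/thredds/sst.nc"},  # placeholder URL
        loader=lambda url: {"source": url},  # a real loader would open the data
    ),
])
```

In this sketch the routers that serve data never need to know where a dataset came from, which is what lets serving methods and loading methods evolve separately.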

I'm not sure anyone is currently directly using a database with Xpublish. I've done some initial exploring of whether we could feed it datasets directly from our data workflow engine's metadata store, but I haven't prioritized it, and instead focused on supporting file based catalogs like the other servers my team manages.

Transcoding

Tiled emphasizes support for HTTP clients that may not know or care about the SciPy/Zarr/Arrow ecosystem. Via content negotiation, a client can request data in an extensible range of formats, appropriate for the given data's structure. This sort of thing could be supported in Xpublish via custom routes, but I think it is not core to what Xpublish is doing.

We are taking the tack of custom routes, though mainly looking to support existing protocols via plugins.

Security and observability

Tiled stacks on some general HTTP service features that may or may not be in scope for Xpublish:

  • Prometheus metrics for monitoring server usage and performance
  • for single-user servers, a single-user API key auth (a la jupyter notebook)
  • for multi-user servers, pluggable auth with support for ORCID, Google, GitHub... (a la jupyterhub)
  • access control policies for controlling who can read/write which data

Some or all of the data sets in Tiled may also be public.

I think Xpublish is focused on public data sets ("publish").

Xpublish doesn't currently have opinions on these, but this is where our 'ecosystem of servers' ideas come more into play.

We're still trying to figure out the right way to describe it, but it may help to think of Xpublish like the Linux kernel.

On its own it may not do much, but it provides all the connection points and interfaces to build more powerful tools on top of it. You could start adding those tools in yourself (various plugins), but most people instead start with a distro that's made a lot of those choices of tools (plugins) for them (server distributions).

With our existing plugin system we should be able to implement some security and observability via wrapper plugins. So far that hasn't been a priority of our users (most of us are tasked with making data freely available).
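A minimal sketch of what "wrapper" style security and observability could look like, using plain Python decorators around a request handler. Requests and responses are modeled as dictionaries, and the `X-API-Key` header and all function names are hypothetical. A real plugin would hook into the web framework instead.

```python
from collections import Counter
import functools

# (path, status) -> number of responses; a stand-in for real metrics.
request_counts = Counter()


def with_metrics(handler):
    """Count every response by path and status code."""
    @functools.wraps(handler)
    def wrapper(request):
        response = handler(request)
        request_counts[(request["path"], response["status"])] += 1
        return response
    return wrapper


def require_api_key(valid_keys):
    """Reject requests whose (hypothetical) X-API-Key header is not recognized."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(request):
            key = request.get("headers", {}).get("X-API-Key")
            if key not in valid_keys:
                return {"status": 401, "body": "unauthorized"}
            return handler(request)
        return wrapper
    return decorator


@with_metrics
@require_api_key({"secret"})
def get_dataset(request):
    return {"status": 200, "body": f"data for {request['path']}"}


ok = get_dataset({"path": "/datasets/air", "headers": {"X-API-Key": "secret"}})
denied = get_dataset({"path": "/datasets/air", "headers": {}})
```

Metrics wrap the auth check here so that rejected requests are counted too. The inner handler stays unaware of both concerns.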

Initial thoughts and questions

  • With the database, transcoding features, and security features, the Tiled server is substantially heavier than the Xpublish server. (Compare the server deps.) But its scope and weight match our requirements well.

In general, so far Xpublish server distributions tend to be lighter as we don't have the database or security features.

  • This does reignite my interest in adding a /zarr/v3 route to Tiled and sorting out how to make that work well.

I wonder if this might be one place we can collaborate (possibly also with xarray), and build something for Zarr similar to opendap-protocol that we could use for both of our Zarr v3 implementations.

  • Do Xpublish users ever need search, or are they generally focused on a manageable number of data sets?

So far, I think most of us haven't treated Xpublish as a standalone data server. Instead it's part of a larger ecosystem of servers and direct file access, and we have used STAC catalogs and similar to direct users to its endpoints.

  • What would it take to support DataTree in Tiled? How do the semantics compare to other nested structures?
  • It may be interesting to dig into the topic "an ecosystem of servers" and understand the vision in more detail.

I probably should have answered these questions in reverse order, since this vision leads to a lot of our other choices. Instead I kept my responses in the original order but wrote or at least pondered them out of order, so hopefully I haven't totally muddied the context.

danielballan commented 1 year ago

OK, something clicked for me. This resonates:

'we're not gonna get everyone to the Zarr promised land so how do we work with existing data in our awesome new way'

I think that Tiled and Xpublish are both solving these problems:

In Tiled's context, we're mostly talking about formats: legacy is CSV, specfile, HDF5, etc.; modern is Arrow, Parquet, Zarr, etc. The way out for us is transcoding, an extensible registry of ways to get standard data structures in and out of various serializations.

In Xpublish's context, you're mostly talking about services/APIs (e.g. THREDDS). For you, the way out is proxying services, an extensible ecosystem of API routers.

Is that a useful thought?

abkfenris commented 1 year ago

Yes, exactly. Xpublish is focused on services and APIs.

Some of the APIs in our plugin ecosystem provide format transcoding like services (I've extended the OGC EDR compatible plugin to offer additional formats), but it's not part of the core of Xpublish itself.

xaviernogueira commented 1 year ago

@danielballan just joining the conversation now, I will be at the meeting later this afternoon.

First off, very interesting-looking project, I'm excited to hear more about it.

Second off, and this is probably best as a topic for another day, but I think there is room for conversation regarding hooking up non-dictionary "backends" to Xpublish. We already have some light work in this area with xpublish-intake, my Catalog-To-Xpublish library, which is admittedly not as elegant as it could be (was developed with a strict time budget, would be better organized as a few plugins). But could we generalize an interface to work with a wider variety of nested dataset organization structures (STAC, relational dBases, DataTrees, ...)?

The problem as I see it now is that not all datasets in an organizing structure will have a unique name, and their catalog/tree/database position may matter. Therefore these structures cannot (or should not) be flattened into a dictionary, which breaks the existing DatasetProvider plugin model (is this correct @abkfenris?).

I am meaning to start work on a plugin to hook up a STAC catalog as a backend for xpublish, which was delayed somewhat, but would likely require a new plugin paradigm (i.e., DatasetCatalogProvider?). You got me thinking that maybe a catalog-focused provider plugin hook spec is too narrow, as it excludes databases. Maybe there could be a wider plugin pattern for anything that organizes datasets and provides the opportunity to explore the catalog/database programmatically (but safely w/o injecting SQL) and serve data from it. Or maybe DatasetDbaseProvider and DatasetCatalogProvider could be separate plugin specs?
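A small sketch of the flattening problem, assuming a nested catalog modeled as plain dictionaries. The catalog contents and function names are invented for illustration. Keying by name alone silently drops a dataset, while looking up by position keeps both.

```python
# Hypothetical nested catalog: two organizations each publish an "sst" dataset.
tree = {
    "noaa": {"sst": "dataset:noaa-sst", "air": "dataset:noaa-air"},
    "nasa": {"sst": "dataset:nasa-sst"},
}


def walk(node, prefix=()):
    """Yield (path, dataset) pairs for every leaf in a nested catalog."""
    for name, child in node.items():
        path = prefix + (name,)
        if isinstance(child, dict):
            yield from walk(child, path)
        else:
            yield path, child


def lookup(node, path):
    """Resolve a dataset by its position in the catalog, e.g. 'nasa/sst'."""
    for segment in path.strip("/").split("/"):
        node = node[segment]
    return node


leaves = list(walk(tree))

# Flattening by name alone: the second "sst" overwrites the first.
flat_by_name = {path[-1]: dataset for path, dataset in leaves}

# Keeping the position as the key preserves every dataset.
flat_by_path = {"/".join(path): dataset for path, dataset in leaves}
```

Using the full path as the ID is one way to keep a dictionary-shaped interface, though it still hides the hierarchy from anything that wants to browse the catalog level by level.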

Idk, these are new half-baked thoughts.

jhamman commented 1 year ago

Folks! Just a quick note to say that I'm unfortunately not going to make today's call but I'm excited to see this conversation moving forward. I'll circle back to get the notes and would be happy to engage on any follow ups. Cheers!

danielballan commented 1 year ago

What a fun call! I think we could learn a lot about our respective projects by continuing this conversation.

abkfenris commented 1 year ago

Yup! But I think we may need someone else to take notes. I did an especially poor job this time, or at least I barely captured any of the Xpublish side of the intro.

danielballan commented 1 year ago

The arrival of arraylake on the scene seems relevant to this thread. But I gather that the server code will not be OSS?

abkfenris commented 1 year ago

At least with Xpublish, I think Arraylake is complementary rather than competing.

We've got an Xpublish community meeting tomorrow, so hopefully @jhamman can make it and we can dig into how it fits into the ecosystem more (I missed the Pangeo talk yesterday, so I'm not sure what he said then).

danielballan commented 1 year ago

I'm off tomorrow afternoon so I'll miss that meeting, but I'll be interested in the minutes. I put some initial impressions on how Tiled and Arraylake's approaches compare over in https://github.com/bluesky/tiled/issues/575.