Learning from Arraylake

danielballan commented 1 year ago

The new data service Arraylake was unveiled yesterday. I think there is lots to learn from this project. Here are some notes on initiial impressions. I'd be very interested if anyone from Arraylake has time to chime in.

Arraylake and Tiled have very similar views on the problem to be solved. Specifically, providing a standard API into data, including support for chunked access and slicing, regardless of how it happens to be stored at rest. Their approaches are similar in a lot of ways.

Based on an initial reading, I think the main distinctions are:

Arraylake is taking on version control as a first-class feature from the start, which many other data services also do. Tiled has intentionally not touched that yet. (Not to say it won't.) We do keep a history of revisions of metadata, but if a chunk of data is updated, the old version is simply overwritten.
Arraylake is centered on using Zarr as the protocol for accessing chunked data. Tiled is interested in providing views into the data in a range of formats, as chosen by the client, to support clients that do not understand Zarr (like curl).
Tiled and supporting first-class support for structures that are not arrays: tables with Arrow semantics, sparse arrays, AwkwardArrays, and a perhaps a couple more to come. There are ways to encode these into Zarr, but not really in a first-class way.

It's not exactly clear to me yet how search works in Arraylake, but I am imagine it will have similar goals to Tiled's in this area.

rabernat commented 1 year ago

Thanks for opening this discussion! This feels like an accurate comparison between the two.

However, I'd like to share some additional context about the use cases we have designed Arraylake for. We are really building towards something more like a database rather than a data portal. Specifically, it's very important to us to support high-throughput writing of array data in a robust and consistent way. We have gone to great length to implement a transaction system which provides serializable isolation--the version control features are a pretty straightforward consequence of that design. In our experience, this is where most existing approaches to scientific data management really struggle.

From reading about Tiled, would it be fair to say that the main use case is more centered around serving a semi-static set of data files that have already been generated elsewhere? What is the process like for updating the Tiled catalog?

Arraylake is centered on using Zarr as the protocol for accessing chunked data.

Zarr is the native format. We can wrap Zarr-compatible files via a kerchunk-like approach. We currently assume that reader will always wants to access the data via Zarr. However, our roadmap includes proxies for conversion to other formats, other APIs (e.g. OpenDAP, OGC, etc.), which layer easily on to of the high-performance foundation of Zarr.

It's not exactly clear to me yet how search works in Arraylake,

It's not well documented yet unfortunately. But we have rich metadata search on top of all the Zarr metadata in a repo via JMESpath syntax.

Happy to continue the discussion! We'll set you up with an account so you can kick the tires! We'd love your feedback.

danielballan commented 1 year ago

Hello @rabernat. Very nice of you to make time to comment here, during what I gather is a busy time over at Earthmover. Congrats on your beta launch.

This is really helpful. I see how the version control feature flows naturally from the requirement for robust transactions.

From reading about Tiled, would it be fair to say that the main use case is more centered around serving a semi-static set of data files that have already been generated elsewhere?

We began with the use case, "I've got a bunch of (mostly static) files, and I want to access them via a service, but I also don't want to break any existing file-based workflows." In this mode, Tiled has a read-only view of the files, and it has an internal database---PostgreSQL or SQLite, as appropriate---which enables it to do fast search on metadata as well as structure (shape, chunk, dtype). Here, the database is a pure cache; it could be blown away and reconstructed perfectly. The ground truth is in the files.

About a year ago, having got that first use case about right, we grew into the use case, "I am now ready to trust Tiled as the keeper of ground truth. I will upload my data into Tiled, and let Tiled internally manage the storage." We use the exact same SQL tables as we do for the "passive cache" mode; the difference is just a flag indicating whether a given node is "internally manged" by Tiled---where Tiled owns the ground truth---or "externally managed" (read only).

What is the process like for updating the Tiled catalog?

When users write data into Tiled, there is a POST declaring the new dataset's metadata and structure, followed by one or more PUT requests, potentially in parallel, uploading chunks or partitions of data. The data may be transmitted in many formats; Tiled will transcode and store it in preferred, performant formats (e.g Zarr, Parquet) transparently to the client. We currently do not support resizing or changing the structure after the initial POST. That will be of interest in the future, especially in the context of streaming data in during an experiment. We have discussed a notion of "commit" https://github.com/bluesky/tiled/issues/386. I think 99% of our data sets will only be written to by a single user, but as we firm up this writing capability and make it robust, something in the realm of serializable isolation seems unavoidable.

We are still in the process of understanding our requirements, but I think we are on a trajectory toward a data service backed by a database tracking a mixture of externally-managed (e.g. detector-written) data and internally-managed (user-uploaded) data.

We've gotten a lot of value from chatting with other groups to reflect on how our requirements relate to others' and what opportunities we may be overlooking in both how we frame the problem and build our solution. We'd be very interested in continuing this conversation with you, @jhamman, and any others.

bluesky / tiled

Learning from Arraylake #575