bluesky / tiled

API to structured data
https://blueskyproject.io/tiled
BSD 3-Clause "New" or "Revised" License

Roadmap to storing Bluesky data in Tiled SQL + files #656

Open danielballan opened 8 months ago

danielballan commented 8 months ago

Milestones:

  1. Upload RunStart + RunStop metadata to Tiled container nodes. This has been demonstrated, and motivated some fixes.
  2. Download assets (raw) + event data (as Arrow) and upload it into a Tiled server. This requires nodes backed by multiple data sources, a mixture of internally-managed (event data) and externally-managed (detector data).
  3. Stream Bluesky documents from a Bluesky RunEngine into a Tiled server. Let the client handle batching, at least at first. This requires a PATCH endpoint, with a binary payload in the request body and some query parameters (to be defined) to specify that the data is being appended and, perhaps optionally, where we think we are picking up from.
  4. Consider whether the server should be able to help with / insist on reasonably good batching/chunking for batch read performance. (Keep in mind that the current solution, mongo_normalized, has very poor chunking, so the bar is low.)
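Since milestone 3 leaves batching to the client, here is a minimal sketch of what client-side batching could look like. The `flush` callback stands in for the (not-yet-defined) PATCH request that would append rows to a table data source; the class name and batch size are illustrative, not a settled Tiled API:

```python
class EventBatcher:
    """Accumulate Bluesky Event documents and flush them in batches.

    The flush callback is a placeholder for a hypothetical PATCH
    request that appends rows to a table data source in Tiled.
    """

    def __init__(self, flush, batch_size=1000):
        self._flush = flush  # e.g. a function that PATCHes rows to the server
        self._batch_size = batch_size
        self._rows = []

    def __call__(self, name, doc):
        # RunEngine subscription signature: (name, document)
        if name == "event":
            self._rows.append(doc["data"])
            if len(self._rows) >= self._batch_size:
                self.flush()
        elif name == "stop":
            # Flush any remaining rows at the end of the run.
            self.flush()

    def flush(self):
        if self._rows:
            self._flush(self._rows)
            self._rows = []


# Usage: collect batches in memory instead of PATCHing a server.
batches = []
batcher = EventBatcher(batches.append, batch_size=2)
for i in range(5):
    batcher("event", {"data": {"x": i}})
batcher("stop", {})
```

This keeps the server endpoint simple (append-only) while letting the client tune batch sizes for milestone 4's chunking concerns.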

The above was developed in separate conversations with @tacaswell and @dylanmcreynolds.

danielballan commented 8 months ago

Next steps, developed via Slack discussion with @tacaswell and bounced off of @padraic-shafer on a short call:

This might be everything we need for an MVP streaming Bluesky documents into Tiled, with event data being appended in a table data source backing a container node, and with detector data being registered or uploaded as an array node that is a child of the same container.

danielballan commented 7 months ago

I am rethinking this:

Enable the client to declare a container node that is backed by N data_sources.

If the container is backed by a table and an array, these two things have different structures and performance characteristics. Eliding them into one logical entity is hard, as I found when trying to actually implement it.

In discussion with @padraic-shafer, we developed the idea of simply exposing the event data and the external data as separate child nodes. The event data is very naturally a table. The external data nodes may be whatever they need to be: array, sparse, or even another table.

```python
c[uid]['primary']['data']['table']       # table of event data
c[uid]['primary']['data']['fccd_image']  # detector array data
c[uid]['primary']['data']['pe1_image']   # another detector's array data
c[uid]['primary']['data']['tpx']         # sparse detector data
```

You could still just get everything into an xarray like this:

```python
c[uid]['primary']['data'].read()
```

This may require a backward-incompatible change to databroker.mongo_normalized.

danielballan commented 7 months ago

We would need to make table a reserved word in the data_keys namespace of the event model.

padraic-shafer commented 7 months ago

We would need to make table a reserved word in the data_keys namespace of the event model.

Ahh, now I'm seeing where these two worlds currently clash. Event model has a flat namespace of data_keys that includes the scalar column names and the detector names. It doesn't distinguish the scalar data table and the detector data.
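A tiny sketch of the clash: because `data_keys` is a flat namespace, nothing stops a device from naming a signal `table`, which would then collide with a reserved child-node name. The check below is purely illustrative, not proposed API:

```python
# Hypothetical reserved child-node names in the proposed layout.
RESERVED = {"table"}

def check_data_keys(data_keys):
    """Raise if any data_key collides with a reserved node name."""
    collisions = RESERVED & set(data_keys)
    if collisions:
        raise ValueError(
            f"data_keys may not use reserved names: {sorted(collisions)}"
        )

check_data_keys({"motor": {}, "fccd_image": {}})  # fine
try:
    check_data_keys({"table": {}, "motor": {}})
except ValueError as err:
    print(err)  # data_keys may not use reserved names: ['table']
```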

I think this aspect needs to be poked at some more. I'll gather some background info and come back to this later.

danielballan commented 7 months ago

Yes, that is a significant wrinkle in this. Now I'm thinking about "transparent" / "anonymous" child nodes again....

danielballan commented 7 months ago

To summarize proposals so far:

  1. Container nodes backed by multiple data sources
  2. Just expose the extra layer of nesting (table, array_detector1, array_detector2, ...) to the user.
  3. Mark some nodes as "transparent" / "anonymous", where listing the parent will just show the grandchildren (column names of the table instead of "table").
  4. Add a new structure family, something like union.

danielballan commented 7 months ago

I think (2) and (4) are strongest in my mind at the moment. Either reject the added complexity (2) or handle it explicitly (4) rather than making containers that sometimes act funny (3).

padraic-shafer commented 7 months ago

Here are some use cases I think that we want to handle. Some of them are already implemented, at least in part. I'm hoping that this helps frame the changes we want to make in Tiled (and maybe in Bluesky).

| Use case | HTTP API | Python API |
| --- | --- | --- |
| Naively fetch a "super-table": all data sources merged into one table. | Perhaps. "Large" data like detector images could be represented by an asset URL. In general, do this for any child backed by an external asset? | Yes. Dask DataFrame seems like a reasonable choice for making this transparent with lazy fetching. |
| Fetch all scalar table data, omitting any "large" data arrays. Investigate datasets without incurring major overhead. | Yes. See comment above about (optionally) including links to large data. | Yes. Essentially the same as the HTTP response. |
| Fetch one or more detector arrays with "context". Primary focus is the detector array(s), but with rich data from the scalar table. Essentially the tabular data is "metadata" for the images. | Probably. Would expect an encoded equivalent of what the Python client returns. | Yes. Return an xarray Dataset or DataArray. |
| Fetch one or more detector arrays with no additional "context". Primary focus is the detector array(s) for numerical processing. Contextual data has already been handled "externally". | Yes. This is `/array/full`. | Yes. This is the equivalent of `/array/full`. |
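As a rough illustration of the third case ("arrays with context"), here is a plain-Python sketch of pairing a detector's frames with the scalar-table columns recorded at the same steps, in the spirit of the labeled xarray result the Python client would return. The helper and the detector name `fccd_image` (borrowed from the example above) are placeholders:

```python
def with_context(table, detector_name, detector_frames):
    """Pair each detector frame with the scalar-table row recorded
    at the same step, mimicking an xarray-style labeled result."""
    assert len(detector_frames) == len(table["time"])
    return {
        "name": detector_name,
        "data": detector_frames,  # leading axis is "time"
        "dims": ("time",),
        # Scalar columns serve as coordinates ("metadata" for the images).
        "coords": dict(table),
    }

# Example: 3 steps of a scan, 2x2 frames from a hypothetical detector.
table = {"time": [0.0, 1.0, 2.0], "motor": [10, 20, 30]}
frames = [[[0, 0], [0, 0]]] * 3
labeled = with_context(table, "fccd_image", frames)
```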

I am very interested in hearing from others whether these use cases and suggested outputs seem about right, or what edits are needed.

cjtitus commented 7 months ago

I strongly support "Fetch one or more arrays with context", but I think more work needs to be done on Bluesky/Databroker for this.

As an example, I operate a spectrometer. At each motor step, the detector outputs a 1-d array, which should be labelled by the detector energy bins (it's essentially an MCA). However, these bin values are stored as configuration, since they aren't read every step. So if you look at the detector data in tiled, the xarray for the detector has dimensions [time, dim0], where dim0 is just some useless placeholder index.

I would love the dimensions to be [time, energy], but as far as I can tell, there is no way to do this assignment in bluesky and get the detector axis into xarray. But this is precisely the "rich data" that I most want to return!

danielballan commented 7 months ago

This comes from Ophyd as part of the information reported by describe(). Alongside dtype and shape, it may optionally specify dims, a list of names, perhaps ["energy"] in your case.

If dims is unspecified, Databroker falls back to filling in dim{N}. This feature has been around for a while but is not yet widely used because Databroker 2.x and Tiled are still in pre-release.
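A small sketch of that fallback logic, assuming a describe()-style data_keys entry (the helper name is ours, not Databroker's):

```python
def axis_labels(data_key):
    """Return dimension names for one data_keys entry from describe().

    Uses the optional 'dims' field when the device provided it;
    otherwise falls back to placeholder names dim0, dim1, ...
    The leading axis of event data is always "time".
    """
    shape = data_key.get("shape") or []
    dims = data_key.get("dims")
    if dims and len(dims) == len(shape):
        return ["time", *dims]
    return ["time", *(f"dim{i}" for i in range(len(shape)))]

# With dims specified by the Ophyd device (the MCA case above):
axis_labels({"dtype": "array", "shape": [1024], "dims": ["energy"]})
# -> ["time", "energy"]

# Without dims, the Databroker-style fallback:
axis_labels({"dtype": "array", "shape": [1024]})
# -> ["time", "dim0"]
```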

cjtitus commented 7 months ago

Is that documented anywhere? I see no mention of it in https://blueskyproject.io/bluesky/hardware.html and I haven't seen anything in the Ophyd documentation (which is, honestly, so confusing that I normally skip it and go straight to github). Do the dims actually pull in data from config? It's not useful to have a field that just "happens" to have the right name if it's not actually connected to the data.

ambarb commented 7 months ago

Late to comment, but the word "stream" in the initial description and @padraic-shafer's use case "Fetch one or more detector arrays with 'context'" resonate with me. I agree with some of @cjtitus's points.

The relevant context, if it is a "slow" stepwise scan (meaning motor motion is required and/or detector counting time > 1 s), can include:

  * essential detector descriptors for some

  * essential detector descriptors for all

  * essential complementary data fields for some

The relevant context if it is a fast

  * see above; unique to "fast", but not necessarily mutually exclusive with a slow "in-situ" experiment

In this last point, the consequence right now is that any real "streaming", even if the data collection takes 1 hour, isn't possible using Databroker. I am guessing Tiled hasn't solved this, but I could be wrong. Maybe this isn't for Tiled to solve, but then Bluesky should. But not in a way that breaks the "slow" scans that happen for alignment and complementary data collection.