bluesky / tiled

API to structured data
https://blueskyproject.io/tiled
BSD 3-Clause "New" or "Revised" License

Roadmap to storing Bluesky data in Tiled SQL + files #656

Open danielballan opened 8 months ago

danielballan commented 8 months ago

Milestones:

  1. Upload RunStart + RunStop metadata to Tiled container nodes. This has been demonstrated, and motivated some fixes.
  2. Download assets (raw) + event data (as Arrow) and upload it into a Tiled server. This requires nodes backed by multiple data sources, a mixture of internally-managed (event data) and externally-managed (detector data).
  3. Stream Bluesky documents from a Bluesky RunEngine into a Tiled server. Let the client handle batching, at least at first. This requires a PATCH endpoint, with a binary payload in the request body and some query parameters (to be defined) to specify that the data is being appended and, perhaps optionally, where we think we are picking up from.
  4. Consider whether the server should be able to help with / insist on reasonably good batching/chunking for batch read performance. (Keep in mind that the current solution, mongo_normalized, has very poor chunking, so the bar is low.)
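Since milestone 3 leaves batching to the client, here is a minimal sketch of what client-side batching could look like. The `flush` callback stands in for the (not-yet-defined) PATCH request that would append rows to a table data source; the class name and batch size are illustrative, not a settled Tiled API:

```python
class EventBatcher:
    """Accumulate Bluesky Event documents and flush them in batches.

    The flush callback is a placeholder for a hypothetical PATCH
    request that appends rows to a table data source in Tiled.
    """

    def __init__(self, flush, batch_size=1000):
        self._flush = flush  # e.g. a function that PATCHes rows to the server
        self._batch_size = batch_size
        self._rows = []

    def __call__(self, name, doc):
        # RunEngine subscription signature: (name, document)
        if name == "event":
            self._rows.append(doc["data"])
            if len(self._rows) >= self._batch_size:
                self.flush()
        elif name == "stop":
            # Flush any remaining rows at the end of the run.
            self.flush()

    def flush(self):
        if self._rows:
            self._flush(self._rows)
            self._rows = []


# Usage: collect batches in memory instead of PATCHing a server.
batches = []
batcher = EventBatcher(batches.append, batch_size=2)
for i in range(5):
    batcher("event", {"data": {"x": i}})
batcher("stop", {})
```

This keeps the server endpoint simple (append-only) while letting the client tune batch sizes for milestone 4's chunking concerns.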

The above was developed in separate conversations with @tacaswell and @dylanmcreynolds.

danielballan commented 8 months ago

Next steps, developed via Slack discussion with @tacaswell and bounced off of @padraic-shafer on a short call:

This might be everything we need for an MVP streaming Bluesky documents into Tiled, with event data being appended in a table data source backing a container node, and with detector data being registered or uploaded as an array node that is a child of the same container.

danielballan commented 7 months ago

I am rethinking this:

Enable the client to declare a container node that is backed by N data_sources.

If the container is backed by a table and an array, these two things have different structures and performance characteristics. Eliding them into one logical entity is hard, as I found when trying to actually implement it.

In discussion with @padraic-shafer, we developed the idea of simply exposing the event data and the external data as separate child nodes. The event data is very naturally a table. The external data nodes may be whatever they need to be: array, sparse, or even another table.

```python
c[uid]['primary']['data']['table']       # table of event data
c[uid]['primary']['data']['fccd_image']  # detector array data
c[uid]['primary']['data']['pe1_image']   # another detector's array data
c[uid]['primary']['data']['tpx']         # sparse detector data
```

You could still just get everything into an xarray like this:

```python
c[uid]['primary']['data'].read()
```

This may require a backward-incompatible change to databroker.mongo_normalized.

danielballan commented 7 months ago

We would need to make table a reserved word in the data_keys namespace of the event model.

padraic-shafer commented 7 months ago

We would need to make table a reserved word in the data_keys namespace of the event model.

Ahh, now I'm seeing where these two worlds currently clash. Event model has a flat namespace of data_keys that includes the scalar column names and the detector names. It doesn't distinguish the scalar data table and the detector data.
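A tiny sketch of the clash: because `data_keys` is a flat namespace, nothing stops a device from naming a signal `table`, which would then collide with a reserved child-node name. The check below is purely illustrative, not proposed API:

```python
# Hypothetical reserved child-node names in the proposed layout.
RESERVED = {"table"}

def check_data_keys(data_keys):
    """Raise if any data_key collides with a reserved node name."""
    collisions = RESERVED & set(data_keys)
    if collisions:
        raise ValueError(
            f"data_keys may not use reserved names: {sorted(collisions)}"
        )

check_data_keys({"motor": {}, "fccd_image": {}})  # fine
try:
    check_data_keys({"table": {}, "motor": {}})
except ValueError as err:
    print(err)  # data_keys may not use reserved names: ['table']
```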

I think this aspect needs to be poked at some more. I'll gather some background info and come back to this later.

danielballan commented 7 months ago

Yes, that is a significant wrinkle in this. Now I'm thinking about "transparent" / "anonymous" child nodes again....

danielballan commented 7 months ago

To summarize proposals so far:

  1. Container nodes backed by multiple data sources
  2. Just expose the extra layer of nesting (table, array_detector1, array_detector2, ...) to the user.
  3. Mark some nodes as "transparent" / "anonymous", where listing the parent will just show the grandchildren (column names of the table instead of "table").
  4. Add a new structure family, something like union.

danielballan commented 7 months ago

I think (2) and (4) are strongest in my mind at the moment. Either reject the added complexity (2) or handle it explicitly (4) rather than making containers that sometimes act funny (3).

padraic-shafer commented 7 months ago

Here are some use cases I think that we want to handle. Some of them are already implemented, at least in part. I'm hoping that this helps frame the changes we want to make in Tiled (and maybe in Bluesky).

| Use case | HTTP API | Python API |
| --- | --- | --- |
| Naively fetch a "super-table": all data sources merged into one table. | Perhaps. "Large" data like detector images could be represented by an asset URL. In general, do this for any child backed by an external asset? | Yes. Dask DataFrame seems like a reasonable choice for making this transparent with lazy fetching. |
| Fetch all scalar table data, omitting any "large" data arrays. Investigate datasets without incurring major overhead. | Yes. See comment above about (optionally) including links to large data. | Yes. Essentially the same as the HTTP response. |
| Fetch one or more detector arrays with "context". Primary focus is the detector array(s), but with rich data from the scalar table. Essentially the tabular data is "metadata" for the images. | Probably. Would expect an encoded equivalent of what the Python client returns. | Yes. Return an xarray Dataset or DataArray. |
| Fetch one or more detector arrays with no additional "context". Primary focus is the detector array(s) for numerical processing. Contextual data has already been handled "externally". | Yes. This is `/array/full`. | Yes. This is the equivalent of `/array/full`. |
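As a rough illustration of the third case ("arrays with context"), here is a plain-Python sketch of pairing a detector's frames with the scalar-table columns recorded at the same steps, in the spirit of the labeled xarray result the Python client would return. The helper and the detector name `fccd_image` (borrowed from the example above) are placeholders:

```python
def with_context(table, detector_name, detector_frames):
    """Pair each detector frame with the scalar-table row recorded
    at the same step, mimicking an xarray-style labeled result."""
    assert len(detector_frames) == len(table["time"])
    return {
        "name": detector_name,
        "data": detector_frames,  # leading axis is "time"
        "dims": ("time",),
        # Scalar columns serve as coordinates ("metadata" for the images).
        "coords": dict(table),
    }

# Example: 3 steps of a scan, 2x2 frames from a hypothetical detector.
table = {"time": [0.0, 1.0, 2.0], "motor": [10, 20, 30]}
frames = [[[0, 0], [0, 0]]] * 3
labeled = with_context(table, "fccd_image", frames)
```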

I am very interested in hearing from others whether these use cases and suggested outputs seem about right, or what edits are needed.

cjtitus commented 7 months ago

I strongly support "Fetch one or more arrays with context", but I think more work needs to be done on Bluesky/Databroker for this.

As an example, I operate a spectrometer. At each motor step, the detector outputs a 1-d array, which should be labelled by the detector energy bins (it's essentially an MCA). However, these bin values are stored as configuration, since they aren't read every step. So if you look at the detector data in tiled, the xarray for the detector has dimensions [time, dim0], where dim0 is just some useless placeholder index.

I would love the dimensions to be [time, energy], but as far as I can tell, there is no way to do this assignment in bluesky and get the detector axis into xarray. But this is precisely the "rich data" that I most want to return!

danielballan commented 7 months ago

This comes from Ophyd as part of the information reported by describe(). Alongside dtype and shape, it may optionally specify dims, a list of names, perhaps ["energy"] in your case.

If dims is unspecified, Databroker falls back to filling in dim{N}. This feature has been around for a while but is not yet widely used because Databroker 2.x and Tiled are still in pre-release.
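A small sketch of that fallback logic, assuming a describe()-style data_keys entry (the helper name is ours, not Databroker's):

```python
def axis_labels(data_key):
    """Return dimension names for one data_keys entry from describe().

    Uses the optional 'dims' field when the device provided it;
    otherwise falls back to placeholder names dim0, dim1, ...
    The leading axis of event data is always "time".
    """
    shape = data_key.get("shape") or []
    dims = data_key.get("dims")
    if dims and len(dims) == len(shape):
        return ["time", *dims]
    return ["time", *(f"dim{i}" for i in range(len(shape)))]

# With dims specified by the Ophyd device (the MCA case above):
axis_labels({"dtype": "array", "shape": [1024], "dims": ["energy"]})
# -> ["time", "energy"]

# Without dims, the Databroker-style fallback:
axis_labels({"dtype": "array", "shape": [1024]})
# -> ["time", "dim0"]
```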

cjtitus commented 7 months ago

Is that documented anywhere? I see no mention of it in https://blueskyproject.io/bluesky/hardware.html and I haven't seen anything in the Ophyd documentation (which is, honestly, so confusing that I normally skip it and go straight to github). Do the dims actually pull in data from config? It's not useful to have a field that just "happens" to have the right name if it's not actually connected to the data.

ambarb commented 7 months ago

Late to comment, but the word "stream" in the initial description and @padraic-shafer's use case "Fetch one or more detector arrays with 'context'" resonate with me. I agree with some of @cjtitus's points.

The relevant context, if it is a "slow" stepwise scan (meaning motor motion is required and/or detector counting time > 1 s), can include:

  * essential detector descriptors for some

  * essential detector descriptors for all

  * essential complementary data fields for some

The relevant context if it is a fast

  * see above; unique to "fast", but not necessarily mutually exclusive with a slow "in-situ" experiment

In this last point, the consequence right now is that any real "streaming", even if the data collection takes 1 hour, isn't possible using Databroker. I am guessing Tiled hasn't solved this, but I could be wrong. Maybe this isn't for Tiled to solve, but then Bluesky should. But not in a way that breaks the "slow" scans that happen for alignment and complementary data collection.