Open danielballan opened 8 months ago
Next steps, developed via Slack discussion with @tacaswell and bounced off of @padraic-shafer on a short call:
- Add a `structure_family` column to the `data_sources` table. This information is implicitly knowable by introspecting the `structure`, but it will be useful to have it explicitly available. https://github.com/bluesky/tiled/pull/659
- Enable the client to declare a `container` node that is backed by N `data_sources`. (To start, limit these data sources to the `table` structure family, but this could potentially include some `awkward` structures in the future.)
- [...] `data_sources` part of the response reveals that they come from different places and have different performance characteristics.)
- `PUT /table/full/{path}?data_source={data_source_id}`. For nodes backed by just one data source, this may be optional for simplicity and back-compat.
- `PATCH /table/full/{path}?append`. Notice that the endpoint `/table/...` corresponds to the structure family of the data source, not `/container/...`, so it has (and may continue to grow) the appropriate query params and supported formats.

This might be everything we need for an MVP streaming Bluesky documents into Tiled, with event data being appended in a table data source backing a container node, and with detector data being registered or uploaded as an array node that is a child of the same container.
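
To make the two proposed write endpoints concrete, here is a minimal sketch of how a client might assemble their URLs. The helper name `table_full_url`, the base URL, and the example path are hypothetical. Only the `/table/full/{path}` route and the `data_source` and `append` query parameters come from the list above, and their final form is still under discussion.

```python
from urllib.parse import quote, urlencode


def table_full_url(base_url, path, data_source_id=None, append=False):
    """Build a /table/full/{path} URL (hypothetical helper, not part of Tiled)."""
    url = f"{base_url.rstrip('/')}/table/full/{quote(path.strip('/'))}"
    params = []
    if data_source_id is not None:
        # Selects one of the N data sources backing the node.
        params.append(urlencode({"data_source": data_source_id}))
    if append:
        # Bare flag, as written in the proposal: ?append
        params.append("append")
    return f"{url}?{'&'.join(params)}" if params else url


# Assumed values, for illustration only.
BASE = "http://localhost:8000/api/v1"
put_url = table_full_url(BASE, "scan_1/primary", data_source_id=3)  # target for PUT
patch_url = table_full_url(BASE, "scan_1/primary", append=True)     # target for PATCH
print(put_url)
print(patch_url)
```

For a node backed by a single data source, calling the helper without `data_source_id` gives the plain route, which matches the back-compat note above.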
I am rethinking this:

> Enable the client to declare a `container` node that is backed by N `data_sources`.

If the container is backed by a table and an array, these two things have different structures and performance characteristics. Eliding them into one logical entity is hard, as I found when trying to actually implement it.
In discussion with @padraic-shafer, we developed the idea of simply exposing the event data and the external data as separate child nodes. The event data is very naturally a table. The external data nodes may be whatever they need to be: array, sparse, or even another table.
c[uid]['primary']['data']['table'] # table of event data
c[uid]['primary']['data']['fccd_image'] # detector array data
c[uid]['primary']['data']['pe1_image'] # another detector array data
c[uid]['primary']['data']['tpx'] # sparse detector data
You could still just get everything into an xarray like this:
c[uid]['primary']['data'].read()
This may require a backward-incompatible change to `databroker.mongo_normalized`.
We would need to make `table` a reserved word in the `data_keys` namespace of the event model.
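
As a rough illustration of what reserving `table` would imply, the sketch below scans an event descriptor's `data_keys` for a clash. The names `RESERVED_DATA_KEYS` and `find_reserved_clashes` are hypothetical, and the example descriptors are made up. This is not an existing event-model check.

```python
# Assumption: "table" is the only reserved name under discussion.
RESERVED_DATA_KEYS = frozenset({"table"})


def find_reserved_clashes(descriptor):
    """Return the sorted data_keys that collide with reserved node names."""
    return sorted(RESERVED_DATA_KEYS & set(descriptor.get("data_keys", {})))


# Made-up descriptors: only the "data_keys" mapping matters here.
ok_descriptor = {"data_keys": {"motor": {}, "fccd_image": {}}}
bad_descriptor = {"data_keys": {"motor": {}, "table": {}}}

print(find_reserved_clashes(ok_descriptor))
print(find_reserved_clashes(bad_descriptor))
```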
> We would need to make `table` a reserved word in the `data_keys` namespace of the event model.

Ahh, now I'm seeing where these two worlds currently clash. Event model has a flat namespace of data_keys that includes the scalar column names and the detector names. It doesn't distinguish the scalar data table and the detector data.
I think this aspect needs to be poked at some more. I'll gather some background info and come back to this later.
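
One way to picture the clash: the flat `data_keys` mapping has to be split before the scalar columns and the detector data can live in separate nodes. The sketch below assumes that externally stored data can be recognized by an `external` field on the data key. The function name `split_data_keys` and the example keys and sources are hypothetical.

```python
def split_data_keys(data_keys):
    """Partition a flat data_keys mapping into table columns and external nodes."""
    table_columns, external_nodes = [], []
    for name, info in data_keys.items():
        # Assumption: an "external" field marks data stored outside the event rows.
        if "external" in info:
            external_nodes.append(name)
        else:
            table_columns.append(name)
    return table_columns, external_nodes


# Made-up flat namespace mixing scalar columns and a detector.
data_keys = {
    "motor": {"dtype": "number", "shape": [], "source": "PV:motor"},
    "temperature": {"dtype": "number", "shape": [], "source": "PV:temp"},
    "fccd_image": {
        "dtype": "array",
        "shape": [960, 1000],
        "source": "PV:fccd",
        "external": "FILESTORE:",
    },
}

columns, externals = split_data_keys(data_keys)
print(columns)
print(externals)
```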
Yes, that is a significant wrinkle in this. Now I'm thinking about "transparent" / "anonymous" child nodes again...
To summarize proposals so far:
- [...] (`table`, `array_detector1`, `array_detector2`, ...) to the user.
- [...] `table` instead of "table").
- [...] `union`.

I think (2) and (4) are strongest in my mind at the moment. Either reject the added complexity (2) or handle it explicitly (4) rather than making containers that sometimes act funny (3).
Here are some use cases I think that we want to handle. Some of them are already implemented, at least in part. I'm hoping that this helps frame the changes we want to make in Tiled (and maybe in Bluesky).
case | HTTP API | python API |
---|---|---|
Naively fetch a "super-table" -- all data sources merged into one table. | Perhaps. "Large" data like detector images could be represented by asset URL. In general, do this for any child backed by an external asset? | Yes. Dask DataFrame seems like a reasonable choice for making this transparent with lazy fetching. |
Fetch all scalar table data -- omit any "large" data arrays. Investigate datasets | Yes. See comment above about (optionally) including links to large data. | Yes. Essentially same as HTTP response. |
Fetch one or more detector arrays with "context". Primary focus is the detector array(s) but with rich data from the scalar table. Essentially the tabular data is "metadata" for the images. | Probably. Would expect encoded equivalent of what python client returns. | Yes. Return xarray Dataset or DataArray |
Fetch one or more detector arrays with no additional "context". Primary focus is the detector array(s) for numerical processing. Contextual data has already been handled "externally". | Yes. This is "/array/full". | Yes. This is the equivalent of "/array/full". |

I am very interested in hearing from others whether these use cases and suggested outputs seem about right, or what edits are needed.
I strongly support "Fetch one or more arrays with context", but I think more work needs to be done on Bluesky/Databroker for this.
As an example, I operate a spectrometer. At each motor step, the detector outputs a 1-d array, which should be labelled by the detector energy bins (it's essentially an MCA). However, these bin values are stored as configuration, since they aren't read every step. So if you look at the detector data in tiled, the xarray for the detector has dimensions [time, dim0], where dim0 is just some useless placeholder index.
I would love the dimensions to be [time, energy], but as far as I can tell, there is no way to do this assignment in bluesky and get the detector axis into xarray. But this is precisely the "rich data" that I most want to return!
This comes from Ophyd as part of the information reported by `describe()`. Alongside `dtype` and `shape`, it may optionally specify `dims`, a list of names, perhaps `["energy"]` in your case. If `dims` is unspecified, Databroker falls back to filling in `dim{N}`. This feature has been around for a while but is not yet widely used because Databroker 2.x and Tiled are still in pre-release.
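
The fallback described above can be sketched as follows. This is an illustration of the naming rule only, not Databroker's actual code. The function name `resolve_dims` and the example data keys are hypothetical, and the leading `time` dimension is taken from the `[time, dim0]` example earlier in the thread.

```python
def resolve_dims(data_key):
    """Name the dimensions of one data key, with a leading per-event axis."""
    dims = data_key.get("dims")
    if not dims:
        # No names given: fall back to placeholder names dim0, dim1, ...
        dims = [f"dim{i}" for i in range(len(data_key["shape"]))]
    return ["time", *dims]


# Made-up entries shaped like what describe() reports for a 1-d detector.
unnamed = {"dtype": "array", "shape": [1024]}
named = {"dtype": "array", "shape": [1024], "dims": ["energy"]}

print(resolve_dims(unnamed))
print(resolve_dims(named))
```

Note that this sketch only names the axis. It does not attach any coordinate values, such as the energy bin values, to it.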
Is that documented anywhere? I see no mention of it in https://blueskyproject.io/bluesky/hardware.html and I haven't seen anything in the Ophyd documentation (which is, honestly, so confusing that I normally skip it and go straight to GitHub). Do the `dims` actually pull in data from config? It's not useful to have a field that just "happens" to have the right name if it's not actually connected to the data.
Late to comment, but the word "Stream" in the initial description and @padraic-shafer's description of "Fetch one or more detector arrays with 'context'" resonate with me. I agree with some points of @cjtitus.
- essential detector descriptors for some
- essential detector descriptors for all
- essential complementary data fields for some
- see above unique to "fast" but isn't necessarily mutually exclusive to a slow "in-situ" experiment

In this last point, the consequence right now is that any real "streaming", even if the data collection takes 1 hour, isn't possible using databroker. I am guessing tiled hasn't solved this, but I could be wrong. Maybe this isn't for tiled to solve, but then bluesky should. But not in a way that breaks the "slow" scans that happen for alignment and complementary data collection.
Milestones:
- [...] `container` nodes. This has been demonstrated, and motivated some fixes.
- [...] `PATCH` endpoint, with a binary payload in the request body and some query parameters (to be defined) to specify that the data is being appended and, perhaps optionally, where we think we are picking up from.
- [...] `mongo_normalized`, has very poor chunking, so the bar is low.)

The above was developed in separate conversations with @tacaswell and @dylanmcreynolds.
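
The optional "where we think we are picking up from" check on append could look roughly like the sketch below. Everything here is hypothetical: the class `TableSource`, the `expected_offset` parameter, and the in-memory row list stand in for whatever the server and the still-undefined query parameters end up being.

```python
class TableSource:
    """In-memory stand-in for a table data source that supports appends."""

    def __init__(self):
        self.rows = []

    def append(self, new_rows, expected_offset=None):
        """Append rows, optionally verifying where the client thinks it resumes."""
        if expected_offset is not None and expected_offset != len(self.rows):
            # The client's view is stale or duplicated: refuse rather than corrupt.
            raise ValueError(
                f"expected offset {expected_offset}, but table has {len(self.rows)} rows"
            )
        self.rows.extend(new_rows)
        return len(self.rows)


source = TableSource()
first = source.append([{"motor": 0.0}, {"motor": 0.1}])       # no offset check
second = source.append([{"motor": 0.2}], expected_offset=2)   # offset matches
print(first, second)
```

Making the offset optional keeps simple writers simple, while a streaming writer that retries after a dropped connection can use it to avoid appending the same rows twice.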