bluesky / event-model

data model for event-based data collection and analysis
https://blueskyproject.io/event-model
BSD 3-Clause "New" or "Revised" License

Adding support for detailed and structured data types #214

Open tacaswell opened 3 years ago

tacaswell commented 3 years ago

At one of the beamlines at NSLS-II we have ended up with a handler that is returning a data-frame instead of an array. For the data that it is loading this is quite natural (an area detector plugin that does peak-finding/centroiding on the images for single-photon counting); however, the core of the problem here is that, as written, this handler is not consistent with the descriptor and is currently un-describable.

The Document Model promises that if you look at the descriptor you will know the name, type, and shape of the data in each event (e.g. "there is a field 'x' and it is integers", "there is a field called 'img' and it is a [5, 1028, 962] array"). Within the vocabulary that we have in the descriptor we can not say "within each event you will find a field called 'foo' that is a table that has the columns ...". This is in part because one of the key assumptions we made when developing the document model is that the Event is the lowest level that is allowed to have any structure other than being a homogeneous array.
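
For reference, the most a descriptor can currently say about a field looks roughly like this (the field names here are just illustrative):

# Today's data_keys vocabulary: a source, a jsonschema-style dtype
# ('integer', 'number', 'string', 'boolean', 'array'), and a shape.
# There is no way to say "this field is a table with columns x, y, intensity".
data_keys = {
    "x": {"source": "PV:x", "dtype": "integer", "shape": []},
    "img": {"source": "PV:img", "dtype": "array", "shape": [5, 1028, 962]},
}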

The current handler is "working" because we previously did not actually enforce that the descriptor was telling the truth (the latest round of databroker work + tiled + dask is finally making use of the descriptor, and we are discovering all of the places where we had shape mismatches). Given the possibly very wide-ranging impacts, the mildly existential scope, and the obvious importance of this, we should think carefully and make sure we get it right. I can see a couple of possible ways out of this:

  1. make major changes to descriptors and allow them to be (infinitely) recursive. This was the thing that we were most wary of back when we did the initial definition of the Document Model because (infinitely) recursive definitions of heterogeneous structured data seem like a very hard, still unsolved, problem (the way tiled communicates the structure of a dataframe is to send a 0-row data frame and then let arrow do its thing). I am not enthusiastic about this idea (it will be a lot of work and I stand by many of the reasons we did not want to do this back in the beginning).
  2. make a minor change to the descriptor to allow numpy-style structs as the "type". This would give us enough to fit the tables we have coming out of this handler but keeps us from going down the recursive rabbit hole. It also means that, leaving pandas aside, we can now describe every possible numpy array in the document model (which seems like a good idea). This fits with another idea we have kicked around a bit: that we need to be able to specify more precisely what type of int / float a value is (because the vocabulary we use is the jsonschema one, which gives us "integer" and "number", not uint64LE). If we did this and also embraced variable length, that would accelerate our need to embrace awkward-array (which is good). I believe that this would also be enough to resolve the issues with RGB images that have come up recently.
  3. We could treat the centroid like a flyer and put each reading in its own stream (and maybe include the stream name in the primary event stream?) and each of x, y, ... in a field. This has the upside that it requires no major changes to the model and only some minor changes to the handler, but as written would likely have some performance issues (issues we want to fix eventually anyway). Currently one datum fills in exactly one field in one event; we probably want to expand the notion of what can be done with a handler to allow it to fill in more than one column and more than one event at a time. This requires some major changes to the analysis code, but we should be able to do a database migration and "promote" the old data to the new format.
  4. Expand each column in the table to a field on the top-level event. This is probably the least disruptive of all of the options and will require only a little bit of tweaking to the handler and to the resource/datum documents that the device produces. Like option 3 this will require some changes to the analysis code (only to change some names / level of nesting of access) and, as with 3, we can do a migration to make the old data look correct.

After some internal discussion at NSLS-II we are leaning towards option 2. I think the steps here are:

In this proposal the data key for a field that is a table would look something like:

{
   ...,
   'centroids': {
      ...,
      'shape': [-1],
      'dtype': 'array',
      'detailed_dtype': [('x', 'f4'), ('y', 'f4'), ('intensity', 'f8')],
      ...
   }
}

which says "Each event in the event stream has a field called 'centroids' which is an array of unknown length whose elements are two 32-bit floats with the names 'x' and 'y' and a 64-bit float with the name 'intensity'".
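
A minimal numpy sketch of one event's worth of data matching that key (values made up):

import numpy as np

# One event's payload for 'centroids': a variable-length 1-D array whose
# elements carry named, typed fields.
centroid_dtype = np.dtype([("x", "f4"), ("y", "f4"), ("intensity", "f8")])
centroids = np.array(
    [(10.5, 20.25, 1030.0), (100.0, 7.5, 980.0)],  # two photon hits this event
    dtype=centroid_dtype,
)
print(centroids.shape)         # (2,)   -> 'shape': [-1] in the descriptor
print(centroids["x"])          # [ 10.5 100. ]
print(centroids.dtype.descr)   # [('x', '<f4'), ('y', '<f4'), ('intensity', '<f8')]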

If we assume the color axis is last in a color time series, then we could say

{
   ...,
   'color_video': {
      ...,
      'shape': [1024, 926],
      'dtype': 'array',
      'detailed_dtype': [('R', 'u1'), ('G', 'u1'), ('B', 'u1')],
      ...
   }
}

which says "Each event in this event stream contains a field called 'color_video' that is a 1024 by 926 array and each element of the array is an (R, G, B) tuple of unsigned 8-bit integers".
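
As a sketch (assuming a color-last uint8 frame), numpy can view such a frame as that structured type:

import numpy as np

rgb_dtype = np.dtype([("R", "u1"), ("G", "u1"), ("B", "u1")])

# A color-last frame: 1024 x 926 pixels, 3 unsigned bytes per pixel.
frame = np.zeros((1024, 926, 3), dtype="u1")

# Viewing the 3-byte last axis as one RGB struct gives a (1024, 926) array
# whose elements are named (R, G, B) tuples, matching the proposed data key.
structured = frame.view(rgb_dtype)[..., 0]
print(structured.shape)        # (1024, 926)
print(structured["R"].shape)   # (1024, 926)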

numpy also lets you put arrays inside of your structured types, so an alternate way of spelling the first case is

{
   ...,
   'centroids': {
      ...,
      'shape': [-1],
      'dtype': 'array',
      'detailed_dtype': [('position', 'f4', (2, )), ('intensity', 'f8')],
      ...
   }
}

which says "Each event in the event stream has a field called 'centroids' which is an array of unknown length whose elements are a 2-tuple of 32-bit floats with the name 'position' and a 64-bit float with the name 'intensity'". I think in this case the first spelling is better (because the (x, y) vs (row, col) issue haunts my dreams), but it is worth noting this sort of thing is possible.
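
For completeness, the sub-array spelling in numpy:

import numpy as np

# The alternate spelling: 'position' is itself a length-2 sub-array of f4.
alt_dtype = np.dtype([("position", "f4", (2,)), ("intensity", "f8")])
hits = np.zeros(5, dtype=alt_dtype)
print(hits["position"].shape)   # (5, 2): the sub-array adds a trailing axis
print(hits["intensity"].shape)  # (5,)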

[edited to remove half-finished thoughts that will become a new post...the ctrl-enter vs enter for new-line vs post in gh vs slack vs gmail vs ... is annoying]

tacaswell commented 3 years ago

This would give us 5 levels on which we can organize our data within a single run:

  1. what run is it in
  2. which stream is it in
  3. the fields in the stream
  4. the shape of the data array in a field
  5. a struct inside of the element of the array in each event in the stream

For any given data, it should always be possible to move the structure up or down a level of organization. For example, take the simple case of a motor and a point detector, running a step scan where we want to do multiple sweeps.

At the run level we could do one run per sweep or one run with a stream per sweep. Both are allowed within the model, but they have different access patterns (because databroker is built around the concept of access to a run, the searches are currently built around the start document, and the uids / scan_ids are per-start). So if you are always going to do 10 sweeps and you are always going to want to pull up all ten together, then maybe naming your streams f'pass_{n}' and writing your analysis code to process all of the streams whose names start with 'pass_' is the right thing to do. On the other hand, maybe you normally only do 1 sweep and it makes more sense to have each sweep be in its own run (with stream name 'primary') and when you need to aggregate sweeps you can do that by looping over runs in the analysis code.
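
A rough sketch of the 'streams named pass_n' analysis pattern (assuming databroker v2-style access where a run maps stream names to streams; the helper itself is hypothetical):

def collect_sweeps(run):
    # Hypothetical helper: gather every sweep stored as a stream named
    # 'pass_0', 'pass_1', ... within one run (databroker v2-style BlueskyRun
    # assumed, where iterating yields stream names and .read() returns the
    # stream's data).
    sweeps = {}
    for stream_name in run:
        if stream_name.startswith("pass_"):
            sweeps[stream_name] = run[stream_name].read()
    return sweeps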

If we now look at the case of one sweep per run, we still have some choices in how to structure the streams. The model allows you to have one stream per point (I personally think it is a bad idea, but do not know how to encode "Tom thinks it is a bad idea" in a schema ;)) or one stream with all of the points in it.

The model also allows you to have 2 streams: one for the detector and one for the motor. You could then at analysis time "know" that you need to zip these two streams together (again I think this is a bad idea), or you could put them in the same stream and let the fact that they are in the same event tell you that they should be zipped together.

If we adopt this proposal then we have one more option: we could have a field for the motor and a field for the detector in the events in the stream, or we could have one field that uses a structured data type to stick them together.

The common theme of all of these is how much "pre-aggregation" we are letting the data structures do for us. In all of the cases above we can make it "work", but some of the choices are going to be more painful, both in terms of programming against them and in terms of performance. This pain can show up as having to do too much "zipping" in analysis code or too much "pulling apart".


Another issue we need to think about is how to handle row-major vs column-major in the case where the shape is 1d. I am relatively sure that in the case of 1D (variable length) data, this description works just as well for dataframes (and other notionally columnar data structures) as it does for record arrays; however, the Python-side data structures that represent these things are not the same / particularly interchangeable. I also see very strong arguments for providing the option to do either.

In the case of the detector that prompted this discussion, iirc, we have an hdf5 file that has 3 data sets, which is columnar data. Taking that and re-packing it into a row-major data structure before sending it back to the user (who is likely to transpose it back) seems daft. On the other hand, a different detector I have worked with has a native data layout of c-structs that pack (energy, time, position); reading all of that in to transpose it to columnar before sending it back to the user seems equally daft.
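
To make that concrete, a small sketch of the repacking (names and values made up):

import numpy as np

# Columnar source, e.g. three HDF5 datasets: one array per column.
x = np.asarray([1.0, 2.0, 3.0], dtype="f4")
y = np.asarray([4.0, 5.0, 6.0], dtype="f4")
intensity = np.asarray([10.0, 20.0, 30.0], dtype="f8")

# Re-packing into a row-major record array copies every element...
rows = np.empty(len(x), dtype=[("x", "f4"), ("y", "f4"), ("intensity", "f8")])
rows["x"], rows["y"], rows["intensity"] = x, y, intensity

# ...and a user who wants columns just pulls them back apart again.
x_again = rows["x"]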

tacaswell commented 3 years ago

Attn @thomascobb, @callumforrester, @lsammut, @clintonroy (whom it will not let me assign).

tacaswell commented 3 years ago

Another thought that I missed earlier: embracing the variable length and extra structure makes awkward-array look a lot more promising (and pushes us towards data patterns that our friends in the HEP world also have).

danielballan commented 3 years ago

My sense from the Pilot call was that folks see this as a positive change and a natural extension of what we have. Specific points:

untzag commented 3 years ago
Expand each column in the table to a field on the top-level event. This is probably the least disruptive of all of the options and will require only a little bit of tweaking to the handler and to the resource/datum documents that the device produces. Like option 3 this will require some changes to the analysis code (only to change some names / level of nesting of access) and, as with 3, we can do a migration to make the old data look correct.

Can anyone explain what downsides there are to this approach? I think NumPy structured arrays are cool, but I'm skeptical that Bluesky needs them.

callumforrester commented 3 years ago

Further to a discussion from a few months ago with @tacaswell and @danielballan, does this help to batch large amounts of asynchronously captured data into events or would you still use event paging for that?

tacaswell commented 3 years ago

@callumforrester If your large batches can be written as a block that can then be described as a structured array: yes. You would still also have events, and events can still be packed into pages. This now lands you on a data structure that fits very badly into either a data frame (they really really want the type of the values in the columns to be simple scalars) or an xarray (which wants to think of data as a regular cube with labeled axes).

As mentioned above, I think the escape hatch here is https://awkward-array.readthedocs.io/en/latest/ (out of the HEP community), which handles this case extremely well (they have lots of by-bin aggregated data with a variable-length fastest axis).

A case where this would work well is fly scans: where some hardware system is coordinating (x, y, t) and triggering a camera, you could have an event with 2 fields ["image", "the_data_from_hardware"] with the data keys (assuming 128 points in a line)

{
   "image": {
      "dtype": "array",
      "detailed_dtype": "u4",
      "shape": [128, 2028, 1024],
      "external": "FILESTORE:",
   },
   "the_data_from_hardware": {
      "dtype": "array",
      "detailed_dtype": [["x", "f4"], ["y", "f4"], ["time", "u4"]],
      "shape": [128],
      "external": "FILESTORE:",
   }
}

and then "number of rows" events. As @untzag points out option 4 is not so bad in this case, you promote each of "x", "y", and "t" (which is some cases may be better from a data access for analysis point of view!), however there are some cases where is problematic from both an implementation and an conceptual stand point.

From an implementation point of view this means that instead of having 1 resource per run and 1 datum per event we will have at least 3 datums per event (and 3 trips through the filler machinery). We have found that this can be a major performance bottleneck.

From a conceptual point of view, let's look at either the motivating case here (an in-detector feature finder for locating single-photon spots, so we get (x, y, intensity)) or a pixelated energy-sensitive photon-counting detector (where we get a stream of (index, timestamp, energy)). If we then want to use these detectors in a step scan, we would have data keys that look like

{
   "x": {
      "dtype": "number",
      "detailed_dtype": "f4",
      "shape": [],
   },
   "found_centroids_1": {
      "dtype": "array",
      "detailed_dtype": [["x", "f4"], ["y", "f4"], ["intensity", "u2"]],
      "shape": [-1],
      "external": "FILESTORE:",
   }
}

That is "at ever point we measure the position of the x-stage and the (variable) number of photon hits on the camera". This makes a point about giving us another nested namespace (which de-conflicts the 'x'). Assuming we could solve that, we would then have 1 scalar key and 3 variable length keys that we don't really know how they go together, but maybe assuming you can zip all of the non-scalar arrays. However, that assumption falls apart as soon as you add 'found_centroids_2' from a second detector. If you had projected the "inner" columns up to the event you would then have 6 variable columns that should be group by 3s and the data structure does not tell you which 3 should be grouped together. You could start doing heuristics / semantics in the name, but that violates one of the design principles of Bluesky (the names mean something to the humans but not to the computer) [to be fair, we do violate this a bit by banning '.' due to Mongo...but that is because in the case of Mongo the strings do mean something to the computer]. Hence, I think that without structured arrays there are "natural" groupings in the data that we will not be able to express.

Adopting structured arrays also lets us punt a bit longer on sorting out how to reform Resource/Datum to be able to fill more than one field / one event at a time.


I think NumPy structured arrays are cool, but I'm skeptical that Bluesky needs them.

This is the right attitude! We have made it this far without them. To some degree my publicly verbose comments on this are as much about convincing myself this is a good idea as anyone else ;)


Getting a bit further ahead, looking at https://awkward-array.readthedocs.io/en/latest/_auto/ak.from_numpy.html#ak-from-numpy they have a bool to pick between putting special layout machinery on top (to be aware of the columns in the dtype) or just blindly treating it like a numpy array. They have come to regret that decision and in 2.0 are going to always force recordarray to be True, as having sometimes "base" types and sometimes structured types was bad.
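
A quick illustration of that flag with awkward 1.x (as in the linked docs; per the above, 2.0 makes the column-aware behavior the only one):

import awkward as ak  # awkward 1.x assumed
import numpy as np

recarr = np.zeros(3, dtype=[("x", "f4"), ("y", "f4")])

# recordarray=True keeps the named columns as an awkward record type
# (the choice 2.0 is making mandatory).
hits = ak.from_numpy(recarr, recordarray=True)
print(hits["x"])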

This also goes to a discussion I had with @danielballan about adding a StructuredArray to tiled to handle this. I'm going to reverse my private comments and be very supportive of this 👍🏻. Although you can make the case that at the c/c++ level "base" arrays and "structured arrays" are very similar (it is all pointers, stride math, and maybe a cast), at the semantic (and Python) level they are very different and should be treated as such.

We want to keep them distinct from DataFrames because once you go to a DataFrame you are locking yourself into 1 primary axis / index and lots of assumptions about by-block columnar access. In the case of a record array we want to be able to access Nd chunks. It happens that all of the examples that we have gone through in this thread are 1-D at the outer array level, but that may not always be the case (think a 2D fly scan with a 2D array of (x, y, I) in each event and then running an outer step scan around that (think xanes mapping or fluorescence tomography (or xanes tomography for the very patient))).

untzag commented 3 years ago

Thanks for your further explanation @tacaswell. It seems like you reject option 4 (expand each column in the table to a field on the top-level event) because it restricts Bluesky to a single flat set of keys per event. The nested "event field"/"array field" structure proposed in option 2 is preferable because it retains useful information about the relationship between arrays. That makes sense to me. If the complexity is truly present in the data we shouldn't try to hide it by forcing a simpler data structure.

@ksunden and I have solved similar problems by inserting extra axes such that the broadcast rules tell us all that we need to know about array relationships. However that only works for well-behaved data, I think Bluesky looks to support more complex cases including fully asynchronous & unstructured stuff.

Can we still "make the easy things easy" after this change? My personal focus is on creating an orchestration and analysis layer that "just works" for the very simple experiments we undertake. I worry that some of the live visualization and processing will not be compatible with Structured Arrays. Will hints remain on the event field level? How will BestEffortCallback use this extra namespace layer, or will Structured Arrays simply be ignored?

untzag commented 3 years ago

(a small, rough thought)

When combining data from multiple devices into a single event document, Bluesky currently prepends the device name to create a totally flat namespace. The information about which device sourced the data is stored in object keys [1]. I remember not liking this approach when I first tried to access data in Bluesky. Anyway, perhaps it's useful to compare and contrast this behavior with the behavior proposed here, since both seem to boil down to trying to preserve useful structure without adding complexity.

[1] https://blueskyproject.io/bluesky/event_descriptors.html#object-keys

tacaswell commented 3 years ago

Will hints remain on the event field level?

That is a very good question. I'm optimistic we can find some way to spell (optionally) digging down, but it is not obvious to me yet.

Can we still "make the easy things easy" after this change?

Yes, this should be mostly opt-in functionality. The place where it is not optional is that if you have document consumers that only look at dtype, then there is a chance that something with a structured array will make it through and explode. However, this is also currently true if an array of unusual type (strings or objects) were to make it through, so the addition of the detailed dtype allows you to nope out of trying to handle data you do not know how to handle / expect (very much like LiveTable drops anything that is not a scalar on the floor).
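
As a sketch, a dtype-only consumer could use the new key to opt out (the helper here is hypothetical):

def plottable_fields(descriptor_doc):
    # Hypothetical consumer-side filter: keep only simple scalar fields and
    # drop anything whose detailed_dtype reveals structure, much as LiveTable
    # already drops non-scalars on the floor.
    keep = []
    for name, key in descriptor_doc["data_keys"].items():
        structured = isinstance(key.get("detailed_dtype"), list)
        if key["dtype"] in ("number", "integer") and not structured:
            keep.append(name)
    return keep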

When combining data from multiple devices into a single event document Bluesky does currently prepend device name to create a totally flat namespace

The prepending happens in the ophyd object, rather than in the RunEngine (the dual use of 'bluesky' is extra confusing here). We did not want to force any naming scheme on the keys, but also knew that we needed them to be unique when multiple devices were read. Grabbing {highest_parent.name}_{attrname}_{attrname} as the default name on init was a good balance of "probably process-unique" (when paired with the pattern dev = Device(...., name='dev')) and "human readable". However, the names of every device and component can be changed at run time (by obj.child.child.name = 'bob'), so if the user wants to set themselves up for name collisions they can ;)

The existence of the object-keys mapping in the descriptor is so that you can reconstruct what was in any given device without relying on the heuristics of the names (you probably should have systematic names, but that is for the humans not the computers).

The object keys mapping is a way out of my "two sets of variable-length arrays" problem above, at the cost of saying that the ambiguous sets can not come from the same device. However, that feels a bit wrong to me as it is adding extra constraints on what the shapes of the values of the readings are.

tacaswell commented 3 years ago

So, while trying to write a json schema to limit the detailed dtype, it turns out that opening the door to numpy structured datatypes opens the door to infinitely deep structured data:

import numpy as np

np.dtype(
    [
        (
            "a",
            [
                ("b", "u1"),
                ("c", "f4"),
            ],
        ),
        ("d", "c16"),
    ]
)

This is a datatype with 2 fields {'a', 'd'} and the 'a' field has two fields {'b', 'c'} 🙃

I think I am comfortable saying "you get one (1) level of structure in event model, if you think you need more let's talk" as a) I really do not want to open the door to infinitely deep structures and b) it looks like awkward does not either.
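
A sketch of that "one level only" rule as a Python-side check (to complement the json schema):

def is_one_level(descr):
    # descr is a numpy-style list of (name, type[, shape]) tuples; another
    # list in the type position means a nested struct, which we disallow.
    return all(isinstance(field[1], str) for field in descr)

assert is_one_level([("x", "f4"), ("y", "f4"), ("intensity", "f8")])
assert not is_one_level([("a", [("b", "u1"), ("c", "f4")]), ("d", "c16")])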

tacaswell commented 3 years ago

In a bit more digging, motivated by the assumption that this is a Problem that Someone has solved, it turns out numpy (https://numpy.org/doc/stable/reference/arrays.interface.html#type-description-examples) and cpython (https://www.python.org/dev/peps/pep-3118/ / https://docs.python.org/3/library/struct.html) have both solved this problem. There is some overlap between the two, but they are not identical. There is code at the numpy c level to generate the pep3118-compatible string and a private Python function to build a dtype from a pep3118 spec. While there is no public API for converting between the two, you can use public machinery to go between them:

In [63]: np.ones([0], np.dtype([('a', float), ('b', int)]))
Out[63]: array([], dtype=[('a', '<f8'), ('b', '<i8')])

In [64]: memoryview(np.ones([0], np.dtype([('a', float), ('b', int)]))).format
Out[64]: 'T{d:a:l:b:}'

In [65]: np.array(memoryview(np.ones([0], np.dtype([('a', float), ('b', int)])))).dtype
Out[65]: dtype([('a', '<f8'), ('b', '<i8')])

With numpy dtypes it is possible to define dtypes with overlapping or out-of-order fields; however, this is not describable with either the descr or pep3118.


It appears that the data-api coalition has not taken on structured data yet: https://data-apis.org/array-api/latest/API_specification/data_types.html


I think the numpy spelling is easier for humans to read, is better documented outside of our projects (the T{...} format and the :name: syntax are only documented in the pep, not in the CPython docs), and is part of the __array_protocol__ family which is supported by other array libraries.

The pro I see for the pep3118 style string is that we can get away with only 1 string.


I think the options for spelling this are:

  1. pep3118 style as 1 extra key
  2. numpy __array_protocol__ style with 2 keys (one always dt.str (a string) and one always dt.descr (a nested list of lists of strings and lists))
  3. numpy __array_protocol__ style with 1 key which is a string when the type is a built-in type and the recursive list-of-lists when it is structured.

Despite the added verbosity, I think that option (2) is the best due to the type stability (thinking a bit ahead to wanting to consume this into c++/js/java).
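
Concretely, the two keys in option (2) would carry numpy's dt.str and dt.descr, which round-trip:

import numpy as np

dt = np.dtype([("x", "f4"), ("y", "f4"), ("intensity", "f8")])

print(dt.str)    # '|V16'  -- always a plain string
print(dt.descr)  # [('x', '<f4'), ('y', '<f4'), ('intensity', '<f8')]

# The descr list-of-lists is enough to rebuild the dtype on the consumer side.
assert np.dtype(dt.descr) == dt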


One thing that we can not directly encode using any of these schemes is "subdtype", which is a way for a dtype to control the last dimensions of an array. However, I do not think that this is actually a problem because when you make an array with a dtype that is a subdtype, the resulting array absorbs the extra dimensions and reports its dtype as the base dtype of the subdtype:

In [116]: np.zeros((2, 2), np.dtype('3i')).shape
Out[116]: (2, 2, 3)

In [123]: np.zeros((2, 2), np.dtype('3i')).dtype
Out[123]: dtype('int32')

In [124]: np.zeros((2, 2), np.dtype('3i')) == np.zeros((2, 2, 3), 'i')
Out[124]: 
array([[[ True,  True,  True],
        [ True,  True,  True]],

       [[ True,  True,  True],
        [ True,  True,  True]]])

In [125]: np.zeros((2, 2), np.dtype('3i')).dtype == np.zeros((2, 2, 3), 'i').dtype
Out[125]: True

In [126]: np.zeros((2, 2), np.dtype('3i')).shape == np.zeros((2, 2, 3), 'i').shape
Out[126]: True

thomascobb commented 2 years ago

I think all the flyscan use cases I have can be solved without structured data, but I guess it would be useful to have this as an escape hatch.

For example, consider a 512x512 detector that produces data at 10kHz, then a PandABox that produces X, Y, T for each of those events. For the detector, I would produce a single event page once a second of shape (~10000, 512, 512). It's approximately 10000 frames as different detectors have different readout rates, so I'd rather not wait for exactly 10000 frames. For the PandABox, it produces its data in a row-major format, so I could either produce an event page of shape (~10000, 3) in native format (which would need your structured data changes), or unpack into 3 event pages of shape (~10000,). We currently intend to do the latter as it maps better to an HDF file. I would be inclined to keep on doing that, as it means that if we produce X and Y on a different PandABox to T, it would be transparent to Analysis as they would be 3 different streams.
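
For concreteness, the two packings of the PandABox data would look roughly like this (sizes illustrative):

import numpy as np

n = 10_000  # roughly one second of 10 kHz readings

# Native row-major packing: one field per event page, which needs the
# structured-dtype change (shape [~10000] with a detailed_dtype of x, y, t).
panda_native = np.zeros(n, dtype=[("x", "f4"), ("y", "f4"), ("t", "u8")])

# Unpacked packing: three plain columns -> three fields of shape [~10000],
# no change to the model and a natural fit for three HDF5 datasets.
x, y, t = panda_native["x"], panda_native["y"], panda_native["t"]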

Bilchreis commented 8 months ago

Is this feature still under active development? I would be very interested in having support for structured data types.

danielballan commented 8 months ago

It looks like the formal specification languished in https://github.com/bluesky/event-model/pull/215, but we are in fact using dtype_str on the experimental floor at NSLS-II to place structured data in Bluesky documents.

Bilchreis commented 8 months ago

Thanks Daniel, I'll conform to the spec in #215 for now.