**Open** · danielballan opened this issue 4 months ago
> Suggestion: use `dtype_numpy`, with `str` as the JSON schema type. This means that if you put garbage in, the JSON schema says everything is fine, and something downstream might fall over with no early warning. But that is probably OK.
EventDescriptors specify the data type of the data in Events and StreamResources in the key `dtype`, using jsonschema data types: https://github.com/bluesky/event-model/blob/145118722a2e2f15abf32010e3e6b71506398140/event_model/schemas/event_descriptor.json#L47-L53
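For context, a data key in an EventDescriptor today might look like the following. This is a minimal sketch: the field names `dtype`, `shape`, and `source` come from the event-model schema, but the values are invented for illustration.

```python
# A minimal, hypothetical data key as EventDescriptors allow today.
# "dtype" is restricted to the coarse jsonschema-style enum:
# "string", "number", "array", "boolean", "integer".
data_key = {
    "source": "PV:example-detector-image",  # invented example source
    "dtype": "array",      # coarse: says nothing about float64 vs uint16
    "shape": [1024, 1024],
}
```

Note how `"array"` tells a consumer nothing about the element type of those 1024x1024 values.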
This is the tragedy of, "You must define the core of your software at the beginning, when you understand the problem the least." At the time (2015) we were focused on MongoDB (BSON) and Python applications, where data types can be coarsely defined. We now view this as a mistake. The `array` option in particular does not make sense: we have `shape` for that. We should have given specific types. How should we add them now?
### Decision: New key, or expand the `dtype` enum?

If we expand the `dtype` enum to optionally specify a specific data type instead of the jsonschema types, this could break downstream consumers (some in code that we do not know about) that have been able to expect the jsonschema types for the last ~8 years. It seems safest to add a new key sitting beside `dtype`. In "Bluesky 2.0" this could be cleaned up / consolidated and documented as a backward-incompatible change.

### Decision: How to spell the data type?
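To make the "new key beside `dtype`" option concrete, a data key could carry both keys, with the legacy key untouched. This is a sketch only: the new key's name is still under discussion, and `dtype_numpy` is used here purely for illustration.

```python
# Hypothetical data key carrying both the legacy coarse type and a
# precise sibling key (name "dtype_numpy" is illustrative, not decided).
data_key = {
    "source": "PV:example-scalar",  # invented example source
    "dtype": "number",              # unchanged legacy jsonschema type
    "dtype_numpy": "<f8",           # precise type for consumers that opt in
    "shape": [],
}

def legacy_view(key):
    # A consumer written before the new key existed sees exactly
    # the fields it always has; the extra key is simply ignored.
    return {k: v for k, v in key.items() if k in {"source", "dtype", "shape"}}
```

The point is backward compatibility: old consumers that only read `dtype` behave identically, while new consumers can opt in to the precise type.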
Ideas that have been proposed:

- **Numpy typestr strings:** `"<f8"`, `">i4"`, `"|b1"`. There is precedent for using these as a way of encoding numpy data types in JSON: the Zarr v2 spec does so.
- **Zarr v3 data type names:** `float64`, `int8`, `bool`. Types are little-endian; big-endianness is handled as a property of the encoding (a codec).

As of May 9, Zarr v3 is still just a specification, with a Python implementation still in progress, so it feels a bit early to hitch our wagon to that standard. The Numpy typestrs, while not exactly a "specification", have been around a long time and are unlikely to change. My (loosely-held) view is that we should use Numpy typestrs but leave open the possibility of adopting something different, hopefully something formally specified, in the future.
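A Numpy typestr packs byte order, kind, and itemsize into one short string (e.g. `numpy.dtype("float64").str` is `"<f8"` on little-endian machines). A pure-Python sketch of what a validator for the new key would have to check; the regular expression and helper below are hypothetical, and the kind codes listed are the common numpy ones:

```python
import re

# byteorder: "<" little-endian, ">" big-endian, "|" not applicable, "=" native
# kind: b bool, i int, u uint, f float, c complex, S bytes, U unicode,
#       m timedelta, M datetime, O object, V void
TYPESTR = re.compile(r"^([<>|=])([biufcSUVmMO])(\d+)$")

def parse_typestr(s):
    """Split a numpy typestr like '<f8' into (byteorder, kind, itemsize)."""
    m = TYPESTR.match(s)
    if m is None:
        raise ValueError(f"not a numpy typestr: {s!r}")
    byteorder, kind, itemsize = m.groups()
    return byteorder, kind, int(itemsize)
```

A regex like this could also serve as a `pattern` constraint in the event-model JSON schema itself, so garbage values would be rejected at validation time rather than downstream.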
### Decision: What to call the new key?

Ideas proposed:

- `dtype_str`
- `dtype_numpy`
- `dtype_zarr2`
- `datatype`

I think having both `dtype` (jsonschema legacy) and `datatype` (new thing) together would be confusing. (I would be in favor of consolidating on something like `datatype` in Bluesky 2.0.) One advantage of something specific like `dtype_numpy` is that it would let us add Zarr v3 or something else in the future unambiguously.

### Status Quo
On the floor at NSLS-II, we have been using the key `dtype_str` and the Numpy typestr spellings. This solves a practical problem: Tiled needs to know the real data types in order to inform clients, so they can pre-allocate numpy or dask arrays to download chunks of data into. But `dtype_str` was never added to event-model or formally decided. The goal of this issue is to make a decision and add something to the event-model schema.
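The pre-allocation point is the crux: the coarse `dtype` cannot tell a client how many bytes to reserve, while a typestr can, since the itemsize is encoded in its trailing digits. A pure-Python sketch (the helper name is invented; in practice a client would simply pass the string to `numpy.dtype` and call `numpy.empty`):

```python
from math import prod

def nbytes(shape, typestr):
    """Bytes needed for an array of `shape` whose element typestr is e.g. '<f8'."""
    itemsize = int(typestr[2:])  # trailing digits of the typestr, in bytes
    return prod(shape) * itemsize
```

For a 1024x1024 `"<f8"` image this gives 8 MiB, exactly what a client must allocate before downloading chunks; `"array"` alone could never tell it that.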