bluesky / event-model

data model for event-based data collection and analysis
https://blueskyproject.io/event-model
BSD 3-Clause "New" or "Revised" License

Add specific data type #308

Open danielballan opened 4 months ago

danielballan commented 4 months ago

EventDescriptors specify the data type of the data in Events and StreamResources in the key dtype, using jsonschema data types:

https://github.com/bluesky/event-model/blob/145118722a2e2f15abf32010e3e6b71506398140/event_model/schemas/event_descriptor.json#L47-L53

This is the tragedy of, "You must define the core of your software at the beginning, when you understand the problem the least." At the time (2015) we were focused on MongoDB (bson) and Python applications, where data types can be coarsely defined. We now view this as a mistake. The array option in particular does not make sense: we have shape for that. We should have used specific types.
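To illustrate how coarse the current categories are, here is a hypothetical sketch (not event-model code) of the collapse from concrete NumPy dtypes down to jsonschema-style categories like those in the enum linked above. Note how precision and byte order are lost:

```python
import numpy as np

def coarse_dtype(dt: np.dtype) -> str:
    """Collapse a concrete NumPy dtype into a jsonschema-style category
    (a hypothetical mapping for illustration, not event-model's code)."""
    if dt.kind in "iu":   # signed and unsigned integers
        return "integer"
    if dt.kind == "f":    # all floating-point widths
        return "number"
    if dt.kind == "b":
        return "boolean"
    if dt.kind in "SU":   # bytes and unicode strings
        return "string"
    raise ValueError(f"unsupported dtype: {dt}")

# float32 and float64 both collapse to "number": the precision is lost.
print(coarse_dtype(np.dtype("float32")))  # number
print(coarse_dtype(np.dtype("float64")))  # number
```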

How should we add them now?

Decision: New key or expand dtype enum?

If we expand the dtype enum to optionally specify a specific data type instead of the jsonschema types, this could break downstream consumers (some in code that we do not know about) that have been able to expect the jsonschema types for the last ~8 years. It seems safest to add a new key sitting beside dtype. In "Bluesky 2.0" this could be cleaned up / consolidated and documented as a backward-incompatible change.
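Concretely, the additive approach might look like the following data-key sketch (the key name dtype_numpy is one of the candidates discussed in this thread, not a settled choice, and the source value is made up):

```python
# Hypothetical data-key entry: a new, specific key sits beside the legacy
# jsonschema-style dtype, so existing consumers are unaffected.
data_key = {
    "source": "PV:XF:31ID-EXAMPLE",  # made-up source for illustration
    "dtype": "number",       # legacy jsonschema category, unchanged
    "dtype_numpy": "<f8",    # new specific type; old consumers can ignore it
    "shape": [],
}
```

Old consumers keep reading dtype exactly as they have for the last ~8 years; new consumers that understand the extra key get the precise type.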

Decision: How to spell the data type?

Three ideas have been proposed:

  1. Use the NumPy array protocol type string (typestr) format, e.g. "<f8", ">i4", "|b1". There is precedent for using this as a way of encoding numpy data types in JSON: the Zarr v2 spec does so.
  2. Use the newer Zarr v3 specification, which opts for a more constrained set of supported types with more human-readable names, e.g. float64, int8, bool. Types are little-endian. Big-endianness is handled as a property of the encoding (a codec).
  3. Use Arrow, which supports a superset of these types. However, Arrow has no officially-supported JSON encoding. Its schema is binary; it would have to be base64-encoded or similar---not human-readable. For that reason, I think it is easy to reject this option.
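For reference, option 1's typestr format is a byte-order character followed by a kind code and an item size in bytes, and the strings round-trip through np.dtype:

```python
import numpy as np

# NumPy array-protocol typestr strings: byte order, kind, item size in bytes.
print(np.dtype("float64").str)  # "<f8" on little-endian platforms
print(np.dtype(">i4").str)      # ">i4" (explicit big-endian is preserved)
print(np.dtype("bool").str)     # "|b1" (single byte, so order is irrelevant)

# The strings round-trip: np.dtype() reconstructs the exact type.
assert np.dtype(np.dtype("float64").str) == np.dtype("float64")
```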

As of May 9, Zarr v3 is still just a specification, with a Python implementation still in progress, so it feels a bit early to hitch our wagon to that standard. The NumPy strings, while not exactly a "specification", have been around a long time and are unlikely to change. My (loosely-held) view is that we should use NumPy strings but leave open the possibility of adopting something different, hopefully something formally specified, in the future.

Decision: What to call the new key?

Ideas proposed so far include datatype and dtype_numpy.

I think having both dtype (jsonschema legacy) and datatype (new thing) together would be confusing. (I would be in favor of consolidating on something like datatype in Bluesky 2.0.)

One advantage of something specific like dtype_numpy is that it would let us add Zarr v3 or something else in the future unambiguously.

Status Quo

On the floor at NSLS-II, we have been using the key dtype_str and the Numpy typestr spellings. This solves the practical problem that Tiled needs to know the real data types in order to inform clients so they can pre-allocate numpy or dask arrays to download chunks of data into.
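The pre-allocation use case can be sketched as follows (a sketch of the idea, not Tiled's actual API; the shape and type string are made up):

```python
import numpy as np

# A client that knows the exact dtype and shape up front can allocate the
# destination array once and fill it chunk by chunk as data arrives.
dtype_str = "<f8"          # as recorded under the ad hoc dtype_str key
shape = (10, 512, 512)

dest = np.empty(shape, dtype=np.dtype(dtype_str))
# ... download chunks and copy each into dest[...] ...
print(dest.nbytes)  # 20971520 (10 * 512 * 512 * 8 bytes)
```

With only the jsonschema-level "number", the client could not know whether to allocate 4 or 8 bytes per element.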

But dtype_str was never added to event-model or formally decided. The goal of this issue is to make a decision and add something to the event-model schema.

coretl commented 4 months ago

Suggest dtype_numpy, and have str as the JSON schema type. This means that if you put garbage in, the JSON schema says everything is fine and something downstream might fall over with no early warning, but that is probably ok
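A quick sketch of that trade-off (simulating the schema's string check with isinstance rather than an actual validator):

```python
import numpy as np

# A schema of {"type": "string"} accepts any string, including garbage:
good = {"dtype_numpy": "<f8"}
bad = {"dtype_numpy": "not-a-dtype"}
assert isinstance(good["dtype_numpy"], str)  # passes "validation"
assert isinstance(bad["dtype_numpy"], str)   # also passes "validation"

# Only a downstream consumer that actually parses the string catches it:
np.dtype(good["dtype_numpy"])  # fine
try:
    np.dtype(bad["dtype_numpy"])
except TypeError:
    print("late failure, no early warning")
```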