bluesky / event-model

data model for event-based data collection and analysis
https://blueskyproject.io/event-model
BSD 3-Clause "New" or "Revised" License

Add specific data type #308

Open danielballan opened 4 months ago

danielballan commented 4 months ago

EventDescriptors specify the data type of the data in Events and StreamResources in the key dtype, using jsonschema data types:

https://github.com/bluesky/event-model/blob/145118722a2e2f15abf32010e3e6b71506398140/event_model/schemas/event_descriptor.json#L47-L53

This is the tragedy of, "You must define the core of your software at the beginning, when you understand the problem the least." At the time (2015) we were focused on MongoDB (bson) and Python applications, where data types can be coarsely defined. We now view this as a mistake. The array option in particular does not make sense: we have shape for that. We should have used specific types.
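To illustrate how coarse the current categories are, here is a hypothetical sketch (not event-model code) of the collapse from concrete NumPy dtypes down to jsonschema-style categories like those in the enum linked above. Note how precision and byte order are lost:

```python
import numpy as np

def coarse_dtype(dt: np.dtype) -> str:
    """Collapse a concrete NumPy dtype into a jsonschema-style category
    (a hypothetical mapping for illustration, not event-model's code)."""
    if dt.kind in "iu":   # signed and unsigned integers
        return "integer"
    if dt.kind == "f":    # all floating-point widths
        return "number"
    if dt.kind == "b":
        return "boolean"
    if dt.kind in "SU":   # bytes and unicode strings
        return "string"
    raise ValueError(f"unsupported dtype: {dt}")

# float32 and float64 both collapse to "number": the precision is lost.
print(coarse_dtype(np.dtype("float32")))  # number
print(coarse_dtype(np.dtype("float64")))  # number
```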

How should we add them now?

Decision: New key or expand dtype enum?

If we expand the dtype enum to optionally specify a specific data type instead of the jsonschema types, this could break downstream consumers (some in code that we do not know about) that have been able to expect the jsonschema types for the last ~8 years. It seems safest to add a new key sitting beside dtype. In "Bluesky 2.0" this could be cleaned up / consolidated and documented as a backward-incompatible change.
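Concretely, the additive approach might look like the following data-key sketch (the key name dtype_numpy is one of the candidates discussed in this thread, not a settled choice, and the source value is made up):

```python
# Hypothetical data-key entry: a new, specific key sits beside the legacy
# jsonschema-style dtype, so existing consumers are unaffected.
data_key = {
    "source": "PV:XF:31ID-EXAMPLE",  # made-up source for illustration
    "dtype": "number",       # legacy jsonschema category, unchanged
    "dtype_numpy": "<f8",    # new specific type; old consumers can ignore it
    "shape": [],
}
```

Old consumers keep reading dtype exactly as they have for the last ~8 years; new consumers that understand the extra key get the precise type.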

Decision: How to spell the data type?

Three ideas have been proposed:

  1. Use the NumPy array protocol type string (typestr) format, e.g. "<f8", ">i4", "|b1". There is precedent for using this as a way of encoding numpy data types in JSON: the Zarr v2 spec does so.
  2. Use the newer Zarr v3 specification, which opts for a more constrained set of supported types with more human-readable names, e.g. float64, int8, bool. Types are little-endian. Big-endianness is handled as a property of the encoding (a codec).
  3. Use Arrow, which supports a superset of these types. However, Arrow has no officially-supported JSON encoding. Its schema is binary; it would have to be base64-encoded or similar---not human-readable. For that reason, I think it is easy to reject this option.
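For reference, option 1's typestr format is a byte-order character followed by a kind code and an item size in bytes, and the strings round-trip through np.dtype:

```python
import numpy as np

# NumPy array-protocol typestr strings: byte order, kind, item size in bytes.
print(np.dtype("float64").str)  # "<f8" on little-endian platforms
print(np.dtype(">i4").str)      # ">i4" (explicit big-endian is preserved)
print(np.dtype("bool").str)     # "|b1" (single byte, so order is irrelevant)

# The strings round-trip: np.dtype() reconstructs the exact type.
assert np.dtype(np.dtype("float64").str) == np.dtype("float64")
```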

As of May 9, Zarr v3 is still just a specification, with a Python implementation still in progress, so it feels a bit early to hitch our wagon to that standard. The NumPy strings, while not exactly a "specification", have been around a long time and are unlikely to change. My (loosely-held) view is that we should use NumPy strings but leave open the possibility of adopting something different, hopefully something formally specified, in the future.

Decision: What to call the new key?

Ideas proposed so far include datatype and dtype_numpy.

I think having both dtype (jsonschema legacy) and datatype (new thing) together would be confusing. (I would be in favor of consolidating on something like datatype in Bluesky 2.0.)

One advantage of something specific like dtype_numpy is that it would let us add Zarr v3 or something else in the future unambiguously.

Status Quo

On the floor at NSLS-II, we have been using the key dtype_str and the Numpy typestr spellings. This solves the practical problem that Tiled needs to know the real data types in order to inform clients so they can pre-allocate numpy or dask arrays to download chunks of data into.
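The pre-allocation use case can be sketched as follows (a sketch of the idea, not Tiled's actual API; the shape and type string are made up):

```python
import numpy as np

# A client that knows the exact dtype and shape up front can allocate the
# destination array once and fill it chunk by chunk as data arrives.
dtype_str = "<f8"          # as recorded under the ad hoc dtype_str key
shape = (10, 512, 512)

dest = np.empty(shape, dtype=np.dtype(dtype_str))
# ... download chunks and copy each into dest[...] ...
print(dest.nbytes)  # 20971520 (10 * 512 * 512 * 8 bytes)
```

With only the jsonschema-level "number", the client could not know whether to allocate 4 or 8 bytes per element.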

But dtype_str was never added to event-model or formally decided. The goal of this issue is to make a decision and add something to the event-model schema.

coretl commented 4 months ago

Suggest dtype_numpy, and have str as the JSON schema type. This means that if you put garbage in, the JSON schema says everything is fine and something downstream might fall over with no early warning, but that is probably ok
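A quick sketch of that trade-off (simulating the schema's string check with isinstance rather than an actual validator):

```python
import numpy as np

# A schema of {"type": "string"} accepts any string, including garbage:
good = {"dtype_numpy": "<f8"}
bad = {"dtype_numpy": "not-a-dtype"}
assert isinstance(good["dtype_numpy"], str)  # passes "validation"
assert isinstance(bad["dtype_numpy"], str)   # also passes "validation"

# Only a downstream consumer that actually parses the string catches it:
np.dtype(good["dtype_numpy"])  # fine
try:
    np.dtype(bad["dtype_numpy"])
except TypeError:
    print("late failure, no early warning")
```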