bluesky / event-model

data model for event-based data collection and analysis
https://blueskyproject.io/event-model
BSD 3-Clause "New" or "Revised" License

Add detailed data types #215

Open tacaswell opened 2 years ago

tacaswell commented 2 years ago

Description

This is, modulo a large and possibly unneeded re-organization and some bug fixes (the first 5 commits are not strictly needed for this work), the implementation of #214.

The consensus I reached in #214 is to use the numpy dtype.str and dtype.descr as 2 additional keys, which gives us enough information to identify both "built in" types and structured types using a pre-existing scheme. This was picked over the PEP 3118 string formatting due to the wider adoption and better documentation of the numpy scheme over the PEP scheme. 2 keys were chosen over 1 key of variable type to avoid the type instability. There may be a case that the descr field should be doubly optional (we must have 'dtype', we may have a 'dtype_str', and if we have a 'dtype_str' we may also have a 'dtype_descr').
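To make the two keys concrete, here is what numpy reports for a "built in" scalar type versus a structured type (the values shown assume little-endian hardware):

```python
import numpy as np

# A "built in" scalar type: dtype.str alone is enough to identify it.
scalar = np.dtype("float64")
print(scalar.str)    # '<f8' on little-endian hardware
print(scalar.descr)  # [('', '<f8')]

# A structured type: dtype.str is an opaque void type, so dtype.descr
# is needed to carry the field names and formats.
structured = np.dtype([("a", "u1"), ("b", "f8")])
print(structured.str)    # '|V9' -- not enough on its own
print(structured.descr)  # [('a', '|u1'), ('b', '<f8')]
```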

The rules for getting back to the numpy dtype are:

which is fiddly, but I think acceptable. It may be possible to get more inside the head of np.dtype by passing both the str and the descr to some function in numpy and letting it sort things out, but I have not found that function yet.
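As a sketch of the reconstruction (the key names `dtype_str` / `dtype_descr` and the precedence rule here are my assumptions about the proposed scheme, not the final spec):

```python
import numpy as np

def dtype_from_keys(dtype_str=None, dtype_descr=None):
    """Illustrative sketch: rebuild a numpy dtype from the proposed keys.

    Assumes dtype_descr, when present, is a plain descr list with no
    padding fields; real documents may need more careful handling.
    """
    if dtype_descr is not None:
        # Structured types: the descr list is itself a valid dtype spec.
        return np.dtype([tuple(field) for field in dtype_descr])
    if dtype_str is not None:
        # Built-in types: the array-interface string is sufficient.
        return np.dtype(dtype_str)
    raise ValueError("need at least one of dtype_str / dtype_descr")

print(dtype_from_keys(dtype_str="<f8"))
print(dtype_from_keys(dtype_descr=[("a", "|u1"), ("b", "<f8")]))
```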

There is more information in the `__array_interface__` bundle, like the offsets or padding, that we are not capturing here because that is a hardware-dependent detail and not machine-invariant structure. That is, from the point of view of the event model, [('a', 'u1'), ('b', 'f8')] with the float aligned to the byte boundary or to the 8-byte boundary are "the same". Describing the exact in-memory layout should be left to a library (like tiled!) that handles serialization / communication between processes.
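This equivalence can be demonstrated directly: numpy exposes padding in descr as unnamed void fields, and dropping them recovers the machine-invariant view (a sketch, assuming the event model ignores unnamed padding fields):

```python
import numpy as np

# The same logical structure, packed vs. aligned in memory.
packed = np.dtype([("a", "u1"), ("b", "f8")])
aligned = np.dtype([("a", "u1"), ("b", "f8")], align=True)

print(packed.itemsize, aligned.itemsize)  # 9 vs 16: layouts differ
print(packed.descr)   # [('a', '|u1'), ('b', '<f8')]
print(aligned.descr)  # padding appears as an unnamed ('', '|V7') field

# Dropping unnamed padding fields recovers the machine-invariant view:
def strip_padding(descr):
    return [field for field in descr if field[0]]

assert strip_padding(packed.descr) == strip_padding(aligned.descr)
```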

Related: given the above discussion one could argue that we should be dropping the endianness of the data (as that is the poster child for machine-dependent details!), but I think carrying around a bit of "too detailed" information is an acceptable price for not having to invent and describe a variation on the numpy scheme that ignores the endianness.

Motivation and Context

Closes #214

How Has This Been Tested?

Docs

Need to edit and migrate my ranting in #214 to the docs.

cross project work