SainsburyWellcomeCentre / aeon_mecha

Project Aeon's main library for interfacing with acquired data. Contains modules for raw data file io, data querying, data processing, data qc, database ingestion, and building computational data pipelines.
BSD 3-Clause "New" or "Revised" License
3 stars 5 forks source link

Clashes in code terminology #293

Open jkbhagatio opened 7 months ago

jkbhagatio commented 7 months ago

Currently in Aeon terminology, I believe all of the following are referred to as 'streams':

  1. A single file type on ceph.
    (e.g. each of the Patch1_90*, CameraTop_200*, and Environment_SubjectVisits* files)

  2. A dictionary pairing a name and a Reader object.
    (e.g. the Device class has a stream attribute that contains a dict of this type of pairing that could look something like {'EnvironmentState': <aeon.io.reader.Csv object at 0x7f22a95ee650>} )

  3. A function that returns the above dictionary.
    (e.g. the functions in aeon/schema/core.py https://github.com/SainsburyWellcomeCentre/aeon_mecha/blob/main/aeon/schema/core.py)

Similarly, 'schema' refers to both:

  1. A set of 'streams' (2nd definition above) grouped into a list of Device objects within a DotMap object.

  2. A Python module that contain 'streams' (3rd definition above) (e.g. aeon/schema/core.py, the file itself)

Due to this confusion, I'd suggest renaming some of these terms:

I'd call the 2nd definition of 'stream' a 'registry' (it implies a registry of Reader objects that can be looked up by name).

I'd call the 3rd definition of 'stream' a 'binder' function (it implies creating the dictionary that binds a name to a Reader object)

I'd call the 2nd definition of 'schema' a 'reader_binder' module (it implies that the module contains 'binder' functions for Reader objects).
Subsequently, I'd move aeon/schema/ to aeon/io/reader_binder/

glopesdev commented 7 months ago

@jkbhagatio I'm not sure I follow the rationale and unless this is causing some fundamental issues I would avoid introducing breaking changes, since all the examples we have so far would stop working.

Do you have maybe a specific example of how this can be a problem? Maybe looking over some concrete code would help me to better understand the issue.

glopesdev commented 7 months ago

I mean, it is very typical in object oriented design if you have a property that is a collection of "Stream" objects, to call that property "streams".

Same for modules that manipulate entities with the same name as the module, e.g. from datetime import datetime, from scipy.fft import fft, etc.

I feel it will make the code more obscure and harder to learn if we introduce "catch-all" terms such as "registry" or "binder" which don't immediately refer to anything in Aeon terminology.

jkbhagatio commented 7 months ago

I've noticed when explaining Aeon data architecture conventions to others, current terminology has caused some confusion - specifically when the same name refers to multiple meanings, as mentioned here. I've been writing a tutorial going over Aeon data architecture which will be useful

  1. For devs as a reference
  2. For new joiners
  3. For the paper

Happy to go over this to show how using distinct terminology makes things a bit clearer for at least myself and a couple others.

To address a couple comments:

I would avoid introducing breaking changes, since all the examples we have so far would stop working.

Actually the refactoring for this in aeon_mecha and aeon_analysis was quite quick. For any other floating scripts, I imagine it wouldn't be hard to make the switch - it's just a name change on an import typically.

I mean, it is very typical in object oriented design if you have a property that is a collection of "Stream" objects, to call that property "streams".

Indeed, but in this case there is conflation between a software object and a file / data source. We typically refer to the former as a 'Reader' object, and the latter as a 'stream.' A Device object for instance contains a collection of Reader objects, so I think it would be nice to have a name that reflects this.

glopesdev commented 7 months ago

Indeed, but in this case there is conflation between a software object and a file / data source

I don't understand this issue. Files / data sources are always represented as software objects in every API that I know of. What exactly do you mean here by "software object"? I don't think we represent the file streams in conflicting ways anywhere in the aeon_mecha api, so I'm still not sure where the confusion comes from. Access to individual file streams is always performed via the streams property in the Device object or a reader object created manually, but in both cases they are all instances of the same base type (Reader).

We could in theory rename Device to DeviceReader but this would add no extra information since all of this api is about low-level file access anyway. I see the low-level API as representing exactly the folder structure that is part of the data contract, so if anything it should help that the concepts map to the same thing, since they are supposed to represent the same thing.

Happy to go over this to show how using distinct terminology makes things a bit clearer for at least myself and a couple others.

Sure, I would also be happy to go over this, if anything to ensure that everyone understands how the object model was designed and meant to be used. I feel we need to be careful of refactoring the whole API for the purposes of explaining. It's not clear to me that "streams" should be restricted to just files on disk. A video stream being acquired in Bonsai is still a "stream", even though it's not a file yet.

Maybe the confusion here is coming from conflating storage details with the concept of a stream. A stream is a sequence of data, and it can be live, frozen in a file, stored in a DB, or generated procedurally, any of these would still be a stream, it's not really a problem, and in fact being storage-independent is usually one of the most important properties of streams whenever I see them included in software data architectures.

Actually the refactoring for this in aeon_mecha and aeon_analysis was quite quick.

I think breaking old scripts is always problematic, no matter how tiny or quick the change is, and especially in a dynamic language like python. We designed this API together over quite some time and with a lot of discussion about exactly these points, so I think we should probably sit and walk through that design process again carefully to review and understand what exactly the problem is.

jkbhagatio commented 6 months ago

Linking this notebook here as relevant for this convo: https://github.com/SainsburyWellcomeCentre/aeon_mecha/blob/social_schema/docs/examples/understanding_aeon_data_architecture.ipynb