AllenNeuralDynamics / aind-data-schema

A library that defines AIND data schema and validates JSON.
MIT License
23 stars 16 forks source link

Lists vs Dictionaries when using `List[Model]` #609

Open bruno-f-cruz opened 11 months ago

bruno-f-cruz commented 11 months ago

A common pattern in the current schema is to use a List of generic objects to keep track of such objects. For instance, cameras can be expressed as follows:

https://github.com/AllenNeuralDynamics/aind-data-schema/blob/c8aca0341d86efc82aa32ba0255b1f66287a4a3c/src/aind_data_schema/rig.py#L64C1

While easy to use, Lists make grabbing a specific device somewhat difficult without parsing all objects since they must be indexed by the order (which can easily change across days) or the experimenter must generate somesort of keyed collection type with an unique key (e.g. device name).

I wonder if rather than using generic unsorted lists, it would be beneficial to simply use some sort of hashed dictionary structure in the form of Dict[str, CameraAssembly] instead of List[CameraAssembly]. This would make it much easier to access the objects after deserialization.

bruno-f-cruz commented 11 months ago

Something along these lines:

https://github.com/pydantic/pydantic/issues/726#issuecomment-519276108

bruno-f-cruz commented 10 months ago

Relatedly, any specific reason for the name field being optional?

https://github.com/AllenNeuralDynamics/aind-data-schema/blob/e92f039d1390b13e6a95f0b793d8fdc0a131a77e/src/aind_data_schema/device.py#L93C1-L94C1

This creates a situation where even if I wanted to create the dictionary post-hoc, using the array of objects, not only do I have to deal with potential duplicated keys (which it would still be important to solve), but I also have to deal with objects without a name field.

I guess this my be related to #620

saskiad commented 10 months ago

yep, we already have a ticket to fix this.

bruno-f-cruz commented 10 months ago

If you have a chance, can you also comment on the original issue? And why Lists > Dict at this point? Just want to understand if it is something we want to think about, or simply drop it. Thanks!

saskiad commented 10 months ago

we'll discuss it when we've finished the v2 update

bruno-f-cruz commented 9 months ago

Can we revisit this discussion now that v2 has been merged? I am at a point where having this implemented would save us a lot of headaches down the road when instantiating hardware from the schema. Thanks!

bruno-f-cruz commented 9 months ago

I made a quick test of what this could look like here:

https://github.com/AllenNeuralDynamics/aind-data-schema/pull/704

I used additional_devices a test case. The proposed solution is backward compatible I think

dyf commented 9 months ago

I'm uncomfortable with making the name implicit as a key, and then having to replicate it inside the field itself, but only sometimes. Is the right solution just to make name required? @bruno-f-cruz @saskiad

bruno-f-cruz commented 9 months ago

To clarify, I don't think the name must be the same as the key. However, for backward compatibility, I decided to use the name as a key when constructing the dictionary so people could migrate from List->Dict painlessly. If people define it as a Key:Value pair directly, I don't see why name must be the key. I would be happy to consider other default conversations from List->Dict

bruno-f-cruz commented 8 months ago

After a bit of playing around with the schema, I really think that moving to dictionaries is going to make a lot of the schema grammar much simpler.

For instance, currently in order to pass a DaqChannel to the Wheel class, one must know the index of that channel in the device. This creates a huge problem as the order of devices in the array is not stable and can easily change if a channel is deleted. Moreover, it feels a bit clunky to have to define the device name in each channel since, if a channel belongs to a device, the name of the device should be a given.


import os
import aind_data_schema.models.devices as d
from aind_data_schema.models.manufacturers import Manufacturer

wheel_encoder_channel = d.DAQChannel(
    device_name='BehaviorBoard',
    channel_name='DI0', #This should be an array
    channel_type=d.DaqChannelType.DI,
    event_based_sampling=True,
)

wheel_break_channel = d.DAQChannel(
    device_name='BehaviorBoard',
    channel_name='DO0',
    channel_type=d.DaqChannelType.DO,
    event_based_sampling=True,
)

wheel_torque_channel = d.DAQChannel(
    device_name='BehaviorBoard',
    channel_name='AI0',
    channel_type=d.DaqChannelType.AI,
    event_based_sampling=True,
)

harp_board = d.HarpDevice(
    name='BehaviorBoard',
    serial_number=None,
    # manufacturer=Manufacturer.CHAMPALIMAUD, # this crashes
    data_interface=d.DataInterface.USB,
    computer_name=os.getenv('COMPUTERNAME'),
    channels=[wheel_encoder_channel, wheel_break_channel, wheel_torque_channel],
    harp_device_type=d.HarpDeviceType.BEHAVIOR,
    is_clock_generator=False,
    port_index='COM3')

wheel = d.Wheel(
    name='ThisWheel',
    encoder=harp_board,
    encoder_output=harp_board.channels[0],
    magnetic_brake=harp_board,
    brake_output=harp_board.channels[1],
    torque_sensor=harp_board,
    torque_output=harp_board.channels[2],
    radius=0,
    width=0,
    pulse_per_revolution=0,
)

My suggestion for a new pattern includes:

  1. Make the List of channels a Dict. This circumvents two problems. a) Ensures that all entries are unique; b) Allows stable indexing by key
  2. Adding channels to the Dict is done through helper methods (add and remove). Ideally, the setter for the property would also be disabled. This ensures that users cannot simply modify the property without validation. Read more about this pattern here.
  3. Delete the reference to the device from peripherals (such as Wheel, since the DAQ channel already has this valid reference.

The aforementioned pattern would thus become:

import os
from typing import Dict, Literal, Union, List

import aind_data_schema.models.devices as d
from pydantic import Field
from aind_data_schema.models.manufacturers import Manufacturer
from aind_data_schema.models.units import SizeUnit

class NewHarpDevice(d.HarpDevice):
    channels: Dict[str, d.DAQChannel] = Field(default_factory=dict)

    def add_channel(self, channel: Union[d.DAQChannel, List[d.DAQChannel]]):
        if isinstance(channel, list):
            for c in channel:
                self.add_channel(c)
        else:
            channel = self._validate_daq_channel(channel)
            self.channels[channel.channel_name] = channel

    def remove_channel(self, channel_name: str):
        if channel_name in self.channels:
            del self.channels[channel_name]
        else:
            raise ValueError(f"Channel {channel_name} does not exist")

    def _validate_daq_channel(self, channel: d.DAQChannel) -> d.DAQChannel:
        if not channel.device_name:
            channel.device_name = self.name
        if channel.device_name != self.name:
            raise ValueError(f"Device name {channel.device_name} does not match {self.name}")
        if channel.channel_name in self.channels:
            raise ValueError(f"Channel {channel.channel_name} already exists")
        return channel

class NewWheel(d.MousePlatform):
    device_type: Literal["Wheel"] = "Wheel"
    radius: float = Field(..., title="Radius (mm)")
    width: float = Field(..., title="Width (mm)")
    size_unit: SizeUnit = Field(SizeUnit.MM, title="Size unit")
    encoder_output: d.DAQChannel = Field(None, title="Encoder DAQ channel")
    pulse_per_revolution: int = Field(..., title="Pulse per revolution")
    brake_output: d.DAQChannel = Field(None, title="Brake DAQ channel")
    torque_output: d.DAQChannel = Field(None, title="Torque DAQ channel")

wheel_encoder_channel = d.DAQChannel(
    device_name='',
    channel_name='DI0', #This should be an array
    channel_type=d.DaqChannelType.DI,
    event_based_sampling=True,
)

wheel_break_channel = d.DAQChannel(
    device_name='',
    channel_name='DO0',
    channel_type=d.DaqChannelType.DO,
    event_based_sampling=True,
)

wheel_torque_channel = d.DAQChannel(
    device_name='',
    channel_name='AI0',
    channel_type=d.DaqChannelType.AI,
    event_based_sampling=True,
)

harp_board = NewHarpDevice(
    name='BehaviorBoard',
    serial_number=None,
    # manufacturer=Manufacturer.CHAMPALIMAUD, # this crashes
    data_interface=d.DataInterface.USB,
    computer_name=os.getenv('COMPUTERNAME'),
    harp_device_type=d.HarpDeviceType.BEHAVIOR,
    is_clock_generator=False,
    port_index='COM3')

harp_board.add_channel(wheel_encoder_channel) # Notice that once this method runs, it modifies the original object with the name of the DaqDevice.
harp_board.add_channel(wheel_break_channel)
harp_board.add_channel(wheel_torque_channel)

new_wheel = NewWheel(
    name='ThisWheel',
    radius=0,
    width=0,
    pulse_per_revolution=0,
    encoder_output=wheel_encoder_channel, # just to show that the object is modified and can be used as a reference
    brake_output=harp_board.channels['DO0'],
    torque_output=harp_board.channels['AI0'],
)

Finally, I would strongly advise to adopt a name for all these objects. It is a bit weird that some have name, while others have an identical property but called channel_name or assembly_name.