AllenNeuralDynamics / aind-data-schema

A library that defines AIND data schema and validates JSON.
MIT License

Enum-like classes can't be used as dictionary keys #960

Open bruno-f-cruz opened 3 months ago

bruno-f-cruz commented 3 months ago

Not sure if there is a different intended use...

To reproduce:

from aind_data_schema_models.platforms import Platform
from pydantic import BaseModel

class Foo(BaseModel):
    platform: dict[Platform.ONE_OF, str]

foo = Foo(platform={Platform.BEHAVIOR: "foo"})
print(foo)

bar = foo.model_dump()
bruno-f-cruz commented 3 months ago

Tracking https://github.com/pydantic/pydantic/issues/9453

bruno-f-cruz commented 3 months ago

This is actually more serious than I initially gave it credit for. This pattern completely prevents users from using the model_dump method:

import examples
from examples.fip_ophys_rig import rig
from aind_data_schema_models.modalities import Modality
rig.modalities = set([Modality.BEHAVIOR])
rig.model_dump()

raises the following error:

Traceback (most recent call last):
  File ".\aind-data-schema\deserialization_order.py", line 27, in <module>     
    rig.model_dump()
  File ".\aind-data-schema\.venv\Lib\site-packages\pydantic\main.py", line 314, in model_dump
    return self.__pydantic_serializer__.to_python(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: unhashable type: 'dict'
jtyoung84 commented 3 months ago

It seems like an issue with the pydantic serializer. Even this causes issues:

from typing import Set
from pydantic import BaseModel, ConfigDict
from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    value: str

# This works
set([Cell(value="a"), Cell(value="b")])

class WeightedSeeds(BaseModel):
    seeds: Set[Cell]

foo = WeightedSeeds(seeds=set([Cell(value="a"), Cell(value="b")]))

# This raises unhashable type errors
foo.model_dump()
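For what it's worth, one workaround for this minimal repro (a sketch, not a library fix) is to bypass the default set serializer with a plain `field_serializer` that dumps the set as a deterministic list, so the unhashable dumped dicts never need to be re-collected into a set:

```python
from dataclasses import dataclass
from typing import List, Set

from pydantic import BaseModel, field_serializer


@dataclass(frozen=True)
class Cell:
    value: str


class WeightedSeeds(BaseModel):
    seeds: Set[Cell]

    @field_serializer("seeds")
    def _seeds_as_list(self, seeds: Set[Cell]) -> List[dict]:
        # Dump as a sorted list of dicts: deterministic output, and the
        # unhashable dicts never have to go back into a set.
        return [{"value": c.value} for c in sorted(seeds, key=lambda c: c.value)]


foo = WeightedSeeds(seeds={Cell(value="a"), Cell(value="b")})
print(foo.model_dump())  # {'seeds': [{'value': 'a'}, {'value': 'b'}]}
```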
bruno-f-cruz commented 3 months ago

Possible solutions

1- Simplify everything and go back to a native Enum. This is the preferred solution, as it is rather isomorphic across programming languages and JSON.

2- Write our own (de)serializers for a base class to be inherited by all complex Enum types. This strategy has a few problems.

3- Keep a service somewhere that takes a string and returns these complex objects. This is potentially very interesting as it would allow the schema to be decoupled from the specifics of this database. For instance, the schema would only have to keep track of all possible manufacturers without having to keep track of abbreviations, numeric IDs and other properties that might otherwise change with time.
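As a rough illustration of solution 2 (names are hypothetical, not the actual aind-data-schema classes), a frozen base model can collapse itself to a plain string on dump via a `model_serializer`, which makes instances hashable and keeps `model_dump` happy; the hard part, as noted, would be the matching deserializer:

```python
from pydantic import BaseModel, ConfigDict, model_serializer


class HashableNamedModel(BaseModel):
    # frozen=True makes instances hashable, so they can live in sets
    # and be used as dict keys.
    model_config = ConfigDict(frozen=True)

    name: str
    abbreviation: str

    @model_serializer
    def _as_abbreviation(self) -> str:
        # Collapse the object to a plain string on dump; a matching
        # validator would be needed for round-tripping.
        return self.abbreviation


class Behavior(HashableNamedModel):
    name: str = "Behavior"
    abbreviation: str = "behavior"


class Rig(BaseModel):
    modalities: set[HashableNamedModel]


rig = Rig(modalities={Behavior()})
print(rig.model_dump())  # {'modalities': {'behavior'}}
```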

bruno-f-cruz commented 3 months ago

@saskiad @dyf looping you in too. I think this is a rather annoying regression bug (since these started off as an Enum), especially since other AIND tools like the watchdog could benefit greatly from this change.

I will still look into the second solution but it just doesn't feel right. We should really just make all these Enums. It would also solve the weird pattern of typing the properties with Type.ONE_OF but using a different instance (Type.Value).

jtyoung84 commented 3 months ago

Enums would be nice. I can't recall anymore exactly what the issue was that made me switch, but enums of objects were creating headaches somewhere. I do remember now that when we upgraded to v2, my original enum metadata class broke. I went with this solution. I'm open to suggestions for improvements since this has always been troublesome.

I also remember that most of the docs I was reading used tagged unions for this kind of situation, though not for quite the same situation.

bruno-f-cruz commented 3 months ago

@jtyoung84 It's late and I might be missing something very obvious, but what about this?

from aind_data_schema_models.modalities import Modality

from pydantic import BaseModel, Field
from enum import Enum
from typing import Dict, Set

class Modalities(Enum):
    BEHAVIOR = Modality.BEHAVIOR
    OPHYS = Modality.POPHYS
    ICEPHYS = Modality.ICEPHYS

class SubSetOfModalities(Enum):
    BEHAVIOR = Modality.BEHAVIOR
    OPHYS = Modality.POPHYS

class Foo(BaseModel):
    modality: Dict[Modalities, str] = Field(default={Modalities.BEHAVIOR: "foo"})
    stuff: Set[Modalities] = Field(default={Modalities.BEHAVIOR, Modalities.OPHYS})
    another_subset: SubSetOfModalities = Field(default=SubSetOfModalities.BEHAVIOR, validate_default=True)

foo = Foo()
print(foo)
print(foo.model_dump())

This is just a proof of concept; I think we could improve the subtyping a bit. It would be so much cleaner for users too!

To be clear, I still think Enum of strings would be far easier to maintain.
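A minimal sketch of the string-Enum approach (illustrative names, not the real `Modality` registry) shows why it is attractive: members are hashable, usable as dict keys and set members, and round-trip through JSON without any extra machinery:

```python
from enum import Enum

from pydantic import BaseModel


class ModalityName(str, Enum):
    # Illustrative values; the real registry would carry the abbreviations.
    BEHAVIOR = "behavior"
    POPHYS = "ophys"


class Rig(BaseModel):
    modalities: set[ModalityName]
    notes: dict[ModalityName, str]


rig = Rig(
    modalities={ModalityName.BEHAVIOR},
    notes={ModalityName.BEHAVIOR: "foo"},
)

# Round-trips cleanly, unlike the complex-object variants.
restored = Rig.model_validate_json(rig.model_dump_json())
assert restored == rig
```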

bruno-f-cruz commented 3 months ago

I have been playing with this a bit more, and I think we can make this change fully backward compatible with users' code. The trick would be to define the subsets of Modalities as Literals (this is the Python standard in most libraries anyway, so I think it is actually a good idea...).

The previous example would thus become:

from aind_data_schema_models.modalities import Modality

from pydantic import BaseModel, Field
from enum import Enum
from typing import Dict, Set, Literal

class Modalities(Enum):
    BEHAVIOR = Modality.BEHAVIOR
    OPHYS = Modality.POPHYS
    ICEPHYS = Modality.ICEPHYS

class Foo(BaseModel):
    modality: Dict[Modalities, str] = Field(default={Modalities.BEHAVIOR: "foo"})
    stuff: Set[Modalities] = Field(default={Modalities.BEHAVIOR, Modalities.OPHYS})
    another_subset: Literal[Modalities.BEHAVIOR, Modalities.OPHYS] = Field(..., validate_default=True)

foo = Foo(another_subset=Modalities.BEHAVIOR)  # type hint: "another_subset: Literal[Modalities.BEHAVIOR, Modalities.OPHYS]"
print(foo)
print(foo.model_dump())
bruno-f-cruz commented 3 months ago

Ignore the previous two suggestions. I don't think we want to go down this route, as both the JSON schema and round-trip (de)serialization break.

print(foo.model_validate_json(foo.model_dump_json()))

leads to:

pydantic_core._pydantic_core.ValidationError: 4 validation errors for Foo
modality.name='Behavior' abbreviation='behavior'.[key]
  Input should be Behavior(name='Behavior', abbreviation='behavior'), POphys(name='Planar optical physiology', abbreviation='ophys') or Icephys(name='Intracellular electrophysiology', abbreviation='icephys') [type=enum, input_value="name='Behavior' abbreviation='behavior'", input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/enum
stuff.0
  Input should be Behavior(name='Behavior', abbreviation='behavior'), POphys(name='Planar optical physiology', abbreviation='ophys') or Icephys(name='Intracellular electrophysiology', abbreviation='icephys') [type=enum, input_value={'name': 'Planar optical ...'abbreviation': 'ophys'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.7/v/enum
stuff.1
  Input should be Behavior(name='Behavior', abbreviation='behavior'), POphys(name='Planar optical physiology', abbreviation='ophys') or Icephys(name='Intracellular electrophysiology', abbreviation='icephys') [type=enum, input_value={'name': 'Behavior', 'abbreviation': 'behavior'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.7/v/enum
another_subset
  Input should be <Modalities.BEHAVIOR: Behavior(name='Behavior', abbreviation='behavior')> or <Modalities.OPHYS: POphys(name='Planar optical physiology', abbreviation='ophys')> [type=literal_error, input_value={'name': 'Behavior', 'abbreviation': 'behavior'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.7/v/literal_error

It seems pydantic can't automatically convert the JSON object instance back to the corresponding Python class. This could be done with an extra validator, but I still think we should go with an Enum of type string to keep the solution simple.
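For completeness, the "extra validator" route could look roughly like this (hypothetical names, assuming pydantic v2's `Annotated`/`BeforeValidator` pattern): a coercion step that maps the dumped dict back onto the matching enum member before enum validation runs:

```python
from enum import Enum
from typing import Annotated, Any

from pydantic import BaseModel, BeforeValidator, ConfigDict


class ModalityInfo(BaseModel):
    model_config = ConfigDict(frozen=True)
    name: str
    abbreviation: str


class Modalities(Enum):
    BEHAVIOR = ModalityInfo(name="Behavior", abbreviation="behavior")


def _coerce_modality(value: Any) -> Any:
    # Map a dumped {'name': ..., 'abbreviation': ...} dict back onto the
    # matching enum member so enum validation can succeed.
    if isinstance(value, dict):
        for member in Modalities:
            if member.value.abbreviation == value.get("abbreviation"):
                return member
    return value


class Foo(BaseModel):
    modality: Annotated[Modalities, BeforeValidator(_coerce_modality)]


foo = Foo(modality=Modalities.BEHAVIOR)
restored = Foo.model_validate_json(foo.model_dump_json())
assert restored.modality is Modalities.BEHAVIOR
```

This keeps round-tripping working, but one such validator is needed per complex type, which is part of why the plain string Enum still looks simpler.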