bruno-f-cruz opened 3 months ago
This is actually more serious than I initially gave it credit for. This pattern completely prevents users from using the `model_dump` method:

```python
import examples
from examples.fip_ophys_rig import rig
from aind_data_schema_models.modalities import Modality

rig.modalities = set([Modality.BEHAVIOR])
rig.model_dump()
```

which raises the following error:
```
Traceback (most recent call last):
  File ".\aind-data-schema\deserialization_order.py", line 27, in <module>
    rig.model_dump()
  File ".\aind-data-schema\.venv\Lib\site-packages\pydantic\main.py", line 314, in model_dump
    return self.__pydantic_serializer__.to_python(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: unhashable type: 'dict'
```
It seems like an issue with the pydantic serializer. Even this minimal example causes issues:

```python
from dataclasses import dataclass
from typing import Set

from pydantic import BaseModel


@dataclass(frozen=True)
class Cell:
    value: str


# This works
set([Cell(value="a"), Cell(value="b")])


class WeightedSeeds(BaseModel):
    seeds: Set[Cell]


foo = WeightedSeeds(seeds=set([Cell(value="a"), Cell(value="b")]))

# This raises unhashable type errors
foo.model_dump()
```
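For reference, the default serializer can be steered around with a `field_serializer` that dumps the set as a sorted list of dicts. This is only a hedged workaround sketch for the minimal example above, not a fix for the underlying bug:

```python
from dataclasses import asdict, dataclass
from typing import Set

from pydantic import BaseModel, field_serializer


@dataclass(frozen=True)
class Cell:
    value: str


class WeightedSeeds(BaseModel):
    seeds: Set[Cell]

    @field_serializer("seeds")
    def _serialize_seeds(self, seeds: Set[Cell]):
        # Bypass the default set serializer, which trips over unhashable dicts,
        # by converting each element to a plain dict ourselves
        return sorted((asdict(c) for c in seeds), key=lambda d: d["value"])


foo = WeightedSeeds(seeds={Cell(value="a"), Cell(value="b")})
print(foo.model_dump())  # {'seeds': [{'value': 'a'}, {'value': 'b'}]}
```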
Possible solutions:

1- Simplify everything and go back to a native `Enum`. This is the preferred solution, as it is rather isomorphic across programming languages and JSON.

2- Write our own (de)serializers in a base class to be inherited by all complex Enum types. This strategy has a few problems: we would serialize a single property (e.g. the `name` property) and deserialize by looking into the allowed types and finding the correct class to deserialize to. This means that all types must be discriminated unions, with the discriminator (`Literal`) being the same as the property being serialized, and read-only (i.e. provided at construction time). Otherwise any sort of custom deserialization will inevitably be lossy.

3- Keep a service somewhere that takes a string and returns these complex objects. This is potentially very interesting, as it would allow the schema to be decoupled from the specifics of this database. For instance, the schema would only have to keep track of all possible manufacturers, without having to keep track of abbreviations, numeric IDs and other properties that might otherwise change with time.
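A minimal sketch of what option 3 could look like: the schema stores only a stable string key, and a separate service resolves it to the full record. All names here (`ModalityRecord`, `resolve_modality`, the registry contents) are hypothetical, not part of aind-data-schema; in practice the registry could be a web service or database:

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ModalityRecord:
    name: str
    abbreviation: str


# The "service": here just an in-memory registry keyed by the string the
# schema would actually persist
_REGISTRY: Dict[str, ModalityRecord] = {
    "behavior": ModalityRecord("Behavior", "behavior"),
    "ophys": ModalityRecord("Planar optical physiology", "ophys"),
}


def resolve_modality(abbreviation: str) -> ModalityRecord:
    """Return the full record for a serialized abbreviation."""
    return _REGISTRY[abbreviation]


# The schema only stores "ophys"; the service owns the mutable details
record = resolve_modality("ophys")
print(record.name)  # Planar optical physiology
```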
@saskiad @dyf looping you in too. I think this is a rather annoying regression bug (since these started off as an Enum), especially since other AIND tools like the watchdog could benefit greatly from this change.
I will still look into the second solution, but it just doesn't feel right. We should really just make all of these `Enum`s. It would also solve the weird pattern of typing the properties with `Type.ONE_OF` but using a different instance (`Type.Value`).
Enums would be nice. I can't recall exactly what the issue was that made me switch, but enums of objects were creating headaches somewhere. I do remember now that when we upgraded to v2, my original enum metadata class broke, so I went with this solution. I'm open to suggestions for improvements, since this has always been troublesome.
I also remember that most of the docs I was reading were using tagged unions for this kind of situation, though not quite the same situation.
@jtyoung84 It's late and I might be missing something very obvious, but what about this?
```python
from enum import Enum
from typing import Dict, Set

from aind_data_schema_models.modalities import Modality
from pydantic import BaseModel, Field


class Modalities(Enum):
    BEHAVIOR = Modality.BEHAVIOR
    OPHYS = Modality.POPHYS
    ICEPHYS = Modality.ICEPHYS


class SubSetOfModalities(Enum):
    BEHAVIOR = Modality.BEHAVIOR
    OPHYS = Modality.POPHYS


class Foo(BaseModel):
    modality: Dict[Modalities, str] = Field(default={Modalities.BEHAVIOR: "foo"})
    stuff: Set[Modalities] = Field(default={Modalities.BEHAVIOR, Modalities.OPHYS})
    another_subset: SubSetOfModalities = Field(default=SubSetOfModalities.BEHAVIOR, validate_default=True)


foo = Foo()
print(foo)
print(foo.model_dump())
```
This is just a proof of concept; we could improve the subtyping a bit, I think. I think it would be so much cleaner for users too!
To be clear, I still think Enum of strings would be far easier to maintain.
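To illustrate the maintenance argument, a hedged sketch of what the string-enum version could look like (`ModalityName` and `Rig` are hypothetical stand-ins for the real classes): string-valued enums round-trip through JSON without any custom machinery.

```python
from enum import Enum
from typing import Set

from pydantic import BaseModel


# Hypothetical string-valued stand-in for the real Modality classes
class ModalityName(str, Enum):
    BEHAVIOR = "behavior"
    OPHYS = "ophys"


class Rig(BaseModel):
    modalities: Set[ModalityName]


rig = Rig(modalities={ModalityName.BEHAVIOR, ModalityName.OPHYS})

# Sets of string enums dump and round-trip cleanly
dumped = rig.model_dump_json()
assert Rig.model_validate_json(dumped) == rig
```

Any extra metadata (full names, numeric IDs) would then live outside the schema, keyed by these strings.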
I have been playing with this a bit more, and I think we can make this change fully backward compatible with users' code. The trick would be to define the subsets of `Modalities` as `Literal`s (this is the Python standard in most libraries anyway, so I think it is actually a good idea...).
The previous example would thus become:
```python
from enum import Enum
from typing import Dict, Literal, Set

from aind_data_schema_models.modalities import Modality
from pydantic import BaseModel, Field


class Modalities(Enum):
    BEHAVIOR = Modality.BEHAVIOR
    OPHYS = Modality.POPHYS
    ICEPHYS = Modality.ICEPHYS


class Foo(BaseModel):
    modality: Dict[Modalities, str] = Field(default={Modalities.BEHAVIOR: "foo"})
    stuff: Set[Modalities] = Field(default={Modalities.BEHAVIOR, Modalities.OPHYS})
    another_subset: Literal[Modalities.BEHAVIOR, Modalities.OPHYS] = Field(..., validate_default=True)


foo = Foo(another_subset=Modalities.BEHAVIOR)  # type hint: "another_subset: Literal[Modalities.BEHAVIOR, Modalities.OPHYS]"
print(foo)
print(foo.model_dump())
```
Ignore the previous two suggestions. I don't think we want to go down this route, as both the json-schema and round-trip (de)serialization break:

```python
print(foo.model_validate_json(foo.model_dump_json()))
```

leads to:
```
pydantic_core._pydantic_core.ValidationError: 4 validation errors for Foo
modality.name='Behavior' abbreviation='behavior'.[key]
  Input should be Behavior(name='Behavior', abbreviation='behavior'), POphys(name='Planar optical physiology', abbreviation='ophys') or Icephys(name='Intracellular electrophysiology', abbreviation='icephys') [type=enum, input_value="name='Behavior' abbreviation='behavior'", input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/enum
stuff.0
  Input should be Behavior(name='Behavior', abbreviation='behavior'), POphys(name='Planar optical physiology', abbreviation='ophys') or Icephys(name='Intracellular electrophysiology', abbreviation='icephys') [type=enum, input_value={'name': 'Planar optical ...'abbreviation': 'ophys'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.7/v/enum
stuff.1
  Input should be Behavior(name='Behavior', abbreviation='behavior'), POphys(name='Planar optical physiology', abbreviation='ophys') or Icephys(name='Intracellular electrophysiology', abbreviation='icephys') [type=enum, input_value={'name': 'Behavior', 'abbreviation': 'behavior'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.7/v/enum
another_subset
  Input should be <Modalities.BEHAVIOR: Behavior(name='Behavior', abbreviation='behavior')> or <Modalities.OPHYS: POphys(name='Planar optical physiology', abbreviation='ophys')> [type=literal_error, input_value={'name': 'Behavior', 'abbreviation': 'behavior'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.7/v/literal_error
```
It seems pydantic can't automatically convert the JSON object instance back to the corresponding Python class. This could be done with an extra validator, but I still think we should go with an `Enum` of type string to keep the solution simple.
Not sure if there is a different intended use...
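For completeness, a sketch of what that extra validator could look like: a `mode="before"` validator that maps the dumped JSON object back onto the matching enum member. `ModalityInfo` and the members here are hypothetical stand-ins for the real Modality classes:

```python
from dataclasses import dataclass
from enum import Enum

from pydantic import BaseModel, field_validator


# Hypothetical stand-in for the real Modality classes
@dataclass(frozen=True)
class ModalityInfo:
    name: str
    abbreviation: str


class Modalities(Enum):
    BEHAVIOR = ModalityInfo("Behavior", "behavior")
    OPHYS = ModalityInfo("Planar optical physiology", "ophys")


class Foo(BaseModel):
    modality: Modalities

    @field_validator("modality", mode="before")
    @classmethod
    def _coerce_modality(cls, v):
        # Map a dumped JSON object back onto the matching enum member;
        # pass enum members (and anything else) through unchanged
        if isinstance(v, dict):
            for member in Modalities:
                if member.value.abbreviation == v.get("abbreviation"):
                    return member
        return v


# A dict (as produced by serialization) now validates back to the member
foo = Foo.model_validate({"modality": {"name": "Behavior", "abbreviation": "behavior"}})
assert foo.modality is Modalities.BEHAVIOR
```

This works, but it hard-codes the lookup logic per field, which is exactly the maintenance burden the string-enum approach avoids.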
To reproduce: