dapper91 / pydantic-xml

python xml for humans
https://pydantic-xml.readthedocs.io
The Unlicense
141 stars 14 forks source link

Cannot serialize non-generic model with generic model inside #159

Open ajoino opened 5 months ago

ajoino commented 5 months ago

You currently cannot serialize a non-generic model with a generic model inside of it. While to_xml() doesn't work, model_dump_json() still works, so it doesn't seem to be a inherent problem with serialization, rather a limitation of the current xml serialization implementation.

Example model:

from pydantic_xml import BaseXmlObject, element
from Typing import Generic, TypeVar

T = TypeVar("T")

class GenericModel(BaseXmlModel, Generic[T]):
    generic: T_0 = element("GenericElement")

class NonGenericWrapper(BaseXmlModel):
    wrapped_generic: GenericModel = element("WrappedGeneric")

Testing model_dump_json() and to_xml()

>>> NonGenericWrapper(wrapped_generic=GenericModel[int](generic=10)).model_dump_json()
'{"wrapped_generic":{"generic":10}}'
>>> NonGenericWrapper(wrapped_generic=GenericModel[int](generic=10)).to_xml()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/.local/lib/python3.10/site-packages/pydantic_xml/model.py", line 399, in to_xml
    return etree.tostring(self.to_xml_tree(skip_empty=skip_empty), **kwargs)
  File "/root/.local/lib/python3.10/site-packages/pydantic_xml/model.py", line 380, in to_xml_tree
    self.__xml_serializer__.serialize(
  File "/root/.local/lib/python3.10/site-packages/pydantic_xml/serializers/factories/model.py", line 170, in serialize
    field_serializer.serialize(
  File "/root/.local/lib/python3.10/site-packages/pydantic_xml/serializers/factories/model.py", line 366, in serialize
    assert self._model.__xml_serializer__ is not None, f"model {self._model.__name__} is partially initialized"
AssertionError: model GenericModel is partially initialized

I think this limitation occurs because the serialization was changed to no longer accept the Any type, which in many cases can be replaced with raw fields, but in many cases it shouldn't have to be. The reason I can't make the NonGenericWrapper generic is because in my use case the NonGenericWrapper.wrapped_generic field is actually a sequence of elements each of which can have a separate generic instantiation, thus requiring generic variadics (which might still not work because the model is quite complex).

I think the choice to reject serialization of Any should be reconsidered, at least if we want parity with core pydantic serialization. Although I realize that the removal of Any serialization was probably done for a good reason and might be complicated to implement.

dapper91 commented 5 months ago

@ajoino Hi,

Due to xml complexity the model xml serializer/deserializer is built during model definition, not model instantiation that is why you get the error: model GenericModel is partially initialized. So any generic model arguments should be defined before the first instantiation.

The problem with Any is that it is not obvious which type will be behind it. It could be primitive type (and that ok), but it also could be model, collection or raw type and that brings us back to the first problem: the serializer is build during model definition, but we don't know the exact type. That's why raw elements should be used instead of it.

Considering your problem, do you know the wrapped_generic sequence allowed types beforehand? Maybe Union can solve your problem:

>>> from pydantic_xml import BaseXmlModel, element
>>> from typing import Generic, TypeVar, List, Union
>>>
>>> T = TypeVar("T")
>>>
>>> class GenericModel(BaseXmlModel, Generic[T]):
...     generic: T = element("GenericElement")
...
>>> class NonGenericWrapper(BaseXmlModel):
...     wrapped_generic: List[Union[
...         GenericModel[int],
...         GenericModel[float],
...         GenericModel[str],
...     ]] = element("WrappedGeneric")
...
>>> wrp = NonGenericWrapper(
...     wrapped_generic=[
...         GenericModel[int](generic=10),
...         GenericModel[float](generic=10.5),
...         GenericModel[str](generic='text'),
...     ]
... )
>>> print(wrp.to_xml(pretty_print=True).decode())
<NonGenericWrapper>
  <WrappedGeneric>
    <GenericElement>10</GenericElement>
  </WrappedGeneric>
  <WrappedGeneric>
    <GenericElement>10.5</GenericElement>
  </WrappedGeneric>
  <WrappedGeneric>
    <GenericElement>text</GenericElement>
  </WrappedGeneric>
</NonGenericWrapper>
ajoino commented 5 months ago

I've slept on this and I think I was perhaps a bit hasty suggesting that the current serialization scheme should be revisited, and I should have instead tried to approach this differently. I have created a union type in Python that covers the xsd:anySimpleType type, but I'm still struggling a bit with the xsd:anyType and the dreaded xsd:any. The problem with these two is that if I have an element like this

<xsd:complexType name="ContainsAnyStuff">
    <xsd:sequence>
        <xsd:element name="ThisIsProblematic" type="xsd:anyType" minOccurs="1" maxOccurs="unbounded"/>
    </xsd:sequence>
    <xsd:any namespace="##any"/>
</xsd:complexType>

I don't even know what tag will be used for the xsd:any elements, so pydantic-xml cannot model these element currently AFAIK. As you've said, the ThisIsProblematic element can probably we well covered by a combination of raw elements and type unions.

Regarding xsd:any elements, to model such elements, we need some sort of escape hatch that will capture all elements of an xml document that are not explicitly part of the pydantic-xml model, here's one possible API:

class CaptureUnassignedElements(BaseXmlModel, allow_arbitrary_types=True, capture_unassigned_elements=True):
    test: int = element("Test")

input_str = """
    <CaptureUnassignedElements>
        <test>23</test>
        <abc123>test</abc123>
        <empty/>
    </CaptureUnassignedElements>
"""

cua_test = CaptureUnassignedElements.from_xml(input_str)

print(cua_test.test)  # prints 23
print(cua_test.any_)  # print [lxml.etree._Element(...), lxml.etree._Element(...)] or something similar

Essentially, by giving a model a specific flag at construction it will capture all elements that aren't in the model description and make the accessable through a special attribute, name TBD. In the proposal above it's a list, but it could perhaps be an etree.Element/etree._Element that contains the captured elements. It can be further restricted by requiring that the xsd:any elements come last in a xsd:sequence to make it easier for the parser, that is at least how I've seen it used the majority of the time.

What do you think about this proposal?

dapper91 commented 3 months ago

@ajoino Hi,

I am not sure defining any_ phantom field is a very intuitive approach. Furthermore user can define any_ field in a model by himself which could lead to some non-obvious errors. Maybe defining a special method to handle unbound xml entities will be more intuitive.

What do you think about this:

class CaptureUnassignedElements(BaseXmlMode):
    test: int = element()

    @unbound_handler()
    def handle_unbound(self, text: Optional[str], attrs: Optional[Dict[str, str]], elements: List[ElementT]) -> None:
        """
        :param text: element text if unbound
        :param attrs: unbound element attributes
        :param elements: unbound sub-elements
        """

        assert elements[0].tag == 'abc123'
        assert elements[0].text == 'test'

        assert elements[1].tag == 'empty'