dapper91 / pydantic-xml

python xml for humans
https://pydantic-xml.readthedocs.io
The Unlicense

Repeating Collections as Tuple[MyType, ...] not working as expected #128

Closed skewty closed 8 months ago

skewty commented 9 months ago

senderdata will not parse correctly.

<request>
    <senderdata>
        <address>400</address>
        <name></name>
        <address>401</address>
        <name>Bob Andersen</name>
        <address>402</address>
        <name>Hansi</name>
        <address>403</address>
        <name>George Lucas</name>
        <address>404</address>
        <name>Michael Jensen</name>
        <address>406</address>
        <name>406</name>
        <address>407</address>
        <name>Fenger</name>
        <address>408</address>
        <name>408</name>
        <address>410</address>
        <name>410</name>
    </senderdata>
</request>
class FooPersonData(BaseXmlModel):
    address: str = element(tag="address", default="")
    name: str = element(tag="name", default="")
    # there are other optional fields as well
    # number of fields provided is unknown

class FooRequest(BaseXmlModel, tag="request", search_mode="unordered"):
    sender_data: tuple[FooPersonData, ...] | None = element(tag="senderdata", default=None)
dapper91 commented 9 months ago

@skewty Hi

Your model is incorrect. A sub-model is bound to an entire sub-element. More information here. In your example, the first FooPersonData is bound to the first senderdata sub-element. Each subsequent FooPersonData in the tuple is bound to the next senderdata element.

In other words your model is defined for the following document:

<request>
    <senderdata>
        <address>400</address>
        <name></name>
    </senderdata>
    <senderdata>
        <address>401</address>
        <name>Bob Andersen</name>
    </senderdata>
    <senderdata>
        <address>402</address>
        <name>Hansi</name>
    </senderdata>
   ...
</request>
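
To see the difference between the two document shapes, it may help to count the senderdata elements in each. The following is a minimal sketch that uses only the standard library's xml.etree.ElementTree rather than pydantic-xml itself; the variable names are illustrative and the documents are shortened versions of the ones above.

```python
import xml.etree.ElementTree as ET

# Flat layout from the original report: one <senderdata> holding every pair.
flat_xml = """
<request>
    <senderdata>
        <address>400</address>
        <name></name>
        <address>401</address>
        <name>Bob Andersen</name>
    </senderdata>
</request>
"""

# Nested layout that tuple[FooPersonData, ...] describes:
# one <senderdata> element per person.
nested_xml = """
<request>
    <senderdata>
        <address>400</address>
        <name></name>
    </senderdata>
    <senderdata>
        <address>401</address>
        <name>Bob Andersen</name>
    </senderdata>
</request>
"""

# Count the direct <senderdata> children of <request> in each layout.
flat_count = len(ET.fromstring(flat_xml).findall("senderdata"))
nested_count = len(ET.fromstring(nested_xml).findall("senderdata"))
print(flat_count, nested_count)
```

In the flat layout there is only a single senderdata element for the tuple to bind to, no matter how many address/name pairs it contains.
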
skewty commented 9 months ago

Thanks for the detailed response.

Is it possible to get pydantic-xml to work with the data above?

I looked and didn't see an equivalent to pydantic's RootModel in pydantic-xml, which in my mind would be something like: __root__: tuple[FooPersonData, ...].

skewty commented 9 months ago
class FooPersonData(BaseXmlModel, tag="persondata", search_mode="unordered"):
    address: tuple[str, ...] | None = element(tag="address", default=None)
    name: tuple[str, ...] | None = element(tag="name", default=None)

class FooRequest(BaseXmlModel, tag="request", search_mode="unordered"):
    sender_data: FooPersonData | None = element(tag="senderdata", default=None)

This is the best I can come up with; I would then "zip" them together myself. The model.to_xml() output is obviously different in this case.

skewty commented 9 months ago
request = FooRequest.from_xml("""<request>
    <senderdata>
        <address>400</address>
        <name></name>
        <address>401</address>
        <name>Bob Andersen</name>
        <address>402</address>
        <name>Hansi</name>
        <address>403</address>
        <name>George Lucas</name>
        <address>404</address>
        <name>Michael Jensen</name>
        <address>406</address>
        <name>406</name>
        <address>407</address>
        <name>Fenger</name>
        <address>408</address>
        <name>408</name>
        <address>410</address>
        <name>410</name>
    </senderdata>
</request>
""")
for name, address in zip(request.sender_data.name, request.sender_data.address):
    print(f"name={name!r} address={address!r}")

Gives:

name='Michael Jensen' address='400'
name='Hansi' address='401'
name='406' address='402'
name='Bob Andersen' address='403'
name='Fenger' address='404'
name='George Lucas' address='406'
name='408' address='407'

which is wrong, since the empty name needs to match address 400. This throws all the rest off.

dapper91 commented 9 months ago

I came up with this:

from pydantic_xml import BaseXmlModel, element, wrapped

class Address(BaseXmlModel, tag='address'):
    value: str

class Name(BaseXmlModel, tag='name'):
    value: str = ""

class FooPersonData(BaseXmlModel):
    address: str = element(tag="address", default="")
    name: str = element(tag="name", default="")

class FooRequest(BaseXmlModel, tag='request'):
    sender_data_raw: tuple[Address | Name, ...] = wrapped(path="senderdata", default=None)

    @property
    def sender_data(self) -> tuple[FooPersonData, ...] | None:
        if self.sender_data_raw is not None:
            return tuple((
                FooPersonData(address=address.value, name=name.value)
                for address, name in zip(self.sender_data_raw[0::2], self.sender_data_raw[1::2])
            ))
skewty commented 9 months ago

That's awesome! How can I buy you a coffee / tea or something for your effort?

skewty commented 8 months ago

Coming back to this so I can use pydantic v2 and pydantic-xml in production.

I think there is an issue in serialization in pydantic-xml because:

class PersonDataDef(BaseXmlModel, populate_by_name=True, skip_empty=False):
    address: Annotated[str | None, element(tag="address", default=None)]
    name: Annotated[str | None, element(tag="name", default=None)]
    location: Annotated[str | None, element(tag="location", default=None)]
    status: Annotated[int | None, element(tag="status", default=None)]
    status_info: Annotated[str | None, element(tag="statusinfo", default=None)]

    @model_validator(mode="before")
    @classmethod
    def _watch_out_for_nones(cls, values: dict) -> dict:
        return values  # the dict here doesn't contain keys+value pairs for status nor statusinfo

but when I go to serialize using output_xml = request.to_xml(skip_empty=True)

I am seeing:

<senderdata><address>991</address><name>991</name><location>SME VoIP</location><status>None</status><statusinfo>None</statusinfo></senderdata>

in the output. This <status>None</status><statusinfo>None</statusinfo> is incorrect / invalid.

skewty commented 8 months ago

I was trying for a cleaner solution than what you have above. Your solution above didn't serialize correctly in production, so I needed to solve that too.

This is the approach I was working with before I stopped / got stuck:

class SenderDataDef(RootXmlModel, tag="senderdata"):
    root: tuple[PersonDataDef, ...]

    @classmethod
    def __build_serializer__(cls) -> None:
        super().__build_serializer__()
        patched_deserialize = partial(cls._deserialize, cls.__xml_serializer__)
        setattr(cls.__xml_serializer__, "deserialize", patched_deserialize)

    @classmethod
    def _deserialize(cls, self: ModelSerializer, element: XmlElementReader | None, *, context: dict[str, Any] | None) -> BaseXmlModel | None:
        # actual_xml = """<senderdata>
        #     <address>400</address>
        #     <name></name>
        #     <address>401</address>
        #     <name>Bob Andersen</name>
        #     <address>402</address>
        #     <name>Hansi</name>
        #     <address>403</address>
        #     <name>George Lucas</name>
        #     <address>404</address>
        #     <name>Michael Jensen</name>
        #     <address>406</address>
        #     <name>406</name>
        #     <address>407</address>
        #     <name>Fenger</name>
        #     <address>408</address>
        #     <name>408</name>
        #     <address>410</address>
        #     <name>410</name>
        # </senderdata>"""
        items = [
            {"address": "400", "name": ""},
            {"address": "401", "name": "Bob Andersen"},
            {"address": "402", "name": "Hansi"},
            {"address": "403", "name": "George Lucas"},
            {"address": "404", "name": "Michael Jensen"},
            {"address": "406", "name": "406"},
            {"address": "407", "name": "Fenger"},
            {"address": "408", "name": "408"},
            {"address": "410", "name": "410"},
        ]  # is it actually possible to get this information out of element? I tried and couldn't figure it out
        result = tuple(PersonDataDef.model_validate(item) for item in items)
        return self._model.model_validate(result, strict=False, context=context)

but as you can see in the comment, I wasn't able to extract enough data from element. I was expecting to get the raw ElementTree object here so I could get all the children but couldn't figure it out.

If I got deserialize working, I was going to use a similar approach to serialize.

dapper91 commented 8 months ago

@skewty Hi

To answer the first question: the skip_empty flag set on the PersonDataDef model overrides the skip_empty flag passed to to_xml, so remove it from the model:

class PersonDataDef(BaseXmlModel, populate_by_name=True):
    address: Annotated[str | None, element(tag="address", default=None)]
    name: Annotated[str | None, element(tag="name", default=None)]
    location: Annotated[str | None, element(tag="location", default=None)]
    status: Annotated[int | None, element(tag="status", default=None)]
    status_info: Annotated[str | None, element(tag="statusinfo", default=None)]

or define your own None serialization format.

dapper91 commented 8 months ago

I was expecting to get the raw ElementTree object here so I could get all the children but couldn't figure it out.

You can get the raw ElementTree from the XmlElementReader and iterate over all sub-elements:

element_iter = element.to_native()
items = [
    {
        'address': addr.text,
        'name': name.text,
    }
    for addr, name in zip(element_iter[0::2], element_iter[1::2])
]
skewty commented 8 months ago

Not sure how I didn't figure that out when I tried something similar myself. Anyway, your very helpful assistance quickly led to this solution:

class PersonDataDef(BaseXmlModel, populate_by_name=True, skip_empty=False):
    # # address field may have type= attribute with one of: IPEI, ALARM, BEACON, CONFIG
    address: Annotated[str | None, element(tag="address", default=None)]
    name: Annotated[str | None, element(tag="name", default=None)]
    location: Annotated[str | None, element(tag="location", default=None)]
    status: Annotated[int | None, element(tag="status", default=None)]
    status_info: Annotated[str | None, element(tag="statusinfo", default=None)]

class SenderDataDef(RootXmlModel, tag="senderdata"):
    root: tuple[PersonDataDef, ...]

    @classmethod
    def __build_serializer__(cls) -> None:
        super().__build_serializer__()
        patched_deserialize = partial(cls._deserialize, cls.__xml_serializer__)
        setattr(cls.__xml_serializer__, "deserialize", patched_deserialize)
        patched_serialize = partial(cls._serialize, cls.__xml_serializer__)
        setattr(cls.__xml_serializer__, "serialize", patched_serialize)

    @classmethod
    def _deserialize(
        cls, self: ModelSerializer, element_: XmlElementReader | None, *, context: dict[str, Any] | None
    ) -> BaseXmlModel | None:
        items = []
        item = {}
        for child_element in element_:
            if child_element.tag in item:
                items.append(item)
                item = {}
            item[child_element.tag] = "" if child_element.text is None else child_element.text
        if len(item) > 0:
            items.append(item)
        result = tuple(PersonDataDef.model_validate(item) for item in items)
        return self._model.model_validate(result, strict=False, context=context)

    @classmethod
    def _serialize(
        cls,
        self: ModelSerializer,
        element_: "XmlElementWriter",
        value: BaseXmlModel,
        encoded: Dict[str, Any],
        *,
        skip_empty: bool = False,
    ) -> XmlElementWriter | None:
        for item in encoded:  # type: dict
            for tag, text in item.items():
                if text is not None:
                    e = element_.make_element(tag, None)
                    e.set_text(text)
                    element_.append_element(e)
        return element_

    def __len__(self):
        return self.root.__len__()

    def __getitem__(self, item):
        return self.root.__getitem__(item)

Now that I have that model working, it is failing on many others (simple models that should work). There seems to be something wrong with upstream deserialization. I believe the issue comes from xml.etree.ElementTree.Element: an empty string is mistakenly becoming None. Regardless, it should be caught, and None should be converted to ''.

Look at this line of code:

item[child_element.tag] = "" if child_element.text is None else child_element.text

Please observe how pydantic-xml isn't compatible with itself in the following example:

from typing import Annotated
from pydantic_xml import BaseXmlModel, element

class PersonDataDef(BaseXmlModel, populate_by_name=True, skip_empty=False, tag="persondata"):
    # # address field may have type= attribute with one of: IPEI, ALARM, BEACON, CONFIG
    address: Annotated[str | None, element(tag="address", default=None)]
    name: Annotated[str | None, element(tag="name", default=None)]
    location: Annotated[str | None, element(tag="location", default=None)]
    status: Annotated[int | None, element(tag="status", default=None)]
    status_info: Annotated[str | None, element(tag="statusinfo", default=None)]

input_xml = "<persondata><address>991</address><status>0</status><statusinfo/></persondata>"
input_model = PersonDataDef.from_xml(input_xml)
output_xml = input_model.to_xml(skip_empty=True)
output_model = PersonDataDef.from_xml(output_xml)
assert input_model == output_model

Would you like me to create a new issue, or shall we change the title of this issue to something like "empty string becoming None during deserialization" and re-purpose this issue?

dapper91 commented 8 months ago

Please observe how pydantic-xml isn't compatible with itself in the following example ...

Empty texts indeed become None during deserialization, and that is how the underlying deserialization libraries work (xml.etree, lxml) which seems reasonable since the text values are not actually provided.
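
This behaviour can be checked directly against xml.etree.ElementTree. The snippet is a small sketch with illustrative variable names; lxml is not exercised here.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<persondata><name></name><statusinfo/></persondata>")

# Both spellings of an empty element carry no text node, so .text is None.
name_text = root.find("name").text
statusinfo_text = root.find("statusinfo").text

# Writing an empty string back produces an empty element again, so ''
# and None cannot be told apart after a round trip through XML.
element = ET.Element("name")
element.text = ""
reparsed_text = ET.fromstring(ET.tostring(element)).text

print(name_text, statusinfo_text, reparsed_text)
```
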

The problem during serialization is that XML, unlike JSON for example, doesn't support a None type natively, so it is not obvious how to serialize it. There are many ways to do that: 'None', 'none', 'nil', '', 'xsi:nil', etc.; I have seen all of them. So right now it is up to the developer to solve that. If you wish to alter the default serialization format, you can define your own type like this:

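from typing import Annotated, Optional, TypeVar
from pydantic import PlainSerializer

InnerType = TypeVar('InnerType')
# Optional variant that serializes None as an empty string
XmlOptional = Annotated[
    Optional[InnerType],
    PlainSerializer(lambda val: val if val is not None else ''),
]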

and use it instead of Optional.

Anyway, I am thinking about changing the default None serialization format in the next release.

dapper91 commented 8 months ago

@skewty Starting from 2.3.0, a None value is encoded as an empty string.

skewty commented 8 months ago

All my model unit tests pass on 2.3.0. I went looking for a financial contribution method on the GitHub project page and didn't see one. Maybe check out https://github.com/sponsors when you have some free time. Sincere thank you.