dapper91 / pydantic-xml

python xml for humans
https://pydantic-xml.readthedocs.io
The Unlicense
141 stars 14 forks source link

Clarify ambiguity for empty element text #154

Closed bodograumann closed 5 months ago

bodograumann commented 6 months ago

There is some ambiguity, how an element with empty text content should be parsed.

For example I have the following schema:

  <xsd:complexType name="Country" mixed="false">
    <xsd:simpleContent>
      <xsd:extension base="xsd:string">
        <xsd:attribute name="code" type="xsd:string" use="required"/>
      </xsd:extension>
    </xsd:simpleContent>
  </xsd:complexType>

Naively I would implement this with pydantic-xml as

from pydantic_xml import BaseXmlModel

class Country(BaseXmlModel):
    name: str
    code: str = attr()

This does not allow the name to be the empty string however. The following is a valid document according to the schema:

<Country code="GER"></Country>

However, when parsing with pydanic_xml, I get an error:

>>> Country.from_xml("""<Country code="GER"></Country>""")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bodo/.cache/pypoetry/virtualenvs/holper-qGY2Aevk-py3.11/lib/python3.11/site-packages/pydantic_xml/model.py", line 346, in from_xml
    return cls.from_xml_tree(etree.fromstring(source), context=context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bodo/.cache/pypoetry/virtualenvs/holper-qGY2Aevk-py3.11/lib/python3.11/site-packages/pydantic_xml/model.py", line 329, in from_xml_tree
    obj = typing.cast(ModelT, cls.__xml_serializer__.deserialize(XmlElement.from_native(root), context=context))
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bodo/.cache/pypoetry/virtualenvs/holper-qGY2Aevk-py3.11/lib/python3.11/site-packages/pydantic_xml/serializers/factories/model.py", line 187, in deserialize
    return self._model.model_validate(result, strict=False, context=context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bodo/.cache/pypoetry/virtualenvs/holper-qGY2Aevk-py3.11/lib/python3.11/site-packages/pydantic/main.py", line 503, in model_validate
    return cls.__pydantic_validator__.validate_python(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 1 validation error for Country
name
  Field required [type=missing, input_value={'code': 'GER'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.5/v/missing

So I think we should either:

I am not sure which is the better option. The current implementation gives the user more options, but it is not in line with xml schema.

jwfraustro commented 6 months ago

@bodograumann If you would like name to be optional, you should type-hint it as Optional. This follows the Pydantic convention for optional fields.

For example, the following code achieves what you are looking for, with an example on how it works for countries with and without a name value:

from typing import Optional

from pydantic_xml import BaseXmlModel, attr

country_w_name = """<Country code="GER">Germany</Country>"""
country_wo_name = """<Country code="GER"></Country>"""

class Country(BaseXmlModel):
    name: Optional[str] = None  # or ""
    code: str = attr()

named_country = Country.from_xml(country_w_name)
unnamed_country = Country.from_xml(country_wo_name)

print(named_country.to_xml())
# <Country code="GER">Germany</Country>
print(unnamed_country.to_xml())
# <Country code="GER"></Country>
bodograumann commented 6 months ago

The point I wanted to make, @jwfraustro: it is not obvious whether in """<Country code="GER"></Country>""" the content is missing, or whether it is the empty string. You could define it either way. XML Schema seems to indicate that it should be interpreted as an empty string, while pydantic-xml currently interpretes it as missing. So either pydantic-xml should also regard it as an empty string, or the documentation should be expanded to make it clearer for the user.

Btw:

jwfraustro commented 6 months ago

Thank you for the linting advice; pipe unions are not always a valid option in the code I support. I was only hoping to offer you a solution to the problem at hand.

I see your point regarding xml considering it an empty string. I think you are probably correct in your interpretation.

I will note that the documentation does note that it regards unset values as None. It is mentioned here: https://pydantic-xml.readthedocs.io/en/latest/pages/misc.html#optional-type-encoding This section applies to encoding, but I wager pydantic-xml presumes and behaves the same when parsing xml.

mcleantom commented 6 months ago

Is the answer to just default to an empty string like so:

from pydantic_xml import BaseXmlModel, attr

class Country(BaseXmlModel, skip_empty=True):
    name: str = ""
    code: str = attr()

country_w_name = """<Country code="GER">Germany</Country>"""
country_wo_name = """<Country code="GER"></Country>"""

named_country = Country.from_xml(country_w_name)
unnamed_country = Country.from_xml(country_wo_name)

print(named_country.to_xml())
print(unnamed_country.to_xml())

Output:

b'<Country code="GER">Germany</Country>'
b'<Country code="GER"></Country>'
dapper91 commented 5 months ago

@bodograumann Hi.

The library deserializer receives the element text from the underlying xml parser (lxml or xml.etree) without any changes. They both interpret an empty element text as None not as an empty string:

>>> from lxml import etree
>>>
>>> xml = '<Country code="GER"></Country>'
>>>
>>> root = etree.fromstring(xml)
>>> print(root.text)
None

I will clarify that in the documentation.

dapper91 commented 5 months ago

added it to the documentation.

bodograumann commented 5 months ago

Thank you @dapper91 . I think this will be very helpful to newcomers.