gatkin / declxml

Declarative XML processing for Python
https://declxml.readthedocs.io/en/latest/
MIT License

Create processor from typing annotations #28

Open Huite opened 4 years ago

Huite commented 4 years ago

Hi @gatkin,

This looks like a great package, providing a much saner way to interact with XML data; the documentation is complete and clear as well.

Primitive processors

Being able to use both classes and namedtuples is very convenient, but I feel there's some duplication of info going on if you're using type annotations (as added in recent Python versions). To demonstrate what I mean:

from dataclasses import dataclass
import declxml as xml

@dataclass
class Extent:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

extent_processor = xml.user_object("extent", Extent, [
    xml.floating_point("xmin"),
    xml.floating_point("ymin"),
    xml.floating_point("xmax"),
    xml.floating_point("ymax"),
])

I'm stating twice that the attributes should be floats. It's pretty straightforward to define a function which does this for you:

type_mapping = {
    bool: xml.boolean,
    int: xml.integer,
    float: xml.floating_point,
    str: xml.string,
}

def make_processor(datacls):
    fields = []
    for name, vartype in datacls.__annotations__.items():
        xml_type = type_mapping[vartype]
        field = xml_type(name)
        fields.append(field)
    return xml.user_object(datacls.__name__.lower(), datacls, fields)

extent_processor = make_processor(Extent)

This is all you need for simple processors (for typing.NamedTuple as well, mutatis mutandis).
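To make the "mutatis mutandis" a bit more concrete, here is a minimal sketch showing that dataclasses and typing.NamedTuple classes expose their annotations in the same way, so the loop over field names and types carries over unchanged. The names field_types, ExtentData and ExtentTuple are made up for this illustration. On the declxml side I'd expect the only change to be handing the fields to xml.named_tuple rather than xml.user_object, but that part is not shown or tested here.

```python
from dataclasses import dataclass
from typing import NamedTuple, get_type_hints


@dataclass
class ExtentData:
    xmin: float
    ymin: float
    xmax: float
    ymax: float


class ExtentTuple(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def field_types(cls):
    """Return a {field name: annotated type} dict for a dataclass or NamedTuple."""
    # get_type_hints also resolves string annotations, unlike reading
    # cls.__annotations__ directly.
    return dict(get_type_hints(cls))


# Both kinds of class yield the same name/type pairs.
dataclass_fields = field_types(ExtentData)
namedtuple_fields = field_types(ExtentTuple)
```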

Aggregate processors

Aggregate processors are easy to include via recursion, although you probably want to encode the "aggregateness" somewhere. After some playing around, I find encoding it in the type to be most straightforward:

import abc

class Aggregate(abc.ABC):
    pass

@dataclass
class Extent(Aggregate):
    xmin: float
    ymin: float
    xmax: float
    ymax: float

@dataclass
class SpatialData(Aggregate):
    epsg: str
    extent: Extent

def make_processor(datacls):
    fields = []
    for name, vartype in datacls.__annotations__.items():
        if issubclass(vartype, Aggregate):
            field = make_processor(vartype)
        else:
            xml_type = type_mapping[vartype]
            field = xml_type(name)
        fields.append(field)
    return xml.user_object(datacls.__name__.lower(), datacls, fields)

spatialdata_processor = make_processor(SpatialData)

This provides a very concise way of defining (nested) data structures -- which I'd generally want to do anyway -- and turning them into XML processors with a single function call plus a new base class (which can even be monkey-patched at runtime, if needed).

I'm not sure you'd really want to put this in declxml (see the trouble below), but I do think it's useful (and non-trivial) enough to maybe warrant a section in the documentation. What do you reckon?

Optional, List, etc

I haven't tried it yet, but I'm pretty sure you can use typing.Optional and typing.List to map to the declxml equivalents.
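As a rough sketch of the introspection side only (standard library, Python 3.8+): typing.get_origin and typing.get_args can tell Optional[X] and List[X] apart from a plain type. The helper name classify and the Layer class are hypothetical. My assumption, which I haven't tested, is that "optional" would then translate to passing required=False to the primitive processor and "list" to wrapping the item processor in xml.array.

```python
from dataclasses import dataclass
from typing import List, Optional, Union, get_args, get_origin, get_type_hints


def classify(vartype):
    """Return (kind, inner_type), where kind is 'required', 'optional' or 'list'."""
    origin = get_origin(vartype)
    if origin is list:
        # List[X] -> a repeated element of type X
        (inner,) = get_args(vartype)
        return "list", inner
    if origin is Union:
        # Optional[X] is just Union[X, None]
        args = get_args(vartype)
        non_none = [a for a in args if a is not type(None)]
        if len(args) == 2 and len(non_none) == 1:
            return "optional", non_none[0]
    # Anything else is treated as a plain, required type.
    return "required", vartype


@dataclass
class Layer:
    name: str
    epsg: Optional[int]
    bands: List[float]


kinds = {name: classify(t) for name, t in get_type_hints(Layer).items()}
```

Other unions (for example Union[int, str]) fall through as "required" here, so a real implementation would need to decide how to handle them.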

Hiccups

There's some trouble due to the fact that XML separates attributes from elements. For the XML files I'm working with, I don't really see a reason to distinguish between attributes and elements (of course, neither does JSON, or TOML, etc.). But you need to encode it somehow, or the value won't end up in the right place in the XML. I can solve it in a slightly hacky way, by (ab)using typing.Union:

from typing import Union

class Attribute(abc.ABC):
    pass

@dataclass
class Example:
    a: Union[Attribute, int]
    b: int
    c: int

example = Example(1, 2, 3)

To write XML like this:

<example a="1">
<b>2</b>
<c>3</c>
</example>

We can check again by inspecting the annotations:

def is_union(vartype):
    return hasattr(vartype, "__args__") and (vartype.__args__[0] is Attribute)

This shouldn't trip up any type checker, but it is clearly not the intended use: you'll never provide an Attribute as the value.
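A slightly more defensive variant of that check, as a sketch: it uses typing.get_origin and typing.get_args (Python 3.8+) instead of peeking at __args__, and it also returns the real value type so the right primitive processor can be looked up afterwards. The helper name split_attribute is made up here. I'd assume the resulting flag decides whether the field is built as an attribute of the parent element or as a child element, but that declxml step is not shown.

```python
import abc
from dataclasses import dataclass
from typing import Union, get_args, get_origin, get_type_hints


class Attribute(abc.ABC):
    pass


def split_attribute(vartype):
    """Return (is_attribute, value_type) for an annotation."""
    if get_origin(vartype) is Union:
        args = get_args(vartype)
        if Attribute in args:
            # Drop the marker; whatever remains is the actual value type.
            rest = [a for a in args if a is not Attribute]
            if len(rest) == 1:
                return True, rest[0]
    # No marker found: treat the field as a regular element.
    return False, vartype


@dataclass
class Example:
    a: Union[Attribute, int]
    b: int
    c: int


layout = {name: split_attribute(t) for name, t in get_type_hints(Example).items()}
```

Unlike checking __args__[0], this does not depend on Attribute being listed first in the Union.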

There are more issues: sometimes you need to include names that aren't part of the dataclass or the namedtuple, e.g. an array in the XML where every entry is tagged "item":

<item value="-5980.333333333333" label="-5980" alpha="255" color="#0d0887"/>
<item value="-5863.464309399999" label="-5863" alpha="255" color="#1b068d"/>

I can't use something as general as "item" as my class name. This is how I want to see it in Python:

@dataclass
class Color(Attribute):
    value: str
    label: str
    alpha: str
    color: str

Of course, I can just fall back to regular use at any time, and provide the name which is only part of the processor, not of the dataclass:

color_processor = xml.user_object(
    "item",
    Color,
    [
        xml.string(".", attribute="value"),
        xml.string(".", attribute="label"),
        xml.string(".", attribute="alpha"),
        xml.string(".", attribute="color"),
    ],
)

At any rate, you can just mix and match as needed: when everything's encoded in the dataclass or namedtuple, you can generate the processors automatically; if not, you just have to write a few extra lines or provide an explicit name.

Similarly, there are cases where aliases are required. In my case, I'm lowercasing class names and replacing underscores with dashes, so it's sort of implicitly defined. Stuff like this makes me think it might be smarter to let the user figure out the details of their idiosyncratic XML format, and provide a "base recipe" to help them along a little.
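The naming convention I mean could look roughly like this. The function xml_name and its overrides parameter are hypothetical, just to illustrate how an implicit rule and explicit exceptions (such as the "item" case above) might live side by side in such a recipe:

```python
def xml_name(python_name, overrides=None):
    """Derive an XML element name from a Python class or field name."""
    overrides = overrides or {}
    # An explicit override wins, e.g. for a class that must be tagged "item".
    if python_name in overrides:
        return overrides[python_name]
    # Default rule: lowercase, underscores become dashes.
    return python_name.lower().replace("_", "-")


default_name = xml_name("Spatial_Data")
overridden_name = xml_name("Color", overrides={"Color": "item"})
```

The result would then be passed as the element name to the processor instead of datacls.__name__.lower().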

Or perhaps you see a better way that is nice and general?