gatkin / declxml

Declarative XML processing for Python
https://declxml.readthedocs.io/en/latest/
MIT License
37 stars, 7 forks

Use attributes as keys for dictionaries #13

Closed by EmilStenstrom 6 years ago

EmilStenstrom commented 6 years ago

Hi! I'm trying to parse a tricky piece of XML that has some important information in attributes, which I would like to turn into a dictionary. Instead of explaining this in writing, here's some code... I'm trying to write a processor that would make this assert pass:

import declxml as xml

data = """
    <item>
        <poll name="suggested_numplayers" title="User Suggested Number of Players" totalvotes="25">
            <results numplayers="1">
                <result value="Best" numvotes="0"/>
                <result value="Recommended" numvotes="0"/>
                <result value="Not Recommended" numvotes="9"/>
            </results>
            <results numplayers="2">
                <result value="Best" numvotes="4"/>
                <result value="Recommended" numvotes="12"/>
                <result value="Not Recommended" numvotes="4"/>
            </results>
            <results numplayers="3">
                <result value="Best" numvotes="11"/>
                <result value="Recommended" numvotes="7"/>
                <result value="Not Recommended" numvotes="1"/>
            </results>
        </poll>
    </item>
"""

processor = ...

parsed = xml.parse_from_string(processor, data)
assert parsed == {
    "players": {
        "1": {"Best": 0, "Recommended": 0, "Not Recommended": 9},
        "2": {"Best": 4, "Recommended": 12, "Not Recommended": 4},
        "3": {"Best": 11, "Recommended": 7, "Not Recommended": 1},
    }
}, parsed

I've tried two different ways to get this working.

  1. Using XPath element/@attribute syntax to "select" the value of the element and use the result of that as the key of the dictionary:
processor = xml.dictionary('item', [
    xml.dictionary("poll[@name='suggested_numplayers']", [
        xml.dictionary("results/@numplayers", [
            xml.dictionary("result/@value", [
                xml.integer("result", attribute="numvotes"),
            ])
        ])
    ], alias="players"),
])

This fails with "KeyError: '@'", likely because it searches for an element, not an attribute.

  2. Making each key a primitive processor and using attribute to filter down to the element's value.
processor = xml.dictionary('item', [
    xml.dictionary("poll[@name='suggested_numplayers']", [
        xml.dictionary(xml.string("results", attribute="numplayers"), [
            xml.dictionary(xml.string("result", attribute="value"), [
                xml.integer("result", attribute="numvotes"),
            ])
        ])
    ], alias="players"),
])

This fails with "TypeError: '_PrimitiveValue' object is not subscriptable", likely because it doesn't expect a primitive processor there, but a string.
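
The `KeyError: '@'` from the first attempt is consistent with how Python's standard `xml.etree.ElementTree` handles paths: its limited XPath support can filter on an attribute inside `[...]`, but it cannot select an attribute as the last step of a path. The sketch below uses only the standard library and assumes (not confirmed in this thread) that declxml passes element paths through to ElementTree. The trimmed-down `poll` document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# A trimmed-down stand-in for the poll element from the question.
poll = ET.fromstring(
    '<poll>'
    '<results numplayers="1"/>'
    '<results numplayers="2"/>'
    '</poll>'
)

# Filtering on an attribute value is supported.
two_players = poll.find("results[@numplayers='2']")

# Selecting the attribute itself as a path step is not supported.
try:
    poll.find("results/@numplayers")
    attribute_step_supported = True
except (KeyError, SyntaxError):
    attribute_step_supported = False

# Attribute values are read from the matched elements instead.
numplayers = [results.get("numplayers") for results in poll.findall("results")]
```

So the attribute values are available, but only after the element has been matched, which is why they cannot act as dictionary keys inside the path string itself.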

Is there any way to get the result I'm looking for? Anything similar to what I'm looking for?

EmilStenstrom commented 6 years ago
  3. Another way could be to filter on just one value, and use alias to set that specific value.
processor = xml.dictionary('item', [
    xml.dictionary("poll[@name='suggested_numplayers']", [
        xml.dictionary("results[@numplayers='1']", [
            xml.integer("result[@value='Best']", attribute="numvotes", alias="Best"),
            xml.integer("result[@value='Recommended']", attribute="numvotes", alias="Recommended"),
            xml.integer("result[@value='Not Recommended']", attribute="numvotes", alias="Not Recommended"),
        ], alias="1"),
        xml.dictionary("results[@numplayers='2']", [
            xml.integer("result[@value='Best']", attribute="numvotes", alias="Best"),
            xml.integer("result[@value='Recommended']", attribute="numvotes", alias="Recommended"),
            xml.integer("result[@value='Not Recommended']", attribute="numvotes", alias="Not Recommended"),
        ], alias="2"),
        xml.dictionary("results[@numplayers='3']", [
            xml.integer("result[@value='Best']", attribute="numvotes", alias="Best"),
            xml.integer("result[@value='Recommended']", attribute="numvotes", alias="Recommended"),
            xml.integer("result[@value='Not Recommended']", attribute="numvotes", alias="Not Recommended"),
        ], alias="3"),
    ], alias="players")
])

The problem with this is that different games (that's what the XML represents in my case) have different numplayers values, and I don't know the range ahead of time. So one way would be to rewrite this with a list comprehension that includes all numbers between 1 and 99 (I think that's the range):

processor = xml.dictionary('item', [
    xml.dictionary("poll[@name='suggested_numplayers']", [
        xml.dictionary("results[@numplayers='" + str(numplayers) + "']", [
            xml.integer("result[@value='Best']", attribute="numvotes", alias="Best"),
            xml.integer("result[@value='Recommended']", attribute="numvotes", alias="Recommended"),
            xml.integer("result[@value='Not Recommended']", attribute="numvotes", alias="Not Recommended"),
        ], alias=str(numplayers), required=False)
        for numplayers in range(1, 100)
    ], alias="players")
])

Which is almost what I'm looking for, except for the empty dicts. One way to get back to the desired result would be to add support for omit_empty to xml.dictionary, so empty dicts don't get included.

{'players': {'1': {'Best': 0, 'Recommended': 0, 'Not Recommended': 9}, '2': {'Best': 4, 'Recommended': 12, 'Not Recommended': 4}, '3': {'Best': 11, 'Recommended': 7, 'Not Recommended': 1}, '4': {}, '5': {}, '6': {}, '7': {}, '8': {}, '9': {}, '10': {}, '11': {}, '12': {}, '13': {}, '14': {}, '15': {}, '16': {}, '17': {}, '18': {}, '19': {}, '20': {}, '21': {}, '22': {}, '23': {}, '24': {}, '25': {}, '26': {}, '27': {}, '28': {}, '29': {}, '30': {}, '31': {}, '32': {}, '33': {}, '34': {}, '35': {}, '36': {}, '37': {}, '38': {}, '39': {}, '40': {}, '41': {}, '42': {}, '43': {}, '44': {}, '45': {}, '46': {}, '47': {}, '48': {}, '49': {}, '50': {}, '51': {}, '52': {}, '53': {}, '54': {}, '55': {}, '56': {}, '57': {}, '58': {}, '59': {}, '60': {}, '61': {}, '62': {}, '63': {}, '64': {}, '65': {}, '66': {}, '67': {}, '68': {}, '69': {}, '70': {}, '71': {}, '72': {}, '73': {}, '74': {}, '75': {}, '76': {}, '77': {}, '78': {}, '79': {}, '80': {}, '81': {}, '82': {}, '83': {}, '84': {}, '85': {}, '86': {}, '87': {}, '88': {}, '89': {}, '90': {}, '91': {}, '92': {}, '93': {}, '94': {}, '95': {}, '96': {}, '97': {}, '98': {}, '99': {}}}
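
Without such an option, the empty entries can also be dropped in a post-processing step after parsing. A minimal sketch in plain Python, run on a shortened copy of the output above; `drop_empty_players` is a hypothetical helper name and not part of declxml.

```python
def drop_empty_players(parsed):
    """Return a copy of the parsed result without empty player dictionaries."""
    players = {
        numplayers: votes
        for numplayers, votes in parsed["players"].items()
        if votes  # an empty dict is falsy, so unmatched player counts are skipped
    }
    return {"players": players}


# Shortened version of the output shown above.
parsed = {
    "players": {
        "1": {"Best": 0, "Recommended": 0, "Not Recommended": 9},
        "2": {"Best": 4, "Recommended": 12, "Not Recommended": 4},
        "4": {},
        "5": {},
    }
}

cleaned = drop_empty_players(parsed)
```

Note that a player count whose votes are all zero is kept, because its dictionary is not empty.
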
gatkin commented 6 years ago

Ooh, that is a tricky bit of XML parsing.

Right now, declxml does not support using a dynamic value parsed from the XML data as the key to the result dictionary.

Currently, declxml is mostly intended to create faithful representations of data formatted in XML as Python data types to allow for simple, straightforward conversions of data between XML and Python. It is not really intended to parse directly into the domain objects you want to use in your core business logic. In my applications, I usually treat values written and read by declxml as simple, dumb DTO values used at the boundaries of the application rather than internal domain objects used in the core of the application.

With declxml as it is today, my best recommendation would be to keep the parsing simple: read in the XML data mostly as is and then, once the data has been read in, apply a transformation to get it into the format you want to use in your domain objects. This could look something like

processor = xml.dictionary('item/poll', [
    xml.string('.', attribute='name'),
    xml.string('.', attribute='title'),
    xml.integer('.', attribute='totalvotes'),
    xml.array(xml.dictionary('results', [
        xml.string('.', attribute='numplayers'),
        xml.array(xml.dictionary('result', [
            xml.string('.', attribute='value'),
            xml.integer('.', attribute='numvotes'),
        ]), alias='results')
    ]))
], alias='poll')

which would read in data that looks like

{'totalvotes': 25,
 'results': [{'numplayers': '1', 'results': [{'value': 'Best', 'numvotes': 0}, {'value': 'Recommended', 'numvotes': 0}, {'value': 'Not Recommended', 'numvotes': 9}]},
             {'numplayers': '2', 'results': [{'value': 'Best', 'numvotes': 4}, {'value': 'Recommended', 'numvotes': 12}, {'value': 'Not Recommended', 'numvotes': 4}]},
             {'numplayers': '3', 'results': [{'value': 'Best', 'numvotes': 11}, {'value': 'Recommended', 'numvotes': 7}, {'value': 'Not Recommended', 'numvotes': 1}]}],
 'name': 'suggested_numplayers',
 'title': 'User Suggested Number of Players'}

which you could then massage and transform into the exact data shape you need.
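
As a rough illustration of that last step, the reshaping can be done with a nested dictionary comprehension. This is only a sketch in plain Python: `players_from_poll` is a hypothetical helper name, and the `poll` value is a shortened copy of the parsed data shown above.

```python
def players_from_poll(poll):
    """Map each numplayers value to a {vote label: vote count} dictionary."""
    return {
        results["numplayers"]: {
            result["value"]: result["numvotes"] for result in results["results"]
        }
        for results in poll["results"]
    }


# Shortened copy of the parsed data (only two of the three result groups).
poll = {
    "name": "suggested_numplayers",
    "totalvotes": 25,
    "results": [
        {"numplayers": "1", "results": [
            {"value": "Best", "numvotes": 0},
            {"value": "Recommended", "numvotes": 0},
            {"value": "Not Recommended", "numvotes": 9},
        ]},
        {"numplayers": "2", "results": [
            {"value": "Best", "numvotes": 4},
            {"value": "Recommended", "numvotes": 12},
            {"value": "Not Recommended", "numvotes": 4},
        ]},
    ],
}

reshaped = {"players": players_from_poll(poll)}
```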

gatkin commented 6 years ago

On the other hand, it may be very useful to be able to perform arbitrary transformations on data read in from declxml to get the data into exactly the format you want to consume it.

One possibility for achieving this would be to allow processors to supply arbitrary value transformers (not sure of the best name at the moment). These value transformers would simply be a pair of functions. The first takes the data as read by declxml and transforms it into whatever data structure you want to use internally in your application; it is invoked whenever you are parsing from XML. The second transforms the internal data structure back into the data structure that matches the shape of the XML data; it is invoked whenever you are serializing to XML.

This would look something like this:

def from_xml(xml_data):
    players = {}
    for result in xml_data:
        player_key = result['numplayers']
        player = {}
        for player_result in result['results']:
            player[player_result['value']] = player_result['numvotes']

        players[player_key] = player

    return players

def to_xml(players):
    xml_data = []
    # items() works on both Python 2 and 3 (iteritems() is Python 2 only)
    for player_key, player in players.items():
        votes = []
        # Use a distinct loop name so the function argument is not shadowed
        for vote_value, num_votes in player.items():
            votes.append({'value': vote_value, 'numvotes': num_votes})

        player_xml = {
            'numplayers': player_key,
            'results': votes,
        }

        xml_data.append(player_xml)

    return xml_data

array_value_mapper = xml.ValueMapper(from_xml_data=from_xml, to_xml_data=to_xml)

processor = xml.dictionary('item/poll', [
    xml.string('.', attribute='name'),
    xml.string('.', attribute='title'),
    xml.integer('.', attribute='totalvotes'),
    xml.array(xml.dictionary('results', [
        xml.string('.', attribute='numplayers'),
        xml.array(xml.dictionary('result', [
            xml.string('.', attribute='value'),
            xml.integer('.', attribute='numvotes'),
        ]), alias='results')
    ]), mapper=array_value_mapper)
], alias='poll')

This may also be a good solution for #12, since you could use it to first have declxml read the raw XML data into a dictionary, and then you could supply a function that takes the dictionary, constructs and initializes the object however you want, and returns it.

@EmilStenstrom what do you think of that solution?

gatkin commented 6 years ago

If you're interested, you can check out the quick and dirty change I made to prototype this in #14

EmilStenstrom commented 6 years ago

I REALLY appreciate the long, thoughtful answers. Thanks!

I hadn't thought about the distinction between dumb DTO objects and domain objects. Great catch. In my mind, I was trying to transform the XML into my domain objects.

The real use-case is a flow that looks like this:

  1. Fetch one XML file from a REST API and get the metadata for some boardgames from that file (this was really easy using declxml)
  2. Using the ids of each game in the previous request, make a new request for more detailed information about each game. Turn that XML response into a domain object, and add the metadata from the previous request to that object. (Here I had problems turning XML -> domain object, which as you say isn't the goal of declxml.)

When considering the distinction you brought up, I think I would be better served by just converting from XML -> dict (twice) with declxml, and then manually send the two data dicts to my custom constructor, and let it do the conversion needed to turn the dicts into my domain objects.
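
That plan could look roughly like the sketch below. Everything in it is illustrative: the `BoardGame` class, its field names, and the shape of the two dictionaries are assumptions, not something taken from the real API responses.

```python
from dataclasses import dataclass, field


@dataclass
class BoardGame:
    """Domain object built from two separately parsed dictionaries."""
    game_id: str
    name: str
    players: dict = field(default_factory=dict)

    @classmethod
    def from_parsed(cls, summary, details):
        # summary: dict parsed from the first (metadata) response
        # details: dict parsed from the second (per-game) response
        return cls(
            game_id=summary["id"],
            name=summary["name"],
            players=details.get("players", {}),
        )


# Made-up sample dictionaries standing in for the two parse results.
game = BoardGame.from_parsed(
    {"id": "123", "name": "7 Wonders"},
    {"players": {"2": {"Best": 4, "Recommended": 12, "Not Recommended": 4}}},
)
```

Keeping the conversion in a constructor like this leaves the parsed dictionaries as plain data at the edge of the application.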

I think I got fooled by the User defined classes feature, thinking those classes could be my domain objects. Looking at the two feature enhancements I posted, they are both ways to enable that to work, and since that's not the point of the library, maybe they are not such a good fit after all?

EmilStenstrom commented 6 years ago

Given that you would want to support turning XML -> domain objects, I think your ValueMapper approach would be a great feature! I would definitely use it!

Since I don't use the library to serialize, only to deserialize, I would want the to_xml function to be optional, so I don't have to go through the hassle if I don't need it.

EmilStenstrom commented 6 years ago

I see you deleted #14, does that mean you changed your mind about this feature?

gatkin commented 6 years ago

Oh no, sorry about that. I deleted #14 since that was just meant as a quick and dirty prototype, and I have started doing the real work on a separate branch #16. It is still in progress; I need to finish adding mapping support for namedtuples and user_objects as well as adding unit tests for those processors. I would also like to update the documentation for the new functionality before I do another PyPi release.

I have implemented it so that each mapping function (to_xml and from_xml) can be optional if you don't need it. An exception will be thrown if it is missing and you're trying to use it.
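
The "optional, but an error if used while missing" behaviour can be pictured with the small stand-in below. It is a hypothetical class written for this illustration only; the names `OptionalMapper`, `parse`, and `serialize` and the choice of `ValueError` are assumptions and do not describe the code in #16.

```python
class OptionalMapper:
    """Hypothetical mapper whose two mapping functions are both optional."""

    def __init__(self, from_xml=None, to_xml=None):
        self._from_xml = from_xml
        self._to_xml = to_xml

    def parse(self, value):
        # Only complain when the missing function is actually needed.
        if self._from_xml is None:
            raise ValueError("No from_xml function was provided")
        return self._from_xml(value)

    def serialize(self, value):
        if self._to_xml is None:
            raise ValueError("No to_xml function was provided")
        return self._to_xml(value)


# A parse-only mapper: to_xml is simply left out.
parse_only = OptionalMapper(from_xml=lambda value: value.upper())
parsed = parse_only.parse("best")

try:
    parse_only.serialize("BEST")
    serialize_failed = False
except ValueError:
    serialize_failed = True  # to_xml is missing, so using it is an error
```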

Feel free to try out the changes if you'd like or comment on the PR, I'd love any feedback before I publish a new release to PyPi.

Regarding your point on whether these feature enhancements fit declxml, I think they do fit, although in the documentation I plan to mark the use of ValueMappers as an advanced use case and recommend explicitly against using it unless it is really needed. Usually, I do like to keep the XML processing at the edge of my application away from the core, but I think it makes sense to give people the flexibility to do what they want, but have the defaults be simple.

EmilStenstrom commented 6 years ago

No worries, I wasn't using the branch. I was just wondering what direction you were going. Really happy to see you going forward with ValueMappers!

Some random ideas:

Thanks for giving of your free time to random people on the internet! ;)

gatkin commented 6 years ago

No problem, it's great to get good feedback, and I am glad to know that this little library is useful to others.

I will definitely let you know when it is ready for testing before I publish to PyPi; it will be great to get feedback before publishing!

gatkin commented 6 years ago

#16 should be ready to test if you're interested in checking it out before I release a new version to PyPi. I plan on adding some more to the documentation in the next few days before I release a new version. I'd love to hear any feedback!

gatkin commented 6 years ago

I've merged in #16, but I am not quite ready to cut another release to PyPi just yet and commit to the additions to the declxml API. I am thinking that the ValueTransform concept can be generalized to be processing "hooks" that allow a user-provided callback to be invoked at any time during either parsing or serialization. These hooks are just arbitrary Python callable objects that can do whatever the caller wants them to do including value transformation, debugging/tracing (as you suggested), or validation.

I am now thinking that instead of receiving just the value that was parsed or is to be serialized, these hooks could also receive a ProcessorState object, which declxml uses internally to provide detailed error messages that include where the processor was in the XML document when it encountered an error. I think this could be particularly useful for debugging purposes or for performing validation. I'm thinking it might look something like this:

import declxml as xml

def from_xml(state, value):
    if value < 0:
        state.raise_error('Negative numbers not allowed. Received {}'.format(value))
    return value

processor = xml.dictionary('data', [
    xml.integer('non-negative-value', hooks=xml.Hooks(from_xml=from_xml)),
])

xml_string = """
<data>
    <non-negative-value>-37</non-negative-value>
</data>
"""

xml.parse_from_string(processor, xml_string)
Traceback (most recent call last):
XmlError: Negative numbers not allowed. Received -37 at data/non-negative-value

EmilStenstrom commented 6 years ago

Wow, that's even nicer!

Opens up for the creation of more sophisticated types, like in the example you suggested. Django (a web framework I'm very familiar with) has fields which encapsulate logic like the ones that become possible with this new API: https://docs.djangoproject.com/en/2.0/ref/models/fields/#positiveintegerfield

gatkin commented 6 years ago

Support for processing hooks is now available in the latest PyPi release. Thank you again for all your feedback!

EmilStenstrom commented 6 years ago

❤️

EmilStenstrom commented 6 years ago

Hi again! Sorry for the long feedback cycle here, but I finally got time to try hooks out for my project. It works really well in general, thanks!

One thing I'm missing is access to the current element's attributes. I have this XML:

<item>
    <name>7 Wonders</name>
    <status own="1" preordered="1" fortrade="0" ... />
</item>


And would like the data in this format:

{"name": "7 Wonders", "tags": ["own", "preordered"]}

The natural way to do this (in my mind...) was this:

xml.string("status", hooks=xml.Hooks(after_parse=status_hook))

But I don't get access to the attributes of the status tag in status_hook unfortunately.

The next thing I tried was to use hooks to filter out some values like this:

xml.dictionary("status", [
    xml.string(".", attribute="fortrade", hooks=xml.Hooks(after_parse=value_if_one)),
    xml.string(".", attribute="own", hooks=xml.Hooks(after_parse=value_if_one)),
    xml.string(".", attribute="preordered", hooks=xml.Hooks(after_parse=value_if_one)),
    xml.string(".", attribute="prevowned", hooks=xml.Hooks(after_parse=value_if_one)),
    xml.string(".", attribute="want", hooks=xml.Hooks(after_parse=value_if_one)),
    xml.string(".", attribute="wanttobuy", hooks=xml.Hooks(after_parse=value_if_one)),
    xml.string(".", attribute="wanttoplay", hooks=xml.Hooks(after_parse=value_if_one)),
    xml.string(".", attribute="wishlist", hooks=xml.Hooks(after_parse=value_if_one)),
], alias="tags"),

But again, I don't have access to the current attribute I'm parsing, so I can't decide to include the attribute or not like that either.

Again: Sorry for getting back so late with feedback. I understand if this is something I should do as a separate processing step after declxml, instead of using hooks.
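
One way a per-attribute hook could know which attribute it belongs to is to capture the name in a closure when the hook is created. The sketch below is plain Python and assumes hooks are called as `hook(state, value)`, as in the examples elsewhere in this thread; `name_if_one` is a hypothetical helper, and the calls are simulated by hand rather than made by declxml.

```python
def name_if_one(attribute_name):
    """Build a hook that returns the attribute's name when its value is '1'."""
    def hook(_state, value):
        # attribute_name is remembered from the enclosing call
        return attribute_name if value == "1" else None
    return hook


# Simulate the hook calls by hand; no processor state is needed here.
status = {"own": "1", "preordered": "1", "fortrade": "0"}
hook_results = {
    name: name_if_one(name)(None, value) for name, value in status.items()
}

# The per-attribute results still need flattening into a list of tags.
tags = [tag for tag in hook_results.values() if tag is not None]
```

Even with such a factory, each attribute would still produce its own entry, so a final step that collects the non-empty results is needed either way.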

gatkin commented 6 years ago

No worries, glad it has been able to work out for you for the most part.

I think in this use case, you want to have your hook run after you parse all of the status values so that you can collect them and flatten them down into the tags that are present. That means that the after_parse hook should be associated with the status dictionary processor because that will be the value that will allow you to see all of the tags at once. So I think you may want something like this:

import declxml as xml

data = """
<item>
    <name>7 Wonders</name>
    <status own="1" preordered="1" fortrade="0" want="0" />
</item>
"""

def after_status_hook(_, status):
    return [tag for tag, present in status.items() if present == '1']

processor = xml.dictionary('item', [
    xml.string('name'),
    xml.dictionary('status', [
        xml.string('.', attribute='own'),
        xml.string('.', attribute='preordered'),
        xml.string('.', attribute='fortrade'),
        xml.string('.', attribute='want'),
    ], alias='tags', hooks=xml.Hooks(after_parse=after_status_hook))
])

print(xml.parse_from_string(processor, data))
# {'name': '7 Wonders', 'tags': ['own', 'preordered']}

Does that work for you? I think I probably should work on making the documentation on how Hooks work more clear. I also think it may be a good idea to add some examples of how to handle common scenarios like this on the documentation page as well.

EmilStenstrom commented 6 years ago

Oh, that makes a lot of sense. I should of course let declxml first do as much work as possible, and then let my hook do post-parsing. Thanks again for clarifying!

ekardon commented 5 years ago

I don't see any mapper functionality in the latest version. Why is it closed?