xml_tree_walker should call begin()/end() as it enters/exits each node in the traversal

GoogleCodeExporter commented 9 years ago

I would like to start by saying thank you for pugixml -- it was very easy to 
implement and set up.  My problem comes from trying to do a quick expat 
replacement in some existing moderately complicated code, without having to 
revalidate all of the code that uses the data that is read from the XML file.  
The fastest way to do that appeared to be via traverse() and xml_tree_walker, 
but I have a problem.

begin()/end() ONLY fire for the node that you use to call traverse().  I don't 
understand why begin()/end() are used in such a limited way.  If you only 
wanted before/after processing on the root node, just do it before and after 
you call traverse().  

I need to have processing before and after each node is processed DURING 
traversal, because the old expat-based code is heavily dependent on processing 
when it enters an element and when it exits an element.  You can't do that 
inside the for_each() node processor, because begin() needs to happen before 
any processing is done, and end() needs to happen after all the children have 
been processed.  In fact, because of this processing pattern, my for_each() 
override is just "return true;" because there is nothing I need to do EXCEPT at 
entry and exit.

It only takes two lines of code inside xml_node::traverse to "fix" this and 
make it work the way I would have expected.  I have attached a file showing 
what I've done to my copy.  I understand that changing this now would impact 
your users, but it might be worth having a discussion to see if anybody uses 
begin()/end() or how other users feel about this.  If changing this behavior is 
a problem, I would be happy with adding two new virtuals: for_each_begin and 
for_each_end.

If I have misunderstood the usage model and there is an existing way to get 
some processing done before each traversed node starts and other processing 
done after the local descent is finished, then it might be helpful to 
demonstrate that in the documentation.  I don't see anywhere in traverse() 
where this could be the case, however.

Thank you,
Mac

Original issue reported on code.google.com by MacRei...@gmail.com on 8 Dec 2013 at 10:09

Attachments:

traverse.txt

GoogleCodeExporter commented 9 years ago

D'oh.  My "patch" was overly naive.  Some kind of stack is probably needed.  
This would be trivial in a recursive implementation of depth-first descent 
(relying on function stack), but I think it requires an explicit stack when the 
traversal is iterative.  I'm currently trying to see if I can make this work in 
the current implementation without a stack, but it will be more complicated 
than my initial naive approach.  (Sorry, I just assumed a depth first tree 
traversal would be done recursively...)

Original comment by MacRei...@gmail.com on 8 Dec 2013 at 10:45

GoogleCodeExporter commented 9 years ago

OK, the following appears to work, at least for my simple XML files/trees, and 
does not require an extra stack (so it should not put any extra memory burden 
on anyone, though it does potentially increase processing time, obviously).

Sorry if my editors have mixed spaces and tabs...

Original comment by MacRei...@gmail.com on 8 Dec 2013 at 11:43

Attachments:

traverse.txt

GoogleCodeExporter commented 9 years ago

I agree that the current treatment of begin/end does not make sense; the 
interface was copied verbatim from pugxml; I intended to fix it before 1.0 
release but ended up leaving it as is because it was frequently used. I'm not 
sure how often begin/end are being overridden but that's a pretty radical 
change; I would rather introduce new members, i.e. enter()/leave() or 
node_begin()/node_end() or for_each_begin()/for_each_end().

An alternative is to add a new version of the interface and mark the existing 
one deprecated; I'm still figuring out if preserving binary compatibility is a 
priority for next pugixml version (may be important since pugixml is being 
distributed as a package for some Linux distros; adding virtual functions to an 
existing interface is unlikely to preserve ABI).

Original comment by arseny.k...@gmail.com on 20 Dec 2013 at 5:37

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

Thank you for considering it.  As it turns out, I ended up not needing it, but 
if you'd been wanting to do it for awhile anyway, maybe my patch will help some.

(As for why I don't need this change anymore:  I was using the entry/exit 
callbacks to simulate expat callbacks to support someone else's code.  When I 
dug deeper into their code, I saw that they'd gone to great lengths in their 
expat callbacks to build up context so that they could work in terms of nodes.  
Which meant that I was tearing nodes apart to fit their interface so that they 
could build the nodes back up to do their processing...  So, when I rewrote 
their code to just work on the nodes during traversal, it got 20+ times faster, 
and works perfectly fine with pugixml as it is.  Maybe it's better to _not_ 
have the entry/exit code so that people (like me) are forced to consider 
rewriting other code to be more appropriate.  I don't know -- and I don't have 
a wide enough experience base to guess.)

Original comment by MacRei...@gmail.com on 20 Dec 2013 at 2:56

GoogleCodeExporter commented 9 years ago

Thank you for the feedback - I did not think about this in such terms. I 
assumed that enter/exit callbacks that you use are sufficiently complicated 
that you don't want to change their code, but I probably should have suggested 
using something other than traverse().

For pugixml 1.0 I actually considered either removing begin/end outright, or 
just removing the entire interface. It feels a bit out-of-place, and in general 
I find that visitor abstraction for an XML tree is not very valuable -- often 
inverting the control by i.e. recursively traversing yourself makes the code 
easier.

There is one thing that traverse() does that can make it useful - it performs a 
stackless traversal of the subtree, which is not trivial to replicate. The only 
other way of doing it without implementing it yourself is to use 
xml_node::find_node, but that's a bit of a hack, i.e.:

node.find_node([](xml_node n) -> bool { printf("%s\n", n.name()); return false; 
});

A few people have suggested introducing a subtree iterator that would allow you 
to control the traversal, which is likely a good direction to go here.

Original comment by arseny.k...@gmail.com on 20 Dec 2013 at 4:39

GoogleCodeExporter commented 9 years ago

The original callbacks do appear complicated -- or at least very large -- at 
first sight, hence my initial reluctance to modify them at all.  And the rest 
of the code has proven to be a bit fragile, adding to my concerns.  So I 
originally erred on the side of absolute minimal changes to existing code.  But 
bumping load times from ~3 seconds to up over a minute forces one to 
re-evaluate :)

I am still using the traverse() method because it lets me use an only slightly 
modified version of the original author's logic.  But as I made those 
modifications, I saw that it does have some of the "inverted logic" you 
mention, where the visitor code says "if my parent is X and my grandparent is 
Y, then do F with me."  That's hideously inefficient at a low level because of 
all the parent lookup and comparison.  I realized that I could probably replace 
it all with direct path lookups and a few sibling iterations, which should be 
noticeably faster.  It's all a matter of performance payoff versus the risk of 
introducing a subtle bug, and the current approach is "fast enough" while 
retaining all of the original logic.  But it is rather "bleah..." to look at.

As a side note: I do like the stackless traversal for known upper limits on 
memory usage when used on embedded devices, though you'll notice that I wasn't 
expecting it in my original patch, which was hopelessly naive and assumed a 
recursive process.

At this point, between my own limited experiences and the ones you've 
described, I'm not sure it's worth putting in my original request.  Your 
mention of just removing begin/end sounds reasonable, as their current 
implementations are of minimal use and trivial to do as simple calls before and 
after traverse().  (Dependent, of course, on your previously mentioned ABI 
concerns)  Since that would require some updates to the documentation, it might 
also be worth adding some quick discussion in the traverse() documentation of 
the (lack of) applicability of the visitor patterns in real-world XML.

Thank you for talking this out with me.

Original comment by MacRei...@gmail.com on 20 Dec 2013 at 5:11

GoogleCodeExporter commented 9 years ago

At this point it seems like the only good solution is leaving the original 
interface in-tact (compatibility issues prevent me from removing begin/end or 
changing the behavior). We'll see if this resurfaces in case anybody else needs 
such callbacks; closing as WontFix for now.

Original comment by arseny.k...@gmail.com on 27 Jan 2014 at 12:52

Changed state: WontFix

amorlzu / pugixml

xml_tree_walker should call begin()/end() as it enters/exits each node in the traversal #219