Closed faassen closed 1 year ago
I tried parsing some XML and comparing that with the original to accomplish a similar effect, but that won't work either as they don't compare equal: the document is different. If there was a way to compare nodes ignoring the document itself that could also work.
Hi! And thanks for lxml!
roxmltree has a completely different API, but uses sort of the same parsing logic as lxml. Basically, roxmltree produces the same node tree as lxml.
As for the end token, this feature was removed just 2 days ago, since it was just a waste of memory. You can try using v0.15 for now. The only reason we preserve node and attribute positions is so that the caller can have nicer error reporting. And for that you only need the start position.
What exactly are you trying to achieve?
Also note that node positions become kinda useless when XML has DTD entities.
I'm trying to be able to express things like this in tests:
assert_eq!(render(some_node), "<p>Hello!</p>")
I'm working on an xml diffing algorithm and there are a lot of ranges involved. Right now I test those ranges by comparing node numbers, but that's rather hard to read and verify compared to just seeing the XML itself. For testing purposes it's not a big problem as I can simply avoid DTD entities in tests, but of course that does make the solution less general.
Instead of an end token, the ability to serialize a subtree to XML would of course also solve my issue. Or alternatively, a way to compare nodes for equality without comparing their document as well, as Eq appears to be doing; I guess I can write that myself!
In this case I would suggest finding a different way. Preserving end positions is very expensive and useless in 99% of the cases.
As for XML diffing, I think you should go a layer down, to a streaming XML parser, like xmlparser used by roxmltree. It preserves everything.
And roxmltree doesn't have a goal of preserving such information. Technically, even attribute normalization will break your use cases.
Since you are doing diffing, how are you handling insignificant whitespace? Is this supposed to be structural diffing or not? If you want to ignore insignificant whitespace, then even an end position won't necessarily help you.
The use case has nothing to do with diffing and everything with being able to write nice unit tests where I can verify a node I found by some algorithm is indeed the one I expected.
I thought it might be possible to get the XML text out in order to write such tests, but if that's not feasible I can easily compare nodes - I realized I in fact have a tree hash already.
For this testing purpose the insignificant whitespace issue isn't a big deal. For the (structural) diffing use case it may be, but a preprocessor to remove insignificant whitespace should be enough to fix that.
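A whitespace preprocessor of the kind mentioned above could look roughly like the sketch below. It does not use roxmltree; `Item` is a hypothetical stand-in tree type, and `strip_blank_text` is a made-up helper name. It also assumes that every whitespace-only text node is insignificant, which is not true for all documents (mixed content or `xml:space="preserve"` would need more care).

```rust
/// Stand-in tree type for illustration; this is NOT roxmltree's Node.
#[derive(Debug, PartialEq)]
enum Item {
    Element { name: String, children: Vec<Item> },
    Text(String),
}

/// True if `s` consists only of XML whitespace (space, tab, CR, LF).
/// An empty string also counts as blank.
fn is_blank(s: &str) -> bool {
    s.chars().all(|c| matches!(c, ' ' | '\t' | '\r' | '\n'))
}

/// Returns a copy of `node` with whitespace-only text nodes removed.
/// Returns None when `node` itself is a whitespace-only text node.
fn strip_blank_text(node: &Item) -> Option<Item> {
    match node {
        Item::Text(t) if is_blank(t) => None,
        Item::Text(t) => Some(Item::Text(t.clone())),
        Item::Element { name, children } => Some(Item::Element {
            name: name.clone(),
            // filter_map drops the children for which we returned None.
            children: children.iter().filter_map(strip_blank_text).collect(),
        }),
    }
}
```

Running both the expected and the actual tree through such a pass before comparing them makes indentation differences irrelevant to the comparison.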
I am sorry I distracted the discussion by mentioning diffing. I understand you are trying to think along, but please do assume I have an understanding of the types of tools available in this domain, and have done research into the problem I am working on. Whether this library is right for the job is my concern; if it's not it's going to be because I may need mutability in the end, but I am trying to put that off.
You could try the previous version, which had end positions. If it works for you, I can backport it.
Since the end positions were removed for a reason I think I want to go with another approach. What I think I'll do is parse the XML I want to compare with in the test and construct a tree hash for it, and then use that for comparison.
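As a rough illustration of the tree-hash idea, here is a minimal sketch. It does not use roxmltree; `Tree`, `el`, `text`, and `tree_hash` are hypothetical names, and the hash is simply the standard library's derived `Hash` fed through `DefaultHasher`. Attribute order is treated as significant here, which a real implementation might not want.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in tree type for illustration; this is NOT roxmltree's Node.
#[derive(Hash, PartialEq, Eq, Debug)]
enum Tree {
    Element {
        name: String,
        attrs: Vec<(String, String)>,
        children: Vec<Tree>,
    },
    Text(String),
}

/// Hypothetical helper: builds an element without attributes.
fn el(name: &str, children: Vec<Tree>) -> Tree {
    Tree::Element {
        name: name.to_string(),
        attrs: Vec::new(),
        children,
    }
}

/// Hypothetical helper: builds a text node.
fn text(s: &str) -> Tree {
    Tree::Text(s.to_string())
}

/// Hashes the structure and content of a subtree. Which document the
/// subtree came from plays no role, so nodes from two separately parsed
/// documents can be compared.
fn tree_hash(node: &Tree) -> u64 {
    let mut hasher = DefaultHasher::new();
    node.hash(&mut hasher);
    hasher.finish()
}
```

In a test, the node found by the algorithm and the node parsed from the expected XML would each be converted to such a structure, and the assertion would compare `tree_hash` values (or the trees themselves via `PartialEq`).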
I also have a use-case for tracking node ranges: I would like to rewrite an XML document by doing some selective copying to build a new one. I understand that roxmltree is read-only, but for my use case this will work without the need for more complex DOM manipulation. As far as I understand, this doesn't need to impact memory use, looking at the next_subtree position should be sufficient.
As far as I understand, this doesn't need to impact memory use, looking at the next_subtree position should be sufficient.
Are you sure? The only way to get the actual node end is to look at the end position of the close token, which is not preserved at the moment. With next_subtree you can only get the start of the next node.
And once again, I'm not sure what exactly you are trying to do, but I'm sure it will not work. Node ranges are a bit abstract in XML. That's why end positions were removed in the first place. They're kinda meaningless.
Just for the opportunity of improving my own understanding, what's an example situation in which the idea of @g2p, i.e. using the start of next_subtree, will not work (perhaps with a little trimming)?
The problem is that closing tags aren't represented. So if the next subtree is not a sibling, you're going to get closing tags from a parent element.
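To make that failure mode concrete, here is a small sketch that works on plain byte offsets into the input text rather than on roxmltree itself. The helper `slice_until_next_start` is hypothetical; it stands in for "take the text from a node's start position up to the start position of the next node in document order".

```rust
/// Hypothetical helper: returns the input text from the start of `node_src`
/// up to the start of `next_src`, mimicking "node start .. next node start".
/// Returns None if `node_src` is not found, or if `next_src` does not occur
/// at or after it.
fn slice_until_next_start<'a>(
    input: &'a str,
    node_src: &str,
    next_src: &str,
) -> Option<&'a str> {
    let start = input.find(node_src)?;
    // Search for the next node only after the current node's start.
    let end = start + input[start..].find(next_src)?;
    Some(&input[start..end])
}
```

With the input `<a><b/><c/></a>`, slicing from `<b/>` to its sibling `<c/>` yields `<b/>`, which is correct. With `<container><a><b/></a><x/></container>`, the node after `<b/>` in document order is `<x/>`, which is not a sibling, and the slice comes out as `<b/></a>`: the parent's closing tag leaks into the result.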
I'll close this issue for now as I will look for other solutions to my comparison problem; thanks for the help! Still happily using roxmltree.
I closed this earlier today but I just developed new use cases for this, so sorry for closing earlier:
The main algorithm I'm building can be fully done with immutable XML and roxmltree's ability to look up an XML node quickly by node id is extremely useful.
But I need several supplementary operations that involve generating XML and XML DOM mutations. Clearly that's out of scope for roxmltree.
But it would really help a lot in interfacing with mutable libraries if I could easily serialize a roxmltree node found by my algorithm to a string with an XML fragment.
Would you be open to a contribution that introduces a trait that can do this serialization?
(there is an alternative approach but it's potentially more expensive: it would involve reparsing the original XML with a library that does support serialization, and implementing the ability to get a node by roxmltree node id on top of it)
Sorry, but roxmltree is read only. It doesn't generate/write XML.
Ah, sorry, I interpreted "read only" to mean "cannot modify the DOM tree", not "does not offer a representation of the XML it parsed".
I will give it one more try:
So let's imagine someone wants to implement an xpath tool on top of roxmltree, which supports commands along these lines:
$ xmllint --xpath "//title[@lang='fr']" books.xml
<title lang="fr">The Little Prince</title>
This tool does not need to modify the DOM tree at all, only read it. But it does need to serialize the XML that comes out of the xpath expression for CLI purposes.
I realize that xpath is stated as not a goal of roxmltree, but do you think that roxmltree should not offer foundational support at all? The readme says:
"Why read-only? Because in some cases all you need is to retrieve some data from an XML document. And for such cases, we can make a lot of optimizations."
Isn't retrieving some data what this tool is doing? Yes, the data is in the form of an XML fragment, but this is after all an XML library.
If this remains a non-goal I won't open a PR and will ship the serialization code I have as a separate library so it's useful to others with the same problem. I just figure it would be useful as a core feature.
roxmltree is designed to be as simple as possible. So my answer is still no. It will not have any features except the currently present ones.
The only solution I could suggest is forking or writing a crate on top of it.
I think writing code to serialize roxmltree::Node into a valid XML representation should be straightforward on top of this crate, as long as it does not have to be exactly the original XML which was parsed to create the node.
(Also with Rust's packaging, there is little reason to not have a separate crate besides discoverability.)
Thanks, I will package my code up in a separate crate then.
@faassen I would possibly be interested in contributing to a crate that does this. In the past when I needed this, I just converted the roxmltree fragment to a series of xml-rs XmlEvent values, but a more tailored approach could be quite interesting.
Cool! I have gone for a simple recursive approach that adds strings to a buffer as opposed to using an event emitter. I will try to put it up next week so we can compare notes.
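For readers wondering what a recursive, buffer-based serializer of this kind might look like, here is a minimal sketch. It is not the code discussed in this thread and does not use roxmltree; `Xml` is a hypothetical stand-in tree type, and `escape`, `serialize`, and `render` are illustrative names. Namespaces, comments, and processing instructions are left out, and the output is valid XML but not necessarily byte-identical to whatever was originally parsed.

```rust
/// Stand-in tree type for illustration; this is NOT roxmltree's Node.
enum Xml {
    Element {
        name: String,
        attrs: Vec<(String, String)>,
        children: Vec<Xml>,
    },
    Text(String),
}

/// Escapes characters that must not appear literally in XML text or in
/// double-quoted attribute values.
fn escape(s: &str, out: &mut String) {
    for c in s.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            _ => out.push(c),
        }
    }
}

/// Recursively appends the serialized form of `node` to `out`.
fn serialize(node: &Xml, out: &mut String) {
    match node {
        Xml::Text(t) => escape(t, out),
        Xml::Element { name, attrs, children } => {
            out.push('<');
            out.push_str(name);
            for (key, value) in attrs {
                out.push(' ');
                out.push_str(key);
                out.push_str("=\"");
                escape(value, out);
                out.push('"');
            }
            if children.is_empty() {
                // Childless elements are written in self-closing form.
                out.push_str("/>");
            } else {
                out.push('>');
                for child in children {
                    serialize(child, out);
                }
                out.push_str("</");
                out.push_str(name);
                out.push('>');
            }
        }
    }
}

/// Convenience wrapper that serializes into a fresh String.
fn render(node: &Xml) -> String {
    let mut out = String::new();
    serialize(node, &mut out);
    out
}
```

A `p` element holding the text `Hello!` renders as `<p>Hello!</p>`, which is the kind of assertion the tests earlier in the thread were after.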
Since I needed an XML library that allowed manipulation and serialization I built a new one on top of xmlparser. That's no slight on roxmltree; it's an excellent library if you only need to read XML. Here's the library:
https://github.com/faassen/xot
So closing this issue again.
Thanks for this library! I wrote lxml a long time ago, though the API used there is taken from ElementTree.
I'd like to be able to display the XML representation of a node in the tree. I realize I can't change the XML but this is still useful for debugging and in tests.
I can get quite far with node.position(), combined with doc.input_text(). That gets me the start position. But I don't know how to get the end position of the node. If there is a next sibling I can use the start position of that, but not all nodes have such a position. The position of the next descendant won't work as it potentially includes the end tag of an outer node, for instance if I have <container><a><b/></a><x/></container>, and I want to see <b/> then the position of the next descendant is <x/> so therefore the output would be <b/></a>.
Would it be possible to maintain the end position of a node as well?