RazrFalcon / roxmltree

Represent an XML document as a read-only tree.
Apache License 2.0

getting an XML representation of a node #83

Closed faassen closed 1 year ago

faassen commented 1 year ago

Thanks for this library! I wrote lxml a long time ago, though the API used there is taken from ElementTree.

I'd like to be able to display the XML representation of a node in the tree. I realize I can't change the XML but this is still useful for debugging and in tests.

I can get quite far with node.position(), combined with doc.input_text(). That gets me the start position. But I don't know how to get the end position of the node. If there is a next sibling I can use its start position, but not all nodes have a next sibling. The position of the next descendant won't work, as the span may then include the end tag of an outer node. For instance, if I have <container><a><b/></a><x/></container> and I want to see <b/>, the next node in document order is <x/>, so the output would be <b/></a>.
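A minimal sketch of the problem described above, using only the standard library and plain byte offsets rather than roxmltree's API. The helper names `slice_until_next` and `offset_of` are hypothetical and exist only for illustration; the offsets stand in for whatever start positions the parser reports.

```rust
/// Slice `input` from `start` up to `next_start`, where `next_start` is the
/// start offset of whatever node follows in document order.
/// Both offsets are byte positions into `input`.
fn slice_until_next(input: &str, start: usize, next_start: usize) -> &str {
    &input[start..next_start]
}

/// Hypothetical helper: locate a tag in the example input by text search.
/// A real caller would take this offset from the parser instead.
fn offset_of(input: &str, needle: &str) -> usize {
    input.find(needle).expect("needle must occur in input")
}
```

With the input `<container><a><b/></a><x/></container>`, slicing from the start of `<b/>` to the start of `<x/>` yields `<b/></a>`: the closing tag of the parent leaks into the result, because only start positions are known.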

Would it be possible to maintain the end position of a node as well?

faassen commented 1 year ago

I tried parsing some XML and comparing that with the original to accomplish a similar effect, but that won't work either as they don't compare equal: the document is different. If there was a way to compare nodes ignoring the document itself that could also work.

RazrFalcon commented 1 year ago

Hi! And thanks for lxml! roxmltree has a completely different API, but uses sort of the same parsing logic as lxml. Basically, roxmltree produces the same nodes tree as lxml.

As for the end token, this feature was removed just 2 days ago, since it was just a waste of memory. You can try using v0.15 for now. The only reason we preserve node and attribute positions is so that the caller can have nicer error reporting. And for that you only need the start position.

What exactly are you trying to achieve?

Also note that node positions become kinda useless when XML has DTD entities.

faassen commented 1 year ago

I'm trying to be able to express things like this in tests:

assert_eq!(render(some_node), "<p>Hello!</p>")

I'm working on an xml diffing algorithm and there are a lot of ranges involved. Right now I test those ranges by comparing node numbers, but that's rather hard to read and verify compared to just seeing the XML itself. For testing purposes it's not a big problem as I can simply avoid DTD entities in tests, but of course that does make the solution less general.

Instead of an end token of course the ability to serialize a subtree to XML would also solve my issue. Or alternatively, a way to compare nodes for equality without comparing their document as well as Eq appears to be doing; I guess I can write that myself!

RazrFalcon commented 1 year ago

In this case I would suggest finding a different way. Preserving end positions is very expensive and useless in 99% of the cases.

As for XML diffing, I think you should go a layer down, to a streaming XML parser, like xmlparser, which roxmltree uses. It preserves everything. roxmltree doesn't have a goal of preserving such information. Technically, even attribute normalization will break your use cases.

tomjw64 commented 1 year ago

Since you are doing diffing, how are you handling insignificant whitespace? Is this supposed to be structural diffing or not? If you want to ignore insignificant whitespace, then even an end position won't necessarily help you.

faassen commented 1 year ago

The use case has nothing to do with diffing and everything with being able to write nice unit tests where I can verify a node I found by some algorithm is indeed the one I expected.

I thought it might be possible to get the XML text out in order to write such tests, but if that's not feasible I can easily compare nodes - I realized I in fact have a tree hash already.

For this testing purpose the insignificant whitespace issue isn't a big deal. For the (structural) diffing use case it may be, but a preprocessor to remove insignificant whitespace should be enough to fix that.
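A rough sketch of such a preprocessor, on a hypothetical owned tree type (`Xml` is not a roxmltree type). It assumes that every text node consisting only of XML whitespace (space, tab, CR, LF) is insignificant, which is a simplification: without a DTD or schema there is no general way to know.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Xml {
    Element { name: String, children: Vec<Xml> },
    Text(String),
}

/// True if `s` contains only XML whitespace characters.
fn is_xml_whitespace(s: &str) -> bool {
    s.chars().all(|c| matches!(c, ' ' | '\t' | '\r' | '\n'))
}

/// Recursively drop whitespace-only text nodes from element children.
/// Text that contains any other character is kept unchanged.
fn strip_whitespace_only_text(node: Xml) -> Xml {
    match node {
        Xml::Element { name, children } => Xml::Element {
            name,
            children: children
                .into_iter()
                .filter(|c| !matches!(c, Xml::Text(t) if is_xml_whitespace(t)))
                .map(strip_whitespace_only_text)
                .collect(),
        },
        text => text,
    }
}
```

Mixed content such as `<p>Hello <b>world</b></p>` keeps its text, since `"Hello "` is not whitespace-only; only indentation between elements is removed.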

I am sorry I distracted the discussion by mentioning diffing. I understand you are trying to think along, but please do assume I have an understanding of the types of tools available in this domain, and have done research into the problem I am working on. Whether this library is right for the job is my concern; if it's not it's going to be because I may need mutability in the end, but I am trying to put that off.

RazrFalcon commented 1 year ago

You could try the previous version, which had end positions. If it works for you, I can backport it.

faassen commented 1 year ago

Since the end positions were removed for a reason I think I want to go with another approach. What I think I'll do is parse the XML I want to compare with in the test and construct a tree hash for it, and then use that for comparison.
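One way the tree-hash comparison could look, sketched on a hypothetical owned tree type rather than on roxmltree nodes (`Xml`, `elem`, `text`, and `tree_hash` are illustrative names, not part of any library mentioned here). The hash covers only names, attributes, and children, so two trees built from different documents compare equal when their structure matches. Attribute order is significant in this sketch.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Xml {
    Element {
        name: String,
        attrs: Vec<(String, String)>,
        children: Vec<Xml>,
    },
    Text(String),
}

/// Build an element without attributes.
fn elem(name: &str, children: Vec<Xml>) -> Xml {
    Xml::Element {
        name: name.to_string(),
        attrs: Vec::new(),
        children,
    }
}

/// Build a text node.
fn text(s: &str) -> Xml {
    Xml::Text(s.to_string())
}

/// Hash the whole subtree via the derived `Hash` implementation.
/// No document identity or source position takes part in the hash.
fn tree_hash(node: &Xml) -> u64 {
    let mut hasher = DefaultHasher::new();
    node.hash(&mut hasher);
    hasher.finish()
}
```

In a test, the expected fragment would be parsed into the same tree type and the two hashes compared. Equal hashes do not strictly prove equal trees, so a direct `==` check is the safer assertion when the trees are available.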

g2p commented 1 year ago

I also have a use-case for tracking node ranges: I would like to rewrite an XML document by doing some selective copying to build a new one. I understand that roxmltree is read-only, but for my use case this will work without the need for more complex DOM manipulation. As far as I understand, this doesn't need to impact memory use, looking at the next_subtree position should be sufficient.

RazrFalcon commented 1 year ago

> As far as I understand, this doesn't need to impact memory use, looking at the next_subtree position should be sufficient.

Are you sure? The only way to get the actual node end is to look at the end position of the closing token, which is not preserved at the moment. With next_subtree you can only get the start of the next node.

And once again, I'm not sure what exactly you are trying to do, but I'm sure it will not work. Node ranges are a bit abstract in XML. That's why end positions were removed in the first place. They're kinda meaningless.

tomjw64 commented 1 year ago

Just for the opportunity of improving my own understanding, what's an example situation in which the idea of @g2p, i.e. using the start of next_subtree, will not work (perhaps with a little trimming)?

g2p commented 1 year ago

The problem is that closing tags aren't represented. So if the next subtree is not a sibling, you're going to get closing tags from a parent element.

faassen commented 1 year ago

I'll close this issue for now as I will look for other solutions to my comparison problem; thanks for the help! Still happily using roxmltree

faassen commented 1 year ago

I closed this earlier today but I just developed new use cases for this, so sorry for closing earlier:

The main algorithm I'm building can be fully done with immutable XML and roxmltree's ability to look up an XML node quickly by node id is extremely useful.

But I need several supplementary operations that involve generating XML and XML DOM mutations. Clearly that's out of scope for roxmltree.

But it would really help a lot in interfacing with mutable libraries if I could easily serialize a roxmltree node found by my algorithm to a string with an XML fragment.

Would you be open to a contribution that introduces a trait that can do this serialization?

(there is an alternative approach but it's potentially more expensive: it would involve reparsing the original XML with a library that does support serialization, and implementing the ability to get a node by roxmltree node id on top of it)

RazrFalcon commented 1 year ago

Sorry, but roxmltree is read only. It doesn't generate/write XML.

faassen commented 1 year ago

Ah, sorry, I interpreted "read only" to mean "cannot modify the DOM tree", not "does not offer a representation of the XML it parsed".

I will give it one more try:

So let's imagine someone wants to implement an xpath tool on top of roxmltree, which supports commands along these lines:

$ xmllint --xpath "//title[@lang='fr']" books.xml
<title lang="fr">The Little Prince</title>

This tool does not need to modify the DOM tree at all, only read it. But it does need to serialize the XML that comes out of the xpath expression for CLI purposes.

I realize that xpath is stated as not a goal of roxmltree, but do you think that roxmltree should not offer foundational support at all? The readme says:

"Why read-only? Because in some cases all you need is to retrieve some data from an XML document. And for such cases, we can make a lot of optimizations."

Isn't retrieving some data what this tool is doing? Yes, the data is in the form of an XML fragment, but this is after all an XML library.

If this remains a non-goal I won't open a PR and will ship the serialization code I have as a separate library so it's useful to others with the same problem. I just figure it would be useful as a core feature.

RazrFalcon commented 1 year ago

roxmltree is designed to be as simple as possible. So my answer is still no. It will not have any features except the currently present ones.

The only solution I could suggest is forking or writing a crate on top of it.

adamreichold commented 1 year ago

I think writing code to serialize roxmltree::Node into a valid XML representation should be straightforward on top of this crate, as long as it does not have to be exactly the original XML which was parsed to create the node.

adamreichold commented 1 year ago

(Also with Rust's packaging, there is little reason to not have a separate crate besides discoverability.)

faassen commented 1 year ago

Thanks, I will package my code up in a separate crate then.

tomjw64 commented 1 year ago

@faassen I would possibly be interested in contributing to a crate that does this (in the past when I needed this, I just converted the roxmltree fragment to a series of xml-rs XmlEvents, but a more tailored approach could be quite interesting)

faassen commented 1 year ago

Cool! I have gone for a simple recursive approach that adds strings to a buffer as opposed to using an event emitter. I will try to put it up next week so we can compare notes.
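A minimal sketch of what a recursive, buffer-based serializer can look like. This is not the code referred to in the thread; it runs on a hypothetical owned tree type (`Xml`) instead of roxmltree nodes, ignores namespaces, comments, and processing instructions, and produces valid XML that is not necessarily byte-identical to the parsed input.

```rust
enum Xml {
    Element {
        name: String,
        attrs: Vec<(String, String)>,
        children: Vec<Xml>,
    },
    Text(String),
}

/// Append `s` to `out`, escaping markup characters.
/// Double quotes are escaped only inside attribute values.
fn escape(s: &str, in_attr: bool, out: &mut String) {
    for c in s.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' if in_attr => out.push_str("&quot;"),
            _ => out.push(c),
        }
    }
}

/// Recursively append the XML text for `node` to `out`.
fn write_node(node: &Xml, out: &mut String) {
    match node {
        Xml::Text(t) => escape(t, false, out),
        Xml::Element { name, attrs, children } => {
            out.push('<');
            out.push_str(name);
            for (key, value) in attrs {
                out.push(' ');
                out.push_str(key);
                out.push_str("=\"");
                escape(value, true, out);
                out.push('"');
            }
            if children.is_empty() {
                // Childless elements are written in self-closing form.
                out.push_str("/>");
                return;
            }
            out.push('>');
            for child in children {
                write_node(child, out);
            }
            out.push_str("</");
            out.push_str(name);
            out.push('>');
        }
    }
}

/// Convenience wrapper that returns a fresh string.
fn render(node: &Xml) -> String {
    let mut out = String::new();
    write_node(node, &mut out);
    out
}
```

Writing into one shared `String` avoids allocating an intermediate string per node. Because childless elements are always written self-closing, an input of `<b></b>` comes back as `<b/>`, which illustrates the "not exactly the original XML" caveat raised earlier in the thread.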

faassen commented 1 year ago

Since I needed an XML library that allowed manipulation and serialization I built a new one on top of xmlparser. That's no slight on roxmltree; it's an excellent library if you only need to read XML. Here's the library:

https://github.com/faassen/xot

So closing this issue again.