ballsteve / xrust

XPath, XQuery, and XSLT for Rust
Apache License 2.0
84 stars 7 forks source link

Order of attributes in to_xml output is unstable #98

Open mickdekkers opened 1 month ago

mickdekkers commented 1 month ago

Hi,

I've just set up snapshot testing in my project that uses xrust, but unfortunately it would seem that to_xml is currently unsuitable for this purpose, because the serialized order of attributes is not consistent between different parse + to_xml calls:

image

I've set up a repo with a minimal reproducible example: https://github.com/mickdekkers/xrust-serialization-attribute-order-instability-issue-repro

Ideally, the order of attributes in the source XML would be maintained unless the Nodes are explicitly mutated to change this order.

The issue occurs with the code on main as well as dev. The repro points to dev.

Would once again appreciate it if you could take a look! 😄

Many thanks!

ballsteve commented 1 month ago

It sounds like you're after a Canonical XML output. Daniel has added in some support for that, but I think a good design idea would be to make it an option in the OutputDefinition; that way you would use to_xml_with_options(), just like indenting.

mickdekkers commented 1 month ago

Hi @ballsteve, thanks for the suggestion! Canonical XML does seem to cover most1 of what I'm after 😄

However, I just tried replacing line 13 of main.rs to get Canonical XML output:

- let xml_output = doc.to_xml();
+ let xml_output = doc.get_canonical().unwrap().to_xml();

and the attribute order of the serialized XML string is unfortunately still changing:

image

another run output:

image

another:

image

Do you or @Devasta perhaps know what might be causing this? As far as I can tell based on a few dozen runs, it's only the order of attributes that's unstable, but not e.g. the order of element nodes within a parent node.

1 I'm working on an RFC 5261 implementation as one of the libraries used in my project. Implementing full compliance around e.g. whitespace handling on <remove> would be nice, but is a non-goal for now

ballsteve commented 1 month ago

I don't know what's causing that. There are two issues to resolve here:

  1. Make sure that the canonical output conforms to the W3C spec, and
  2. Change the design to include canonical output in the OutputDefinition struct.

Both of those are on the development list. We'll get to them ASAP.

Devasta commented 1 month ago

So I think the issue is with the storage of attributes in Smite.

https://github.com/ballsteve/xrust/blob/2449e4dfe85d613d8d096fed21f7c1e2f768e0f9/src/trees/smite.rs#L73

Hashmaps in Rust do not preserve order.

XML does not care about attribute order, however the XDM Spec says "The order of Attribute Nodes is stable but implementation dependent."

I'll look into using a BTree instead, see if there are any issues.

@mickdekkers Would it be ok if only the output was stable? I don't think we can or want to care about attribute order on the input document.

Devasta commented 1 month ago

https://github.com/ballsteve/xrust/tree/dev_issue_98

Made a change here, let me know if this is sufficient?

mickdekkers commented 1 month ago

@Devasta that's perfect, thanks a lot! I just tried it and the order in to_xml is indeed stable after that change, with my snapshot tests now passing consistently 😄

Would it be ok if only the output was stable? I don't think we can or want to care about attribute order on the input document.

For sure, I don't really care about the attribute order on the input document either. A stable serialized output regardless of order is all I need for testing purposes 👍

XML does not care about attribute order, however the XDM Spec says "The order of Attribute Nodes is stable but implementation dependent."

Good find! I hadn't read the full spec but this seems sensible. I just noticed that the dm:namespace-nodes section has the same note about order stability: "The order of Namespace Nodes is stable but implementation dependent." It looks like that's currently also backed by a HashMap in xrust. I haven't tested it to verify, but I imagine that would make iterating namespaces unstable too. (The documents I'm working with only use a default namespace, so this doesn't affect me, but I thought I'd mention it 🙂)

@ballsteve I'll no longer need Canonical XML for my project after Daniel's change, so please consider my requests in that area retracted, but FWIW those sound like good enhancements 👍

Thanks again!

Cheers, Mick

Devasta commented 1 month ago

Good point, I'll take a look at namespace nodes tonight and see what needs doing there.

Also very interested to hear about your project implementing RFC5261, do share it with us if you can. 😄

Devasta commented 1 month ago

Right, switching the namespace nodes to a BTreeMap from Hashmap is trivial, however I have found another issue with serialization that I have opened as a separate issue https://github.com/ballsteve/xrust/issues/99

Will try to fix that, then I will push both to our dev branch.