ballsteve / xrust

XPath, XQuery, and XSLT for Rust
Apache License 2.0

XML Namespaces #78

Open ballsteve opened 4 months ago

ballsteve commented 4 months ago

For certain transformations, in particular the system-property XSLT function, the transform engine needs to be able to expand QNames, e.g. xsl:version becomes Q{http://www.w3.org/1999/XSL/Transform}version. To do that, the transform engine needs a copy of the namespaces declared in the stylesheet (not the source document). Since the transform engine is separate from the stylesheet, and can even be created without one, I've added a mechanism to the parser that returns a copy of the namespace declarations along with the constructed tree - see parse_with_ns (branch feb2-2024).
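As a rough sketch of the expansion step (names here are hypothetical, not xrust's actual API), a prefixed name is split and looked up in a map of in-scope declarations:

```rust
use std::collections::HashMap;

// Hypothetical helper: expand a prefixed name like "xsl:version" into
// Q{uri}local form using a map of in-scope namespace declarations.
fn expand_qname(name: &str, ns: &HashMap<&str, &str>) -> Option<String> {
    match name.split_once(':') {
        // Prefixed: look the prefix up; None if it is not declared.
        Some((prefix, local)) => ns
            .get(prefix)
            .map(|uri| format!("Q{{{}}}{}", uri, local)),
        // No prefix: nothing to expand.
        None => Some(name.to_string()),
    }
}

fn main() {
    let mut ns = HashMap::new();
    ns.insert("xsl", "http://www.w3.org/1999/XSL/Transform");
    assert_eq!(
        expand_qname("xsl:version", &ns).unwrap(),
        "Q{http://www.w3.org/1999/XSL/Transform}version"
    );
    assert_eq!(expand_qname("unknown:x", &ns), None);
}
```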

This works, except that in the parser/xml/element module, when the code checks the namespace of the element it pops the namespace vector. This means that the namespace declaration is lost when the state is returned to the caller.

I've changed the code so that the namespace vector's last element is cloned, rather than popped. However, I don't know what the consequence of this design will be. I need advice on whether this is a safe change.

There are 4 namespace tests that are failing, but they failed before this change so we haven't gone backwards.

Devasta commented 1 month ago

Hi Steve,

So, I have switched this back; the issue is that the vector has hashmaps added and removed to match the current depth in the tree.

If we just clone the last hashmap instead of popping, then the document below parses successfully, even though the second child's prefix is out of scope:

<doc>
    <a:child1 xmlns:a="namespace1"/>
    <a:child2/><!-- Should Error -->
</doc>

We now have namespace declarations stored as nodes on the source document, but I can add QNames into some other struct if you need?
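To make the scoping behaviour concrete, here is a minimal sketch (hypothetical names, not xrust's internals) of a push/pop namespace stack; cloning the top instead of popping would leak declarations into sibling scopes, which is exactly how the document above slips through:

```rust
use std::collections::HashMap;

// One hashmap of declarations per open element.
struct NsStack(Vec<HashMap<String, String>>);

impl NsStack {
    // Push a scope on element start, pop it on element end.
    fn enter(&mut self, decls: HashMap<String, String>) {
        self.0.push(decls);
    }
    fn leave(&mut self) {
        self.0.pop();
    }
    // Search from the innermost scope outwards.
    fn resolve(&self, prefix: &str) -> Option<&str> {
        self.0
            .iter()
            .rev()
            .find_map(|m| m.get(prefix).map(|s| s.as_str()))
    }
}

fn main() {
    let mut stack = NsStack(vec![HashMap::new()]); // <doc>
    let mut decls = HashMap::new();
    decls.insert("a".to_string(), "namespace1".to_string());
    stack.enter(decls); // <a:child1 xmlns:a="namespace1"/>
    assert_eq!(stack.resolve("a"), Some("namespace1"));
    stack.leave(); // child1 closed; back in <doc>
    // <a:child2/> must now fail to resolve "a" and be reported as an error.
    assert_eq!(stack.resolve("a"), None);
}
```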

Devasta commented 1 month ago

Right, I can see why this is needed.

I know it's very unlikely someone would do it, but if someone had a document like:

<a:doc xmlns:a="Namespace1">
    <a:child xmlns:a="Namespace2">
        <a:grandchild xmlns:a="Namespace1"/>
    </a:child>
</a:doc>

We wouldn't have all the namespaces together when parsing finishes. I think we're going to want to climb up the parent nodes and pull the namespaces from there.
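A sketch of that ancestor climb, under the assumption (hypothetical struct, not xrust's node type) that each node stores the declarations made on it and a parent link; inner declarations shadow outer ones for the same prefix:

```rust
use std::collections::HashMap;

// Hypothetical node: declarations made on this element, plus a parent link.
struct Node {
    decls: HashMap<String, String>,
    parent: Option<Box<Node>>,
}

// Walk up the ancestors, collecting declarations. The nearest declaration
// of a prefix wins, so only insert when the prefix is not yet shadowed.
fn in_scope_namespaces(node: &Node) -> HashMap<String, String> {
    let mut result = HashMap::new();
    let mut cur = Some(node);
    while let Some(n) = cur {
        for (prefix, uri) in &n.decls {
            result.entry(prefix.clone()).or_insert_with(|| uri.clone());
        }
        cur = n.parent.as_deref();
    }
    result
}

fn main() {
    let doc = Node {
        decls: HashMap::from([("a".to_string(), "Namespace1".to_string())]),
        parent: None,
    };
    let child = Node {
        decls: HashMap::from([("a".to_string(), "Namespace2".to_string())]),
        parent: Some(Box::new(doc)),
    };
    // On <a:child>, the inner redeclaration of "a" shadows the outer one.
    assert_eq!(
        in_scope_namespaces(&child).get("a").map(String::as_str),
        Some("Namespace2")
    );
}
```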

ballsteve commented 1 month ago

XML Namespaces are annoying, since you can get pathologically worst case scenarios like that :-(

Climbing the ancestor nodes looks like the way to go.

Devasta commented 3 weeks ago

Issues #98 and #99 are both related to this; if we can get this one resolved, fixing both should be straightforward.

Devasta commented 3 weeks ago

Seems this is causing some performance issues on large documents.

If you have a document like:

<doc>
    <child>a</child>
    <child>a</child>
    <child>a</child>
    <!-- 1000 rows -->
    <child>a</child>
    <child>a</child>
</doc>

Then the namespace vector on the parser state keeps getting bigger and bigger, and is constantly being cloned as the parser tries different functions.

ballsteve commented 3 weeks ago

Hi Daniel,

Sounds like we’ll have to re-think the design for namespaces.

Off the top-of-my-head, perhaps wrap the vector in an Rc so that cloning is cheap? Since the vector must be mutable, use interior mutability?

This is to keep track of in-scope namespaces, so perhaps a different design altogether?
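The Rc idea can be sketched like this (a minimal illustration, not the eventual design): cloning the state becomes a reference-count bump, and Rc::make_mut gives copy-on-write mutation, cloning the map only when it is actually shared.

```rust
use std::collections::HashMap;
use std::rc::Rc;

type NsMap = Rc<HashMap<String, String>>;

// Declare a namespace with copy-on-write semantics: Rc::make_mut clones
// the underlying map only if it is shared with an older state.
fn declare(ns: &mut NsMap, prefix: &str, uri: &str) {
    Rc::make_mut(ns).insert(prefix.to_string(), uri.to_string());
}

fn main() {
    let mut ns: NsMap = Rc::new(HashMap::new());
    let saved = Rc::clone(&ns); // cheap: a pointer copy, not a deep clone
    declare(&mut ns, "a", "namespace1");
    // The saved (shared) state is unaffected by the later mutation.
    assert!(saved.get("a").is_none());
    assert_eq!(ns.get("a").map(String::as_str), Some("namespace1"));
}
```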

Cheers, Steve


ballsteve commented 2 weeks ago

Hi Daniel,

I’ve been giving namespaces some thought over the weekend. We need to rationalise and consolidate what have so far been separate efforts to deal with namespaces.

For that purpose I’ve roughed out a design that deals with both namespaces and QualifiedNames. See attached PDF (which will be included in the xrust documentation).

In addition, I’m also thinking that QualifiedNames might be better treated as interned strings. More on that later.

Cheers, Steve


ballsteve commented 2 weeks ago

So we're going to have to redesign and reimplement the whole QualifiedName & XML Namespace subsystem. Design documentation has been added in the doc folder (ns_bench branch). This will take a while to sort out...

Devasta commented 2 weeks ago

Hi Steve,

Apologies, just checking on this one: assuming we implement it, how does the transformation step then pick up the namespace information for things like system-property and custom functions? We've got a system in place now that'll allow us to pull together a list of namespaces from a node by traversing its parents; couldn't we pull them from that?

The vector on the parser state can be removed to improve performance, but I'm not sure that requires an overhaul this big in scope.

(This is more a question about how construction of the transformation steps works, I've not paid nearly as much attention to that as I should have!)

ballsteve commented 2 weeks ago

Hi Daniel,

The current design has the HashMap storing namespaces in an Rc, i.e. Rc<HashMap<…>>, so that it can be easily, and cheaply, shared. The idea is to add that to the Transform enum. Because it’s an Rc, that’s just a pointer. I haven’t added it to every variant yet, but I haven’t finished the implementation so we’ll see how we go.

If an application is constructing a Transform directly, without a stylesheet document to copy Namespaces from, then we may add a convenience function to map (prefix, namespace URI) pairs to the required structure.
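Such a convenience function might look something like this sketch (the function name and exact shape are hypothetical; only the Rc<HashMap<…>> representation comes from the discussion above):

```rust
use std::collections::HashMap;
use std::rc::Rc;

// Hypothetical convenience function: build the shared namespace map from
// (prefix, namespace URI) pairs, for constructing a Transform directly
// without a stylesheet document to copy namespaces from.
fn namespaces_from_pairs(pairs: &[(&str, &str)]) -> Rc<HashMap<String, String>> {
    Rc::new(
        pairs
            .iter()
            .map(|(p, u)| (p.to_string(), u.to_string()))
            .collect(),
    )
}

fn main() {
    let ns = namespaces_from_pairs(&[("xsl", "http://www.w3.org/1999/XSL/Transform")]);
    let shared = Rc::clone(&ns); // sharing with a Transform variant is a pointer copy
    assert_eq!(
        shared.get("xsl").map(String::as_str),
        Some("http://www.w3.org/1999/XSL/Transform")
    );
}
```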

Cheers, Steve


ballsteve commented 1 week ago

Branch issue_95_96 now has the results of some major surgery.

QualifiedName has been changed to use Rc instead of String. Node names now use Rc. The XML parser interns values and names so that they can be reused.

Running the benchmark shows an 80-90% improvement in the parser(!).

Running the parser on a 100 element document takes, on average, 19.9ms or 0.199ms/element. Running the parser on a 1000 element document takes, on average, 208ms or 0.2ms/element. This is linear performance.

Previously, the 100 element result was 1.2ms/element and for 1000 elements it was 13.2ms/element. Non-linear performance.

That's a great improvement, so I'll flow that change through the rest of the code.

Another potential improvement would be to use global string interning. This would allow string comparisons for equality to be made by comparing pointers. The parser, XPath and XSLT do a lot of equality comparisons. However, let's get this change bedded down first ;-)
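For illustration, a minimal interner along those lines (a sketch, not the planned implementation): each distinct string is stored once and handed out as an Rc<str>, so equality of interned names can be checked by comparing pointers rather than bytes.

```rust
use std::collections::HashMap;
use std::rc::Rc;

// Minimal string interner: maps each distinct string to a single shared
// Rc<str> allocation.
#[derive(Default)]
struct Interner {
    table: HashMap<String, Rc<str>>,
}

impl Interner {
    fn intern(&mut self, s: &str) -> Rc<str> {
        if let Some(existing) = self.table.get(s) {
            return Rc::clone(existing); // reuse the existing allocation
        }
        let rc: Rc<str> = Rc::from(s);
        self.table.insert(s.to_string(), Rc::clone(&rc));
        rc
    }
}

fn main() {
    let mut interner = Interner::default();
    let a = interner.intern("child");
    let b = interner.intern("child");
    let c = interner.intern("doc");
    // Same allocation, so pointer equality suffices for name comparison.
    assert!(Rc::ptr_eq(&a, &b));
    assert!(!Rc::ptr_eq(&a, &c));
}
```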