eXist-db / exist

eXist Native XML Database and Application Platform
https://exist-db.org
GNU Lesser General Public License v2.1
430 stars 180 forks source link

[BUG] Ordering of in-memory nodes is not preserved #4918

Open joewiz opened 1 year ago

joewiz commented 1 year ago

Describe the bug

In certain circumstances, eXist loses the ordering of in-memory nodes.

Expected behavior

I expected eXist to retain the ordering of in-memory nodes.

To reproduce

As demonstrated in the simple example and full XQSuite test suite below, adapted from @KitWallace's report on exist-open, eXist's in-memory ordering of parameters (e.g., when constructing a sequence of items ($ordered, $defined)) is scrambled. The same query in his email works as expected in BaseX and Saxon and fails only in eXist.)

Simple query:

xquery version "3.1";

declare function local:topological-sort($unordered, $ordered)   {
    if (empty($unordered)) then 
        $ordered
    else
        let $defined := $unordered [ every $id in ref/@id satisfies $id = $ordered/@id ]
        return
            local:topological-sort( $unordered except $nodes, ($ordered, $defined) )
};

let $graph :=
    <graph>
        <node id="a">
            <ref id="b"/>
            <ref id="c"/>
        </node>
        <node id="b">
            <ref id="c"/>
        </node>
        <node id="c"/>
    </graph>
return 
    local:topological-sort($graph/node, ())

BaseX and Saxon return:

<node id="c"/>
<node id="b"><ref id="c"/></node>
<node id="a"><ref id="b"/><ref id="c"/></node>

eXist returns:

<node id="b"><ref id="c"/></node>
<node id="c"/>
<node id="a"><ref id="b"/><ref id="c"/></node>

XQSuite

xquery version "3.1";

module namespace t="http://exist-db.org/xquery/test";

declare namespace test="http://exist-db.org/xquery/xqsuite";

declare variable $t:ABC :=
    <node id="a">
        <ref id="b"/>
        <ref id="c"/>
    </node>;

declare variable $t:BC :=
    <node id="b">
        <ref id="c"/>
    </node>;

declare variable $t:C := 
    <node id="c"/>;

declare function local:topological-sort($unordered, $ordered) {
    util:log(
        "INFO",
        string-join(
            (
                if (empty($ordered)) then 
                    "=== STARTING SORT OF " || count($unordered) || " NODES ==="
                else
                    "CONTINUING - " || count($unordered) || " NODES TO GO",
                serialize(
                    <nodes> 
                        <unordered>{$unordered}</unordered>
                        <ordered>{$ordered}</ordered>
                    </nodes>, 
                    map { "indent": true() }
                )
            ),
            "&#10;"
        )
    ),
    if (empty($unordered)) then 
        $ordered
    else
        let $defined := $unordered[ every $id in ref/@id satisfies $id = $ordered/@id ]
        return
            local:topological-sort( 
                $unordered except $defined, 
                ($ordered, $defined)
            )
};

declare
    %test:assertEquals('<node id="c"/>')
function t:sort-1() {
    local:topological-sort($t:C, ())
};

declare
    %test:assertEquals('<node id="c"/>')
function t:sort-1-wrapped() {
    local:topological-sort(<x>{$t:C}</x>/*, ())
};

declare
    %test:assertEquals('<node id="c"/>', '<node id="b"><ref id="c"/></node>')
function t:sort-2() {
    local:topological-sort(($t:BC, $t:C), ())
};

declare
    %test:assertEquals('<node id="c"/>', '<node id="b"><ref id="c"/></node>')
function t:sort-2-wrapped() {
    local:topological-sort(<x>{$t:BC, $t:C}</x>/*, ())
};

declare
    %test:assertEquals('<node id="c"/>', '<node id="b"><ref id="c"/></node>', '<node id="a"><ref id="b"/><ref id="c"/></node>')
function t:sort-3() {
    local:topological-sort(($t:ABC, $t:BC, $t:C), ())
};

declare
    %test:assertEquals('<node id="c"/>', '<node id="b"><ref id="c"/></node>', '<node id="a"><ref id="b"/><ref id="c"/></node>')
function t:sort-3-wrapped() {
    local:topological-sort(<x>{$t:ABC, $t:BC, $t:C}</x>/*, ())
};

The XQSuite results:

<testsuite package="http://exist-db.org/xquery/test" timestamp="2023-05-11T14:12:07.337-04:00" tests="6" failures="1" errors="0" pending="0" time="PT0.016S">
    <testcase name="sort-1" class="t:sort-1"/>
    <testcase name="sort-1-wrapped" class="t:sort-1-wrapped"/>
    <testcase name="sort-2" class="t:sort-2"/>
    <testcase name="sort-2-wrapped" class="t:sort-2-wrapped"/>
    <testcase name="sort-3" class="t:sort-3"/>
    <testcase name="sort-3-wrapped" class="t:sort-3-wrapped">
        <failure message="assertEquals failed." type="failure-error-code-1">&lt;node id="c"/&gt; &lt;node id="b"&gt;&lt;ref id="c"/&gt;&lt;/node&gt; &lt;node id="a"&gt;&lt;ref id="b"/&gt;&lt;ref id="c"/&gt;&lt;/node&gt;</failure>
        <output>
            <node id="b">
                <ref id="c"/>
            </node>
            <node id="c"/>
            <node id="a">
                <ref id="b"/>
                <ref id="c"/>
            </node>
        </output>
    </testcase>
</testsuite>

In other words, the problem only occurs when (a) we are operating on 3 items and (b) the input is constructed by wrapping the 3 items into an element and selecting its child nodes.

The logging demonstrates the point during the processing when the problem appears:

=== STARTING SORT OF 3 NODES ===
<nodes>
    <unordered>
        <node id="a">
            <ref id="b"/>
            <ref id="c"/>
        </node>
        <node id="b">
            <ref id="c"/>
        </node>
        <node id="c"/>
    </unordered>
    <ordered/>
</nodes> 

CONTINUING - 2 NODES TO GO
<nodes>
    <unordered>
        <node id="a">
            <ref id="b"/>
            <ref id="c"/>
        </node>
        <node id="b">
            <ref id="c"/>
        </node>
    </unordered>
    <ordered>
        <node id="c"/>
    </ordered>
</nodes> 

CONTINUING - 1 NODES TO GO
<nodes>
    <unordered>
        <node id="a">
            <ref id="b"/>
            <ref id="c"/>
        </node>
    </unordered>
    <ordered>
        <node id="c"/>
        <node id="b">
            <ref id="c"/>
        </node>
    </ordered>
</nodes> 

CONTINUING - 0 NODES TO GO
<nodes>
    <unordered/>
    <ordered>
        <node id="b">
            <ref id="c"/>
        </node>
        <node id="c"/>
        <node id="a">
            <ref id="b"/>
            <ref id="c"/>
        </node>
    </ordered>
</nodes> 

Notice how, with 1 node to go, the "ordered" nodes are correctly sorted:

  1. <node id="c"/>
  2. <node id="b"><ref id="c"/></node>

We'd expect the final run to sort the remaining node into the end - as the 3rd item. But in fact, the order is scrambled:

  1. <node id="b"><ref id="c"/></node>
  2. <node id="c"/>
  3. <node id="a"><ref id="b"/><ref id="c"/></node>

I tried to trim this down to an even smaller reproducible example - sorting nodes into alphabetical order based on an attribute - but this variation didn't trigger the issue:

xquery version "3.1";

declare function local:sort($unsorted, $sorted) {
    if (empty($unsorted)) then 
        $sorted
    else
        let $smallest := min($unsorted/@id/string())
        return
            local:sort($unsorted[@id ne $smallest], ($sorted, $unsorted[@id eq $smallest]))
};

let $nodes := (<x id="c"/>, <x id="b"/>, <x id="a"/>)
return
    local:sort(<y>{$nodes}</y>/*, ())

The results are returned in the correct order (<x id="a"/>, <x id="b"/>, <x id="c"/>), instead of a scrambled order as above.

As a result, I think we have to conclude that something about the particular combination of factors shown in the XQSuite above - the particular sorting being done in local:topological-sort, together with the use of a path expression to supply parameters to the function - contributes to eXist's mangling of the order of the in-memory nodes.

Context (please always complete the following information)

Build: eXist-6.2.0 (c8fa4958b6d4a50bd0cba7f3e76a150226414187) Java: 1.8.0_372 (Temurin) OS: Mac OS X 13.3.1 (x86_64)

Additional context

adamretter commented 1 year ago

At a guess, as this happens during the last step, I might think that there is possibly a bad optimisation at play.

There are a lot of different expressions at play in the reproducible example. It could be simpler to debug this if it used for expressions instead of every/in/satisfies and except expressions. Do you think it would be possible to rewrite this a little?

joewiz commented 1 year ago

Re "this happens during the last step," I had the same idea, so I checked and confirmed that the results are identical if the "sort-3" test is moved to the final position; only "sort-3-wrapped" fails.

Yes, I think I could rewrite this to avoid every/in/satisfies/except.

adamretter commented 1 year ago

only "sort-3-wrapped" fails.

Fascinating! :-)

joewiz commented 1 year ago

Here's a version of the local:topological-sort function without the use of every/in/satisfies/except, and plus a variation of the simple query that lets you define how many nodes you want to see. Every test with more than 2 nodes (i.e., a $how-many > 2) fails in eXist. The same tests all pass in BaseX and Saxon.

xquery version "3.1";

declare function local:topological-sort($unordered, $ordered)   {
    if (empty($unordered)) then 
        $ordered
    else
        let $defined := 
            for $node in $unordered
            let $id-checks := 
                for $id in $node/ref/@id
                return
                    $id = $ordered/@id
            return
                if (count($id-checks[. eq false()]) ge 1) then 
                    ()
                else
                    $node
        return
            local:topological-sort( $unordered[not(@id = $defined/@id)], ($ordered, $defined) )
};

let $alphabet := (string-to-codepoints("a") to string-to-codepoints("z")) ! codepoints-to-string(.)
for $how-many in 1 to 10
(:let $how-many := 3:)
let $letters := reverse(subsequence($alphabet, 1, $how-many))
let $nodes := 
    for $size in 1 to $how-many
    let $ids := reverse(subsequence($letters, 1, $size))
    let $node-id := head($ids)
    let $ref-ids := tail($ids)
    return
        <node id="{$node-id}">
            {
                for $ref-id in $ref-ids
                return
                    <ref id="{$ref-id}"/>
            }
        </node>
let $original := <graph>{$nodes}</graph>
let $scrambled := <graph>{reverse($nodes)}</graph>
let $actual := <graph>{local:topological-sort($scrambled/node, ())}</graph>
let $passes := deep-equal($actual, $original)
let $result := 
    <result how-many="{$how-many}" passes="{$passes}">
        <scrambled>{$scrambled}</scrambled>
        <expected>{$original}</expected>
        <actual>{$actual}</actual>
    </result>
return 
    if ($passes) then 
        ()
    else
        $how-many
(:        $result:)

For example, setting $how-many to 5 yields this result:

<result how-many="5" passes="false">
    <scrambled>
        <graph>
            <node id="a">
                <ref id="b"/>
                <ref id="c"/>
                <ref id="d"/>
                <ref id="e"/>
            </node>
            <node id="b">
                <ref id="c"/>
                <ref id="d"/>
                <ref id="e"/>
            </node>
            <node id="c">
                <ref id="d"/>
                <ref id="e"/>
            </node>
            <node id="d">
                <ref id="e"/>
            </node>
            <node id="e"/>
        </graph>
    </scrambled>
    <expected>
        <graph>
            <node id="e"/>
            <node id="d">
                <ref id="e"/>
            </node>
            <node id="c">
                <ref id="d"/>
                <ref id="e"/>
            </node>
            <node id="b">
                <ref id="c"/>
                <ref id="d"/>
                <ref id="e"/>
            </node>
            <node id="a">
                <ref id="b"/>
                <ref id="c"/>
                <ref id="d"/>
                <ref id="e"/>
            </node>
        </graph>
    </expected>
    <actual>
        <graph>
            <node id="b">
                <ref id="c"/>
                <ref id="d"/>
                <ref id="e"/>
            </node>
            <node id="c">
                <ref id="d"/>
                <ref id="e"/>
            </node>
            <node id="d">
                <ref id="e"/>
            </node>
            <node id="e"/>
            <node id="a">
                <ref id="b"/>
                <ref id="c"/>
                <ref id="d"/>
                <ref id="e"/>
            </node>
        </graph>
    </actual>
</result>

I've left some commenting in that lets you switch to a single result or lets you see the full result for each hit.

KitWallace commented 1 year ago

Looking for a workaround I found this works - wrapping the sequence in an element

declare function local:topological-sort($unordered, $ordered-graph)   {
    if (empty($unordered))
    then $ordered-graph/*
    else
        let $ordered := $ordered-graph/*
        let $nodes := $unordered [ every $id in ref/@id satisfies $id = $ordered/@id]
        return
          if ($nodes)
          then let $ordered-graph :=  element graph {($ordered , $nodes)}
                  return local:topological-sort($unordered except $nodes, $ordered-graph)
          else ()  (: cycles so no order possible :)
};
adamretter commented 1 year ago

@joewiz Excellent thanks. For testing and debugging purposes the less amount of code and variance in expressions the better, so let's go for this:

xquery version "3.1";

declare function local:topological-sort($unordered, $ordered)   {
    if (empty($unordered)) then 
        $ordered
    else
        let $defined := 
            for $node in $unordered
            let $id-checks := 
                for $id in $node/ref/@id
                return
                    $id = $ordered/@id
            return
                if (count($id-checks[. eq false()]) ge 1) then 
                    ()
                else
                    $node
        return
            local:topological-sort( $unordered[not(@id = $defined/@id)], ($ordered, $defined) )
};

local:topological-sort(
    <graph>
        <node id="a">
            <ref id="b"/>
            <ref id="c"/>
        </node>
        <node id="b">
            <ref id="c"/>
        </node>
        <node id="c"/>
    </graph>/node
    ,
    ()
)
adamretter commented 1 year ago

I may have made a mistake, I would certainly appreciate further sets of eyes, but I think that local:topological-sort can be simplified further:

Reduction 1

declare function local:topological-sort($unordered, $ordered)   {
    if (empty($unordered)) then 
        $ordered
    else
        let $defined := 
            for $node in $unordered
            return
                if (exists($node/ref[not(@id = $ordered/@id)])) then 
                    ()
                else
                    $node
        return
            local:topological-sort( $unordered[not(@id = $defined/@id)], ($ordered, $defined) )
};

Reduction 2

declare function local:topological-sort($unordered, $ordered)   {
    if (empty($unordered)) then 
        $ordered
    else
        let $defined := $unordered[empty(ref[not(@id = $ordered/@id)])]
        return
            local:topological-sort( $unordered[not(@id = $defined/@id)], ($ordered, $defined) )
};
joewiz commented 1 year ago

I can confirm that both "reduction" versions produce the same results as earlier variations, both in the "direct" form and the "xqsuite" form.

Also, with these "simple" variants of the test (i.e., not the xqsuite version above), for some reason it isn't necessary to wrap the nodes in a "graph" element to produce the incorrect results. Just passing the "node" elements to the sort function (instead of wrapping them in a "graph" element and selecting its child elements) produces the scrambled results.

xquery version "3.1";

declare function local:topological-sort($unordered, $ordered) {
    if (empty($unordered)) then 
        $ordered
    else
        let $defined := $unordered[empty(ref[not(@id = $ordered/@id)])]
        return
            local:topological-sort( $unordered[not(@id = $defined/@id)], ($ordered, $defined) )
};

local:topological-sort(
    (
        <node id="a">
            <ref id="b"/>
            <ref id="c"/>
        </node>,
        <node id="b">
            <ref id="c"/>
        </node>,
        <node id="c"/>
    )
    ,
    ()
)

... returns:

<node id="b">
    <ref id="c"/>
</node>
<node id="c"/>
<node id="a">
    <ref id="b"/>
    <ref id="c"/>
</node>

... whereas BaseX and Saxon return:

<node id="c"/>
<node id="b">
  <ref id="c"/>
</node>
<node id="a">
  <ref id="b"/>
  <ref id="c"/>
</node>
adamretter commented 1 year ago

I had a little time to look into this... The incorrect ordering of results are being returned because behind the scenes the directly constructed XML nodes (in-memory nodes) are placed into a ValueSequence. The problem is that the ValueSequence code frequently tries to sort things into document order (for unknown reasons - https://github.com/eXist-db/exist/blob/develop-6.x.x/exist-core/src/main/java/org/exist/xquery/value/ValueSequence.java#L429. It is this sorting that causes the described problem, if I comment out the sorting then you set the expected results from eXist-db that match BaseX and Saxon.

I am afraid that this is another instance of where someone did something in the past in the eXist-db code (probably for some sensible reason at the time), that doesn't make much sense when looked at today. There is also no documentation explaining why it was done that way, and it doesn't follow the XQuery standards. ValueSequece is used in many places in eXist-db's XQuery Engine, and unfortunately the ValueSequence implementation and (its use) look like a complete minefield to me. Therefore it is hard to know what existing behaviour may be fixed and broken by removing this sorting.


On a side note - Hilariously the above code results in FastQSort.sort(...) being executed some 54 times! When it need be executed 0 times. As sorting of data is a known slow operation in computer science, it is likely that fixing (or more likely implementing a replacement for ValueSequence) could yield serious performance gains across the board for all queries in eXist-db. This is no small undertaking though, as it would require both implementation of a comprehensive test suite, and careful analysis of W3C XQTS results to ensure no regressions.