eXist-db / exist

eXist Native XML Database and Application Platform
https://exist-db.org
GNU Lesser General Public License v2.1

XPaths with `[1]` attribute return wrong results #426

Closed gioele closed 5 years ago

gioele commented 9 years ago

XPath queries that contain a [1] predicate return only some of the nodes that they should return.

For example, the following query will return only 1 item, the one with ID lemma-AnuvaMSya. Instead it should have returned 4 results, as confirmed by libxml and oXygen.

A note on the query. This is a minimal test case reduction; in this particular case removing the [1] will lead to the expected behaviour, but removing the [1] is not possible in the original query/environment from which this test case has been derived.

xquery version "3.0";

declare namespace tei="http://www.tei-c.org/ns/1.0";

doc('/db/dict/test.tei')//tei:entry
    [./tei:sense//text()
        [contains(., 'aMSa')]
        [ancestor::*[@xml:lang][1]]
    ]

<TEI xmlns="http://www.tei-c.org/ns/1.0" version="5" xml:id="monier" xml:lang="en">
    <text>
        <entry xml:id="lemma-anaMSin" ana="H1">
            <sense>an-aMSa</sense>
        </entry>
        <entry xml:id="lemma-apaBraMSa" ana="H1">
            <sense ana="H1A">ungrammatical language</sense>
        </entry>
        <entry xml:id="lemma-apaSabda" ana="H1">
            <sense> bad or vulgar speech apa-BraMSa</sense>
        </entry>
        <entry xml:id="lemma-AMhaspatya" ana="H1">
            <sense>belonging to the dominion of</sense>
        </entry>
        <entry xml:id="lemma-AnuvaMSya" ana="H1">
            <sense>(fr. <w xml:lang="san-Latn-x-SLP1">anu-vaMSa</w>), belonging
                to a race</sense>
        </entry>
        <entry xml:id="lemma-AnuvaMSya-no-elem" ana="H1">
            <sense>(fr. anu-vaMSa), belonging to a race</sense>
        </entry>
    </text>
</TEI>

jensopetersen commented 9 years ago

@gioele, try removing @xml:lang on the TEI element. To me, what eXist-db returns looks fine. It should only return four elements if four elements between the text nodes containing 'aMSa' and the entry elements have an @xml:lang. Zorba and BaseX return one element as well, no matter whether [1] is included or not, if the @xml:lang on the TEI element is omitted.

gioele commented 9 years ago

@jensopetersen: well, changing the data hides the bug but does not fix it. :)

In my case the @xml:lang attribute is there for an important reason and cannot be removed. Similarly, I cannot remove the [1] because it is a key part of a longer XPath.

Regardless of that, given a piece of data and a query, all compliant XPath implementations should return the same data. Either eXist is wrong or Saxon and libxml are. As I said, this is just the shortest test case that demonstrates the problem. The content of the test case does not really matter, the fact that implementations return different results does.

jensopetersen commented 9 years ago

@gioele, I tried the following query in Saxon-PE 9.5.1.5 (in oXygen), in eXist-db, in BaseX, and in Zorba, and they all return 4 entries:

    xquery version "3.0";
    declare namespace tei="http://www.tei-c.org/ns/1.0";
    let $doc :=
    <TEI xmlns="http://www.tei-c.org/ns/1.0" version="5" xml:id="monier" xml:lang="en">
       <text>
          <entry xml:id="lemma-anaMSin" ana="H1">
             <sense>an-aMSa</sense>
          </entry>
          <entry xml:id="lemma-apaBraMSa" ana="H1">
             <sense ana="H1A">ungrammatical language</sense>
          </entry>
          <entry xml:id="lemma-apaSabda" ana="H1">
             <sense> bad or vulgar speech apa-BraMSa</sense>
          </entry>
          <entry xml:id="lemma-AMhaspatya" ana="H1">
             <sense>belonging to the dominion of</sense>
          </entry>
          <entry xml:id="lemma-AnuvaMSya" ana="H1">
             <sense>(fr. <w xml:lang="san-Latn-x-SLP1">anu-vaMSa</w>), belonging
                to a race</sense>
          </entry>
          <entry xml:id="lemma-AnuvaMSya-no-elem" ana="H1">
             <sense>(fr. anu-vaMSa), belonging to a race</sense>
          </entry>
       </text>
    </TEI>
    return
    $doc//tei:entry
        [./tei:sense//text()
            [contains(., 'aMSa')]
            [ancestor::*[@xml:lang][1]]
        ]

The second entry does not have a text node that contains 'aMSa', only an attribute value. So the in-memory execution of this query in eXist-db is OK.

Whether [1] is there or not makes no difference: if there is an @xml:lang, there of course is a first @xml:lang.
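
A minimal sketch of that point, using a cut-down, hypothetical two-entry document (the `root` element and the variable names are illustrative, not part of the original test case). The inner predicate is evaluated for its effective boolean value, so a non-empty ancestor sequence is true with or without the positional `[1]`, and a conforming processor should report the same count twice (2 and 2 here):

```xquery
xquery version "3.0";

(: Hypothetical, reduced document; not the full test case from this issue. :)
let $doc := document {
    <root xml:lang="en">
        <entry><sense>an-aMSa</sense></entry>
        <entry><sense>(fr. <w xml:lang="san-Latn-x-SLP1">anu-vaMSa</w>)</sense></entry>
    </root>
}

(: Same filter twice: once with the positional [1], once without it. :)
let $with-position :=
    $doc//entry[sense//text()[contains(., 'aMSa')][ancestor::*[@xml:lang][1]]]
let $without-position :=
    $doc//entry[sense//text()[contains(., 'aMSa')][ancestor::*[@xml:lang]]]

return
    (count($with-position), count($without-position))
```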

What happens if the document is stored? In eXist-db, strangely, the query then does what I think you want it to do: it picks out only the fifth entry. This is why I wanted to clarify your query by paraphrasing it. I of course agree that the result is deterministic and that eXist-db may be wrong in the way it gets the "right" answer.

If you remove the @xml:lang on TEI you can see (by adding entries like the fifth after it) what eXist-db does: it takes the last of the entries it finds, that is, it somehow applies the ancestor axis to entry.

I think it does something which amounts to

$doc//tei:entry
    [./tei:sense//text()
        [contains(., 'aMSa')]
        [ancestor::*[@xml:lang]]
    ][last()]

but perhaps @wolfgangmm has a clearer idea what goes on.

I think you are right: this is a bug.

gioele commented 9 years ago

Just to make things clear: regardless of the meaning of the query, I expect the original query to return 4 entry elements out of 6. The bug is in the fact that it returns only 1.

Also, it is true that for this very case the presence or absence of [1] should not make a difference, but for some strange reason it does make a difference in the current eXist implementation, leading to two different results. Similarly, the fact that a document is stored in a variable or read via doc() should not make a difference in this case, but, again, it does.
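
One way to put the two cases side by side is sketched below. It assumes the test document is stored at `/db/dict/test.tei` (the path used in the original query) and that `document { ... }` produces an in-memory copy of the stored tree in eXist; `local:matching-ids` is a hypothetical helper introduced only for this comparison.

```xquery
xquery version "3.0";
declare namespace tei="http://www.tei-c.org/ns/1.0";

(: Hypothetical helper: runs the query from this issue against any root node
   and returns the xml:id of each matching entry. :)
declare function local:matching-ids($root as node()) as xs:string* {
    $root//tei:entry
        [./tei:sense//text()
            [contains(., 'aMSa')]
            [ancestor::*[@xml:lang][1]]
        ]/@xml:id/string()
};

(: Assumption: the test document has been stored at this path. :)
let $stored := doc('/db/dict/test.tei')
(: Assumption: constructing a new document node copies the tree into memory. :)
let $in-memory := document { $stored/* }

return
    <comparison>
        <stored>{ local:matching-ids($stored) }</stored>
        <in-memory>{ local:matching-ids($in-memory) }</in-memory>
    </comparison>
```

If the two lists of IDs differ, the discrepancy lies in how the stored document is evaluated rather than in the query itself.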

jensopetersen commented 9 years ago

Exactly what I attempted to write, @gioele.

kohsah commented 9 years ago

I am having a similar issue. In my case a query returns results only if I drop the indexes; adding the indexes makes the query return 0 results. However, I noticed one odd behaviour. This works:

for $bydatesell in $coll//trade[./scrip/transType['S' = .]]
...

but this fails:

for $bydatesell in $coll//trade[./scrip/transType[. = 'S']]
...

duncdrum commented 5 years ago

xquery version "3.1";
declare namespace tei="http://www.tei-c.org/ns/1.0";

let $test := document {
    <TEI xmlns="http://www.tei-c.org/ns/1.0" version="5" xml:id="monier" xml:lang="en">
    <text>
        <entry xml:id="lemma-anaMSin" ana="H1">
            <sense>an-aMSa</sense>
        </entry>
        <entry xml:id="lemma-apaBraMSa" ana="H1">
            <sense ana="H1A">ungrammatical language</sense>
        </entry>
        <entry xml:id="lemma-apaSabda" ana="H1">
            <sense> bad or vulgar speech apa-BraMSa</sense>
        </entry>
        <entry xml:id="lemma-AMhaspatya" ana="H1">
            <sense>belonging to the dominion of</sense>
        </entry>
        <entry xml:id="lemma-AnuvaMSya" ana="H1">
            <sense>(fr. <w xml:lang="san-Latn-x-SLP1">anu-vaMSa</w>), belonging
                to a race</sense>
        </entry>
        <entry xml:id="lemma-AnuvaMSya-no-elem" ana="H1">
            <sense>(fr. anu-vaMSa), belonging to a race</sense>
        </entry>
    </text>
</TEI>

}

return

$test//tei:entry
    [./tei:sense//text()
        [contains(., 'aMSa')]
        [ancestor::*[@xml:lang][1]]
    ]

returns


<entry xmlns="http://www.tei-c.org/ns/1.0" xml:id="lemma-anaMSin" ana="H1">
    <sense>an-aMSa</sense>
</entry>
<entry xmlns="http://www.tei-c.org/ns/1.0" xml:id="lemma-apaSabda" ana="H1">
    <sense> bad or vulgar speech apa-BraMSa</sense>
</entry>
<entry xmlns="http://www.tei-c.org/ns/1.0" xml:id="lemma-AnuvaMSya" ana="H1">
    <sense>(fr. <w xml:lang="san-Latn-x-SLP1">anu-vaMSa</w>), belonging
                to a race</sense>
</entry>
<entry xmlns="http://www.tei-c.org/ns/1.0" xml:id="lemma-AnuvaMSya-no-elem" ana="H1">
    <sense>(fr. anu-vaMSa), belonging to a race</sense>
</entry>