eXist-db / exist


[BUG] Third parameter to ft:facets() returns incorrect results #4190

Open djbpitt opened 2 years ago

djbpitt commented 2 years ago

Context

eXist-db 5.4 snapshot on macOS, installed from the DMG.

Describe the bug

The following query correctly returns all facet results:

xquery version "3.1";
declare namespace tei="http://www.tei-c.org/ns/1.0";
declare variable $articles as element(tei:TEI)+ := collection('/db/apps/pr-app/data/hoax_xml')/tei:TEI;
let $hits as element(tei:TEI)+ := $articles[ft:query(
        .,
        (),
        map {
            "fields": "publisher"
        }
    )]
let $facets := ft:facets($hits, "publisher")
return $facets

Passing a third argument to ft:facets() to limit the number of facets returned produces incorrect results. Without the limit:

let $facets := ft:facets($hits, "publisher")
return count(map:keys($facets))

correctly returns 25. But:

let $facets := ft:facets($hits, "publisher", 10)
return count(map:keys($facets))

also returns 25, and

let $facets := ft:facets($hits, "publisher", 1)
return count(map:keys($facets))

returns 22.

Expected behavior

I expect ft:facets() to return the n facet items with the highest frequencies (or fewer, if the total number of items is less than n), where n is the third argument. That is, I expect 1 item if I specify 1, 10 items if I specify 10, etc.
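To make the expected semantics concrete, here is a minimal sketch in plain XQuery 3.1 that trims a facet map to its n most frequent entries. It assumes ft:facets() returns a map from facet label to integer count; local:top-n is a hypothetical helper written for illustration, not part of eXist-db, and breaking ties alphabetically is my own choice.

```xquery
xquery version "3.1";
declare namespace map = "http://www.w3.org/2005/xpath-functions/map";

(: Hypothetical helper: keep only the $n entries with the highest counts.
   Assumes $facets maps facet labels to integer counts. :)
declare function local:top-n($facets as map(*), $n as xs:integer) as map(*) {
    let $sorted-keys :=
        for $key in map:keys($facets)
        (: highest count first; ties broken alphabetically (an assumption) :)
        order by $facets($key) descending, $key ascending
        return $key
    return map:merge(
        for $key in subsequence($sorted-keys, 1, $n)
        return map:entry($key, $facets($key))
    )
};

(: Made-up sample data: yields a map containing only the entries for "C" and "A" :)
local:top-n(map { "A": 5, "B": 3, "C": 9 }, 2)
```

Applied to the unlimited result, as in local:top-n(ft:facets($hits, "publisher"), 10), this might also serve as a stopgap when the built-in limit misbehaves, though I have not tested it against the data above.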

To reproduce

collection.xconf is:

<collection xmlns="http://exist-db.org/collection-config/1.0" xmlns:tei="http://www.tei-c.org/ns/1.0">
    <index xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <!-- Configure lucene full text index -->
        <lucene>
            <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
            <analyzer id="ws" class="org.apache.lucene.analysis.core.WhitespaceAnalyzer"/>
            <text qname="tei:body"/>
            <text qname="tei:placeName"/>
            <text qname="tei:TEI">
                <field name="publisher" expression="(descendant::tei:publicationStmt/tei:publisher[has-children(.)], '[unknown]')[1]"/>
                <facet dimension="publisher" expression="descendant::tei:publicationStmt/tei:publisher"/>
            </text>
        </lucene>
    </index>
</collection>

Data is in a private repo (copyright); Joe Wicentowski has access and was able to confirm the behavior.

Workaround

Per Joe’s advice, including the full collection() path directly in the expression that uses the index, instead of using a variable, gives the correct behavior:

let $hits as element(tei:TEI)+ := collection('/db/apps/pr-app/data/hoax_xml')[ft:query(., (), 10)]

Thoughts

If this behavior reflects an inherent limitation of the optimizer (rather than a bug that can be fixed), it would be helpful to document it. The issue is not only that optimization is not performed; in this case the result returned by the expression is incorrect.

djbpitt commented 2 years ago

Oops! I mistyped the workaround. ft:query() doesn't take the third numerical parameter; that belongs with ft:facets(). A full working example with the workaround is:

xquery version "3.1";
declare namespace tei="http://www.tei-c.org/ns/1.0";
let $hits as element(tei:TEI)+ := collection('/db/apps/pr-app/data/hoax_xml')/tei:TEI[ft:query(., ())]
let $facets := ft:facets($hits, "publisher", 10)
return $facets
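
For anyone trying to confirm the difference, the two forms can be compared in a single query. This is only a sketch assembled from the snippets in this thread: the collection path and the "publisher" dimension are the ones shown in the report, the map keys "via-variable" and "inline" are labels chosen for illustration, and ft is assumed to be bound by eXist-db as in the queries here.

```xquery
xquery version "3.1";
declare namespace tei = "http://www.tei-c.org/ns/1.0";
declare namespace map = "http://www.w3.org/2005/xpath-functions/map";

(: Form from the original report: the collection is bound to a variable first :)
declare variable $articles as element(tei:TEI)+ :=
    collection('/db/apps/pr-app/data/hoax_xml')/tei:TEI;

let $via-variable := $articles[ft:query(., ())]
(: Workaround form: collection() appears directly in the filtered expression :)
let $inline := collection('/db/apps/pr-app/data/hoax_xml')/tei:TEI[ft:query(., ())]
return map {
    "via-variable": count(map:keys(ft:facets($via-variable, "publisher", 10))),
    "inline": count(map:keys(ft:facets($inline, "publisher", 10)))
}
```

If the behavior described in this issue is present, the two counts should differ, and only the "inline" count should respect the limit of 10.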