eXist-db / exist

eXist Native XML Database and Application Platform
https://exist-db.org
GNU Lesser General Public License v2.1

ngram:contains() bug? #2204

Open djbpitt opened 5 years ago

djbpitt commented 5 years ago

What is the problem

I am filtering a list of titles (in a single auxiliary document) to retain only those that contain a particular substring (using the ngram index). Once I have found the exact auxiliary titles that contain the target substring, I am using those exact titles to filter a collection of manuscript descriptions (each in a separate XML file) to keep only those that have a matching <title> element. I'm using ngram:contains(), rather than an explicit equality test, because it gets me case-insensitivity and yields the correct results (that is, substring matches don't contaminate the results).

xquery version "3.1";
declare namespace tei="http://www.tei-c.org/ns/1.0";
declare variable $mss as document-node()+ := collection('/db/repertorium/mss');
declare variable $auxTitles as element(title)* :=
    doc('/db/repertorium/aux/titles_cyrillic.xml')//title;
declare variable $target as xs:string :=
    request:get-parameter('target','константин');
let $bgTitles := $auxTitles[ngram:contains(*,$target)]/bg
for $title in $bgTitles
where $mss/descendant::tei:title[ngram:contains(.,$title)]
return $title

When I run the query above, it errors out with:

exerr:ERROR XPTY0004: The actual cardinality for parameter 2 does not match the
cardinality declared in the function's signature: ngram:contains($nodes as
node()*, $queryString as xs:string?) node()*. Expected cardinality: zero or one,
got 16. [at line 8, column 52, source: xquery version "3.1"; declare namespace
tei="http://www.tei-c.org/ns/1.0"; declare variable $mss as document-node()+ :=
collection('/db/repertorium/mss'); declare variable $auxTitles as
element(title)* := doc('/db/repertorium/aux/titles_cyrillic.xml')//title;
declare variable $target as xs:string :=
request:get-parameter('target','константин'); let $bgTitles :=
$auxTitles[ngram:contains(*,$target)]/bg for $title in $bgTitles where
$mss/descendant::tei:title[ngram:contains(.,$title)] return $title]

If I'm reading this correctly, ngram:contains() thinks that $title has a cardinality not of 1 (the actual cardinality of $title), but of 16 (the cardinality of $bgTitles, that is, of the sequence variable in the for statement, rather than of the range variable).

If I change the manuscript filtering to use general equality instead of ngram:contains(), as follows:

xquery version "3.1";
declare namespace tei="http://www.tei-c.org/ns/1.0";
declare variable $mss as document-node()+ := collection('/db/repertorium/mss');
declare variable $auxTitles as element(title)* :=
    doc('/db/repertorium/aux/titles_cyrillic.xml')//title;
declare variable $target as xs:string :=
    request:get-parameter('target','константин');
let $bgTitles := $auxTitles[ngram:contains(*,$target)]/bg
for $title in $bgTitles
where $mss/descendant::tei:title[lower-case(.) = lower-case($title)]
return $title

it returns the expected results.

What did you expect

I expect that the range variable in a for statement will always have a cardinality of 1, and that, therefore, ngram:contains() will never raise a cardinality error about its second argument when the second argument is the range variable in a for statement.
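A minimal sketch of that expectation in plain XQuery, with no index or eXist-specific functions involved. The three title strings are hypothetical stand-ins, not values from the repertorium data. Inside the `for`, the range variable is bound to one item at a time, so `count($title)` is 1 on every iteration, however long the sequence being iterated is:

```xquery
xquery version "3.1";

(: Hypothetical stand-in data; not the actual repertorium titles. :)
let $bgTitles := ("Константин", "Климент", "Методий")
for $title in $bgTitles
(: $title is a single item here, so count() yields 1 each time. :)
return count($title)
(: evaluates to (1, 1, 1) :)
```

This only illustrates the cardinality that the XQuery language gives the range variable; it does not exercise ngram:contains() and so does not reproduce the error itself.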

Describe how to reproduce or add a test

I would be happy to make my data available on request, but if the issue is, in fact, a bug in the implementation of ngram:contains(), it should be reproducible with other data.


duncdrum commented 5 years ago

@djbpitt to properly assess this we need more info, including index configuration, sample data, etc. Could you try to expand this into either a self-contained XQSuite test that reproduces the problem or, alternatively, share a minimal xar that contains all, and only, the files necessary to reproduce this? Thx

merenyics commented 5 years ago

@djbpitt are you sure let $bgTitles := $auxTitles[ngram:contains(*,$target)]/bg is correct? At first glance it seems it should be let $bgTitles := $auxTitles[ngram:contains(.,$target)]/bg (. instead of *), unless you really mean any child element of title. (Not that it explains your problem, just wondering...)