eXist-db / exist

eXist Native XML Database and Application Platform
https://exist-db.org
GNU Lesser General Public License v2.1
429 stars 179 forks source link

[BUG] Performance problem - Index not being used for a query where it probably should #5043

Open alanpaxton opened 1 year ago

alanpaxton commented 1 year ago

Describe the bug A large corpus with the following index configured:

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:wiki="http://exist-db.org/xquery/wiki" xmlns:html="http://www.w3.org/1999/xhtml" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:ns1="https://ns1.ac.uk/corpus/">
    <create qname="@lemma" type="xs:string"/>
  </index>
</collection>

and re-indexed thus:

xmldb:reindex('/db/corpus')

For a simple query, the index is being used:

xquery version "3.1";

(: Queries a corpus of representative size, ca. 30 million words :)

declare namespace ns1 = "https://ns1.ac.uk/corpus/";

let $corpus := collection('/db/corpus')

(: For testing:
        'word-lo' is a low-medium frequency word
        'word-med' is a medium frequency word
        'word-hi' is also a very high frequency word
    :)
let $starttime as xs:time := util:system-time()
let $searchterm := 'word-lo'
let $all := $corpus//ns1:w[@lemma eq $searchterm]
let $num := fn:count($all)
let $endtime as xs:time := util:system-time()
let $min := fn:minutes-from-duration($endtime - $starttime)
let $sec := fn:seconds-from-duration($endtime - $starttime)

return
    <p searchterm="{$searchterm}" time="{$min}m{$sec}s" num="{$num}">Duration of search for term, with number of results</p>

This gives me the helpful result that range indices for @lemma are indexing effectively, and results are strongly dependent on the frequency of the terms:

<p searchterm="word-lo" time="0m0.104s" num="1014">Duration of search for term, with number of results</p>
<p searchterm="word-med" time="0m1.636s" num="31650">Duration of search for term, with number of results</p>
<p searchterm="word-hi" time="0m13.349s" num="777303">Duration of search for term, with number of results</p>

but using the alternative query (as originally proposed by the owners of the corpus):

let $all := $corpus//*/ns1:w[@lemma eq $searchterm]

and we see the following times:

<p searchterm="word-lo" time="2m19.224s" num="1014">Duration of search for term, with number of results</p>
<p searchterm="word-hi" time="2m35.838s" num="777303">Duration of search for term, with number of results</p>

Expected behaviour The second query should run in a similar time to the first; the fact that it is much slower suggests that the index is not being used in this case.

To Reproduce Unfortunately a copy of a large corpus (30M words) was used to reproduce the results. It may be possible to understand the problem without the corpus. It may be possible for us to supply the corpus privately upon request.

Additional context

joewiz commented 1 year ago

@alanpaxton You can tell if an index is being used to evaluate a query by enabling tracing in monex's profiling pane: http://localhost:8080/exist/apps/monex/profiling.html. Enable tracing, clear, tare, run your query, then refresh and check the index usage tab.

It sounds like the query optimizer isn't recognizing the ability of this query to be optimized. Tips for achieving the greatest optimizations can be found in https://exist-db.org/exist/apps/doc/tuning. Certainly there is room to make improvements to the query optimizer (and corresponding updates to the article).

alanpaxton commented 1 year ago

@alanpaxton You can tell if an index is being used to evaluate a query by enabling tracing in monex's profiling pane: http://localhost:8080/exist/apps/monex/profiling.html. Enable tracing, clear, tare, run your query, then refresh and check the index usage tab.

It sounds like the query optimizer isn't recognizing the ability of this query to be optimized. Tips for achieving the greatest optimizations can be found in https://exist-db.org/exist/apps/doc/tuning. Certainly there is room to make improvements to the query optimizer (and corresponding updates to the article).

Thanks Joe, that's info I didn't have which should help me dig out the problem.

adamretter commented 1 year ago

It would be interesting to know if this PR when merged also makes a difference to your issue @alanpaxton - https://github.com/eXist-db/exist/pull/4989