eXist-db / exist

eXist Native XML Database and Application Platform
https://exist-db.org
GNU Lesser General Public License v2.1
429 stars 179 forks source link

[BUG] Execution time differs between different variants of predicates #4361

Open dariok opened 2 years ago

dariok commented 2 years ago

Describe the bug There is a difference in execution times between different variants of predicates in a query which should actually be the same.

The following data were measured on a set of c. 4.9 million elements with @xml:id. There is an index for @type. The tests used different values for the not() check each time (values in the table below are examples only) which were not used in any location so the resulting set always consisted of 4,880,027 items.

predicate run 1 run 2 run 3 run 4 average σ²
[not(@type = ('t1', 't2') and @xml:id] 50.84s 48.74s 47.86s 47.55s 48.75s 2.200
[not(@type = ('t3', 't4')][@xml:id] 47.55s 46.94s 46.41s 47.51s 47.10s 0.290
[@xml:id and not(@type = ('t5', 't6')] 44.75s 44.46s 44.67s 44.87s 44.69s 0.030
[@xml:id][not(@type = ('t7', 't8')] 46.00s 45.42s 45.60s 45.39s 44.59s 0.088

Several things can be observed here:

Expected behavior While some variance is to be expected, all query times should be roughly the same.

To Reproduce (last run was)

xquery version "3.1";

collection("/db/apps/edoc/data/pd000006/texts/17xx")//*[@xml:id][not(@type = ('t31', 't32'))]

Context (please always complete the following information):

Additional context

joewiz commented 2 years ago

For the purposes of comparing timings, it would be advisable to use consistent parameters - so use t1, t2 throughout all examples. The reason is that, as the tuning article states - https://exist-db.org/exist/apps/doc/tuning#selective - performance is expected to change based on how "selective" each filter is in a sequence of filters.

Update: Reading the description a bit more carefully, I see that the not() filters always return true(), regardless of the t1, t2 values. So I retract my advice!

Did you change these values so that you could achieve "fresh" timings? Was it necessary to change the t1, t2 values?

dariok commented 2 years ago

I used different values each time to avoid obfuscating the underlying problem – i. e. to get fresh timings, indeed. I am not sure whether some form of caching could actually interfere with the timing but I wanted to make sure this is not the case, hence changing the value each time (so, type 1, run 1 was ('t1', 't2') and the final run for type 4 was ('t31', 't32'))

joewiz commented 2 years ago

Did you restart the database when changing the predicates? If not, then eXist's caches from one set of tests may contribute to the subsequent set. To eliminate the cache as a variable, I'd advise to restart when changing a predicate.