[BUG] Execution time differs between different variants of predicates

dariok commented 2 years ago

Describe the bug There is a difference in execution times between different variants of predicates in a query which should actually be the same.

The following data were measured on a set of c. 4.9 million elements with @xml:id. There is an index for @type. The tests used different values for the not() check each time (values in the table below are examples only) which were not used in any location so the resulting set always consisted of 4,880,027 items.

predicate	run 1	run 2	run 3	run 4	average	σ²
`[not(@type = ('t1', 't2') and @xml:id]`	50.84s	48.74s	47.86s	47.55s	48.75s	2.200
`[not(@type = ('t3', 't4')][@xml:id]`	47.55s	46.94s	46.41s	47.51s	47.10s	0.290
`[@xml:id and not(@type = ('t5', 't6')]`	44.75s	44.46s	44.67s	44.87s	44.69s	0.030
`[@xml:id][not(@type = ('t7', 't8')]`	46.00s	45.42s	45.60s	45.39s	44.59s	0.088

Several things can be observed here:

1 and 2 are significantly slower than 3 and 4
variance for 1 and 2 is signinficantly larger than for 3 and 4 and for 1 is much larger than for 2
with the exception of 2, last run, the queries get fast each time for 1 and 2 but that may just be a random thing

Expected behavior While some variance is to be expected, all query times should be roughly the same.

To Reproduce (last run was)

xquery version "3.1";

collection("/db/apps/edoc/data/pd000006/texts/17xx")//*[@xml:id][not(@type = ('t31', 't32'))]

Context (please always complete the following information):

OS: Ubuntu 20.04
eXist-db version: 5.3.0
Java Version: OpenJDK Runtime Environment (build 11.0.14.1+1-Ubuntu-0ubuntu1.20.04)

Additional context

How is eXist-db installed? JAR
Any custom changes in e.g. conf.xml? max. RAM to 6G

joewiz commented 2 years ago

For the purposes of comparing timings, it would be advisable to use consistent parameters - so use t1, t2 throughout all examples. The reason is that, as the tuning article states - https://exist-db.org/exist/apps/doc/tuning#selective - performance is expected to change based on how "selective" each filter is in a sequence of filters.

Update: Reading the description a bit more carefully, I see that the not() filters always return true(), regardless of the t1, t2 values. So I retract my advice!

Did you change these values so that you could achieve "fresh" timings? Was it necessary to change the t1, t2 values?

dariok commented 2 years ago

I used different values each time to avoid obfuscating the underlying problem – i. e. to get fresh timings, indeed. I am not sure whether some form of caching could actually interfere with the timing but I wanted to make sure this is not the case, hence changing the value each time (so, type 1, run 1 was ('t1', 't2') and the final run for type 4 was ('t31', 't32'))

joewiz commented 2 years ago

Did you restart the database when changing the predicates? If not, then eXist's caches from one set of tests may contribute to the subsequent set. To eliminate the cache as a variable, I'd advise to restart when changing a predicate.

eXist-db / exist

[BUG] Execution time differs between different variants of predicates #4361