joewiz opened this issue 4 years ago
hmm, I see the same performance curve as you. I tried to refactor the dual for clause to actually use two for keywords, since I remember that made a huge difference in an older version, but this is no longer the case.
sooo ouch: 🐢
One possible optimisation BaseX could use here is memoizing the calls to local:compare or even for-each-pair.
The code below mimics this by first creating a map of all possible values. That way only a map lookup is needed, and comparing 100,000 items takes under two seconds in eXist 4.7.1:
```xquery
xquery version "3.1";

declare function local:key-function ($items) { string-join($items, '') };

let $length := 100000
let $sequence-of-maps := (1 to $length) ! map { "n": ., "values": ("a" || ., "b" || ., "c" || .) }
let $compare-map :=
    for-each($sequence-of-maps, function ($item) {
        map { local:key-function($item?values) : true() }
    })
    => map:merge()
for $item in $sequence-of-maps
let $key := local:key-function($item?values)
let $alignments := map:contains($compare-map, $key)
return
    if ($alignments)
    then ($item)
    else ()
```
I am confident your original problem could be rewritten in this way. But we should also investigate whether BaseX is in fact memoizing function calls, or whether this approach would, independently of what they do, help your and similar use cases.
The runtimes for an isolated test case of for-each-pair reveal no big performance gap between eXist 4.7.1 (1.9s) and BaseX 9.3.1 (1.8s):
```xquery
xquery version "3.1";

let $length := 200000
let $seq := (1 to $length)
return for-each-pair($seq, $seq, function ($a, $b) { $a = $b })
```
@line-o Thanks very much for your analysis. Your insight about using a key to workaround the performance limitations reported here helped me find a faster solution for finding the candidates: grouping by the key.
```xquery
xquery version "3.1";

let $length := 1000
let $sequence-of-maps := (1 to $length) ! map { "n": ., "values": ("a" || util:random(100), "b" || util:random(100), "c" || util:random(100)) }
let $keyed-sequence :=
    for-each($sequence-of-maps, function ($item) {
        map:put($item, "key", string-join($item?values))
    })
for $items in $keyed-sequence
group by $key := $items?key
where count($items) gt 1
return
    array { $items }
```
(As you might notice, the revised query uses util:random() to generate random values to populate the sequence of maps. This makes explicit an unstated assumption in my original query: its purpose is to discover sets of 2+ items in the sequence with identical values.)
The resulting query performs this task much faster, and I think it is adequate for my purposes.
That said, I should note that performance improves correspondingly in BaseX too. As a result, we still see a similar shape when comparing the performance of the two processors (this chart shows the average of 5 runs at different values of the $length parameter):
To get the query to run in BaseX, instead of substituting random:integer() for util:random(), I adapted the query to use fn:random-number-generator(), which is cross-platform, performed much better in eXist (but requires 5.3.0-SNAPSHOT, which includes @adamretter's great fix from https://github.com/eXist-db/exist/pull/3072), and avoided the heap space errors that appeared in eXist with $length set to 100,000. Even so, with $length set to 100,000, eXist finished in 306s, whereas BaseX finished in 0.96s.
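For reference, here is one way the data generation can be driven by the portable fn:random-number-generator(). This is only a sketch, not necessarily the exact adaptation used in the zipped queries: the threading of ?next() generators and the 0-99 range (mirroring util:random(100)) are assumptions.

```xquery
xquery version "3.1";

(: Sketch only: generate the sequence of maps with fn:random-number-generator().
   Each generator map exposes ?number (a double in [0,1)) and ?next() (a fresh
   generator), which must be threaded through explicitly, here via fold-left. :)
let $length := 1000
let $sequence-of-maps :=
    fold-left(1 to $length,
        map { "gen": random-number-generator(), "items": () },
        function ($acc, $n) {
            let $g1 := $acc?gen
            let $g2 := $g1?next()
            let $g3 := $g2?next()
            let $item := map {
                "n": $n,
                "values": (
                    "a" || xs:integer($g1?number * 100),
                    "b" || xs:integer($g2?number * 100),
                    "c" || xs:integer($g3?number * 100)
                )
            }
            return map { "gen": $g3?next(), "items": ($acc?items, $item) }
        }
    )?items
return count($sequence-of-maps)
```

The explicit generator threading is the price of a purely functional RNG, but it keeps the query portable across eXist and BaseX.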
Here are the queries I used to profile eXist and BaseX's performance: profile-xquery-exist-basex.zip.
In sum, the performance differences in both the original query reported in the first post and the modified ones here suggest that eXist still has untapped performance and scalability potential with this type of query.
@joewiz That is very insightful research! I think it is evident that our implementation of fn:for-each-pair has nothing to do with the performance issue. I would suggest renaming the issue to reflect the new data you delivered.
... and we need to take a look at the FLWOR implementation. Would you have time and energy to compare execution times with a "gluten-free" version? One that uses HOFs to iterate over the dataset instead.
@line-o Good point. I've renamed the issue to the best of my understanding. And going "gluten-free" is an interesting suggestion. To clarify, you're suggesting that I replace any FLWOR-based iterations with HOF, right? Is that because you suspect the problem is with eXist's FLWOR implementation?
Well, I just remembered I did some testing on this. It could be even slower. And then I tested it... Oh my, try it yourself @joewiz
```xquery
xquery version "3.1";

import module namespace xbow = "http://line-o.de/xq/xbow";

let $length := 10000
let $sequence-of-maps := (1 to $length) ! map { "n": ., "values": ("a" || util:random(100), "b" || util:random(100), "c" || util:random(100)) }
let $keyed-sequence :=
    for-each($sequence-of-maps, function ($item) {
        map:put($item, "key", string-join($item?values))
    })
return xbow:groupBy($keyed-sequence, xbow:pluck("key"))
    => map:for-each(function ($k, $v) {
        if (count($v) > 1) then array { $v } else ()
    })
```
@line-o Compared to the equivalent query in my post from yesterday, this one is considerably faster:
| length | FLWOR + group by | xbow:groupBy + map:for-each |
|---|---|---|
| 1000 | 0.062 | 0.052 |
| 2000 | 0.236 | 0.08 |
| 10000 | 6.1 | 0.4 |
| 20000 | * | 0.8 |
| 30000 | ** | 1.1 |
| 50000 | ** | 1.3 |
| 100000 | ** | 4.3 |

\* CPU pegged, never finishes, monex won't load, have to force-quit eXist
\*\* Didn't even try
This comparison suggests a clear win for xbow's HOF approach, whereas eXist's standard group by completely crumbles past length=10000.
(This "crumbling" phenomenon alone deserves our attention - can we develop some method to kill runaway queries? Can we agree that we should never have to force-quit eXist?)
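On the question of runaway queries: eXist does ship a query watchdog that can be configured in conf.xml. The excerpt below follows the shape of that setting, but the attribute values are illustrative, not recommendations; and, crucially, the watchdog only takes effect at points where the engine cooperatively checks for termination, so a tight processing loop can still overrun it.

```xml
<!-- Illustrative conf.xml watchdog excerpt.
     query-timeout is in milliseconds; -1 disables the check.
     output-size-limit caps the size of the generated result. -->
<watchdog query-timeout="60000" output-size-limit="1000000"/>
```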
As I probed the results of my "test runner" that performs 5 runs at many different "lengths" and produces the data used in the charts above, I realized that the step of generating the items was itself taking up a considerable portion of the time for each test run. Since generating dummy data is not the core concern in this issue, I decided to adjust my test to read from a pre-generated array of 100,000 items in a static file and perform the grouping operation on subsets of it. This way, we can be confident that the results show how each processor and grouping algorithm performs. Here are the results:
Questions:
Here are the queries, including both the "simple" scenario and the "test runner" scenario: profile-xquery-exist-basex-v2.zip
p.s. If it helps, here are the results from monex's profiling of the "test runner" scenarios:
- group by
- xbow
@joewiz I did some optimisations so that only the operation itself is timed. Both approaches are part of testbed.xq. Setup code (loading the data and keying it) is not measured as part of the run. The results better resemble what I concluded in my earlier tests a year ago: FLWOR runtimes compared to HOFs are roughly 2:1.
@line-o That gist is not the query you are looking for ⚡
@duncdrum the link should be fixed now.
What I find very interesting is that the speed at which an XQuery script finishes can change quite dramatically, even when making only minor changes to the overall setup. You really have to aim for the sweet spot of the optimiser.
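In the spirit of the testbed approach of timing only the operation itself, here is a minimal harness sketch. It assumes eXist's util:system-time(); the data shape is made up for illustration, and lazy evaluation may shift some work outside the timed window, so treat results as indicative.

```xquery
xquery version "3.1";

(: Sketch: time only the grouping step. Setup (generating and keying
   the data) happens before the first timestamp and is not measured. :)
let $keyed-sequence :=
    (1 to 10000) ! map { "n": ., "key": "k" || . mod 100 }
let $start := util:system-time()
let $result :=
    for $items in $keyed-sequence
    group by $key := $items?key
    where count($items) gt 1
    return array { $items }
let $elapsed := util:system-time() - $start
return map { "groups": count($result), "elapsed": $elapsed }
```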
Before anyone reads this, you might want to take a seat! The difference between the results I reported in my last post and what's below blew me away.
Here is the updated chart using @line-o's "testbed" approach:
Observations about the test runner:
Observations about "flwor" vs. "hof":
Implications:
While Docker itself has a performance overhead, at least we would have a controlled environment (defined in a docker-compose.yml) that is quick and easy to use. Maybe create an EXPath app wrapper with some instructions and a template that is preinstalled in the performance test instances. Finally, we want something that can crash and be restarted/reset, hence Docker makes the overhead worthwhile imv.
@joewiz Since I have a very light laptop, on which I performed my tests, I would now focus on threading. How many (real) cores is your eXist-db able to use? Background: you report 6:1, whereas I see more like 2(-2.5):1 runtime ratios.
What is really unclear to me: why your average runtime for 100K items with HOFs is over 600ms, while it is ~490ms on my machine.
@duncdrum Thanks for your feedback on the big picture questions!
@line-o I'll send you my machine info via DM.
Can we give dba users a swift, safe, and reliable mechanism to kill queries, so users are never put in the situation of having to force-quit eXist (and thus trigger a recovery run, etc.)?
Not easily... The way that queries are implemented in eXist-db means that they themselves have to check whether a request has been made to terminate them. They only do this between operations, they can't easily do it when they are in a tight loop of processing. The way that queries are executed and managed would need some serious re-design to get it working like you have described.
Are certain expressions or functions not working with the watchdog?
Likely, the developer has to remember to insert calls to check whether the query should be terminated. It's likely there are not as many of these checks as you would like.
Queries in this issue were all in-memory and not writing to the database. They should be eminently haltable. (If the topic of "killing queries" in general raises worries about database inconsistency, could we learn from the XQuery Update spec and provide rollback protections for updating functions and functions with the %updating annotation?)
Stopping in-memory queries should be fine. Stopping queries that are writing to the db is very dangerous in eXist-db as there is no real transaction abort/roll-back protection.
I think this issue is very interesting, but it probably needs to be broken into lots of separate issues. The problem is that rewriting queries and changing approaches is fine as a workaround, but it just moves the target when trying to find the performance issue. I think this issue likely reveals several distinct performance problems.
Describe the issue
As reported on Slack, I developed a query that eXist runs 10x slower than BaseX. In order to aid the core developers in isolating the bottleneck, I reduced the more complex query to the essential part that demonstrates the same performance characteristics, as follows:
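The reduced query itself did not survive in this copy of the issue. The following is a hedged reconstruction from details discussed in the thread (the dual for clause and a local:compare function performing pairwise comparison); the exact names, predicate, and data shapes are assumptions.

```xquery
xquery version "3.1";

(: Hedged reconstruction of the reduced query; names and shapes are inferred
   from the thread, not taken from the original post. :)
declare function local:compare ($a as xs:string*, $b as xs:string*) as xs:boolean {
    deep-equal($a, $b)
};

let $length := 10000
let $sequence-of-maps := (1 to $length) ! map { "n": ., "values": ("a" || ., "b" || ., "c" || .) }
for $item in $sequence-of-maps, $other in $sequence-of-maps
where $item?n lt $other?n and local:compare($item?values, $other?values)
return [$item, $other]
```

The dual for clause makes the comparison quadratic in $length, which matches the sharply degrading curve described below.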
Increasing the $length variable causes eXist's execution time to increase sharply. Here are the performance times I recorded (in seconds):
Graphing this, it became clear that eXist's performance degrades along a far sharper curve, while BaseX's performance declines more slowly:
Note that, of the eXist builds here, the one that performed the best was 5.3.0-SNAPSHOT with PR #3363.
What factors are contributing to the performance bottleneck in eXist? Is there a way to overcome them?
Context (please always complete the following information):
Additional context
conf.xml? None