Open matcho opened 2 years ago
Confirmed in 3.11.6.
Any update on this, please?
@matcho : I will have a look at the issue today
I created the following dataset for the analysis:
c = db._create("z_obs");
docs = [];
for (i = 0; i < 3500000; ++i) {
  docs.push({
    _key: "test" + i,
    date_created: "20240613" + i,
    license: (i <= 2600000 ? "cc-by-sa" : (i < 2800000 ? "cc-by-nc" : "piff"))
  });
  if (docs.length === 10000) {
    c.insert(docs);
    docs = [];
  }
}
c.ensureIndex({ name: "license-date", type: "persistent", fields: ["license", "date_created"] });
c.ensureIndex({ name: "date-license", type: "persistent", fields: ["date_created", "license"] });
Running the query without any index hint produces the following execution plan with 3.12. The query is using the index on ["date_created", "license"]:
> db._profileQuery(`FOR o IN z_obs FILTER o.license IN [ "cc-by-sa", "foobar", "cc-by-nc" ] SORT o.date_created DESC LIMIT 0, 100 RETURN o._key`);
Execution plan:
Id NodeType Calls Par Items Filtered Runtime [s] Comment
1 SingletonNode 1 - 1 0 0.00001 * ROOT
10 IndexNode 1 - 100 0 0.00017 - FOR o IN z_obs /* reverse persistent index scan, index only (filter projections: `license`) */ FILTER (o.`license` IN [ "cc-by-sa", "foobar", "cc-by-nc" ]) /* early pruning */ /* with late materialization */
11 MaterializeNode 1 - 100 0 0.00019 - MATERIALIZE o /* (projections: `_key`) */
7 LimitNode 1 - 100 0 0.00001 - LIMIT 0, 100
8 CalculationNode 1 0 100 0 0.00003 - LET #3 = o.`_key` /* attribute expression */ /* collections used: o : z_obs */
9 ReturnNode 1 - 100 0 0.00001 - RETURN #3
Indexes used:
By Name Type Collection Unique Sparse Cache Selectivity Fields Stored values Ranges
10 date-license persistent z_obs false false false 100.00 % [ `date_created`, `license` ] [ ] *
Query Statistics:
Writes Exec Writes Ign Doc. Lookups Scan Full Scan Index Cache Hits/Misses Filtered Peak Mem [b] Exec Time [s]
0 0 100 0 100 0 / 0 0 0 0.00086
Forcing the query to use the index on ["license", "date_created"] also works, but it makes the query run significantly slower with my dataset:
> db._profileQuery(`FOR o IN z_obs OPTIONS {indexHint: "license-date", forceIndexHint: true} FILTER o.license IN [ "cc-by-sa", "foobar", "cc-by-nc" ] SORT o.date_created DESC LIMIT 0, 100 RETURN o._key`);
Execution plan:
Id NodeType Calls Par Items Filtered Runtime [s] Comment
1 SingletonNode 1 - 1 0 0.00001 * ROOT
10 IndexNode 3501 - 3500000 0 1.00962 - FOR o IN z_obs /* persistent index scan, index only (projections: `date_created`) */ LET #2 = o.`date_created` /* with late materialization */
6 SortNode 1 - 100 0 1.14253 - SORT #2 DESC /* sorting strategy: constrained heap */
7 LimitNode 1 - 100 0 0.00001 - LIMIT 0, 100
11 MaterializeNode 1 - 100 0 0.00016 - MATERIALIZE o
8 CalculationNode 1 0 100 0 0.00003 - LET #3 = o.`_key` /* attribute expression */ /* collections used: o : z_obs */
9 ReturnNode 1 - 100 0 0.00002 - RETURN #3
Indexes used:
By Name Type Collection Unique Sparse Cache Selectivity Fields Stored values Ranges
10 license-date persistent z_obs false false false 99.98 % [ `license`, `date_created` ] [ ] (o.`license` IN [ "cc-by-nc", "cc-by-sa", "foobar" ])
Query Statistics:
Writes Exec Writes Ign Doc. Lookups Scan Full Scan Index Cache Hits/Misses Filtered Peak Mem [b] Exec Time [s]
0 0 3500000 0 3500000 0 / 0 0 196608 2.15386
With the dataset I used, doing a reverse index scan on the index on ["date_created", "license"] and then filtering out the index entries with a non-matching license is a good choice. The query can scan the index to get all documents sorted by date_created, so it can avoid sorting all matches in memory. Additionally, the query needs to post-filter the index entries using the license attribute, which is also covered by the index. The query can stop processing once it has found 100 matching index entries.
So the performance of this query will depend on how frequently the searched-for license values appear in the dataset. If they appear often and are somewhat equally distributed, this query will be very fast. Looking for a non-existing license, however, means that all index entries will be scanned and the post-filter applied to all of them, resulting in no matches. This means the worst case performance of the query is O(n), where n is the number of index entries.
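The reverse-scan-plus-post-filter behavior described above can be sketched with a small standalone simulation. This is illustrative plain JavaScript with invented data and function names, not ArangoDB internals:

```javascript
// Illustrative sketch, not ArangoDB code: simulate a reverse scan over an
// index sorted by date_created, post-filtering on license and stopping as
// soon as `limit` matches are found.
function reverseScanPostFilter(index, wanted, limit) {
  const results = [];
  let scanned = 0;
  // `index` is sorted ascending by date_created, so walk it backwards
  for (let i = index.length - 1; i >= 0 && results.length < limit; i--) {
    scanned++;
    if (wanted.has(index[i].license)) {
      results.push(index[i]._key);
    }
  }
  return { results, scanned };
}

// Toy index: licenses alternate, so matches for either value are frequent.
const index = [];
for (let i = 0; i < 100000; i++) {
  index.push({
    _key: "test" + i,
    date_created: i, // already in ascending date order
    license: i % 2 === 0 ? "cc-by-sa" : "cc-by-nc"
  });
}

// A frequent value: the scan stops after about 2 * limit entries.
const frequent = reverseScanPostFilter(index, new Set(["cc-by-sa"]), 100);
// A non-existing value: the whole index is scanned -- the O(n) worst case.
const missing = reverseScanPostFilter(index, new Set(["foobar"]), 100);
console.log(frequent.scanned); // 200
console.log(missing.scanned);  // 100000
```

The early-out on `results.length < limit` is what makes frequent values cheap, and its absence of effect for non-matching values is what makes the worst case linear.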
When using the other index on ["license", "date_created"], the query can use the index to efficiently find all matching entries for the searched-for licenses. If the filter condition matches many documents, then all their date_created values will be extracted to memory, so that an overall sorted result can be established based on the documents' date_created values. The best case for this query is when no documents match the filter condition. In this case, it will only do the index lookups, no sorts and no document materialization. However, if the filter condition matches documents, then the query will need to do actual work, and it will get slower with more matches. The worst case performance of this query is also O(n), where n is the number of index entries. This will only happen, however, if all documents match the filter condition.
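This cost profile can also be sketched as a standalone simulation. Again, this is an illustrative model with invented names and toy data, not ArangoDB code; the real engine uses a constrained heap for the sort, but the amount of work still grows with the number of matches:

```javascript
// Illustrative sketch: with a [license, date_created] index, the lookup
// itself is cheap, but all matching entries must be collected and ordered
// by date_created before the LIMIT can be applied.
function lookupThenSort(index, wanted, limit) {
  // model the index range lookup: every entry whose license matches
  const matches = index.filter(e => wanted.has(e.license));
  // sort all matches by date_created, newest first
  matches.sort((a, b) => b.date_created - a.date_created);
  return {
    results: matches.slice(0, limit).map(e => e._key),
    examined: matches.length
  };
}

const index = [];
for (let i = 0; i < 100000; i++) {
  index.push({ _key: "test" + i, date_created: i,
               license: i % 2 === 0 ? "cc-by-sa" : "cc-by-nc" });
}

// Frequent value: 50000 matches are collected and sorted to produce 100 rows.
const frequent = lookupThenSort(index, new Set(["cc-by-sa"]), 100);
// Non-existing value: the best case -- the lookup finds nothing to sort.
const missing = lookupThenSort(index, new Set(["foobar"]), 100);
console.log(frequent.examined); // 50000
console.log(missing.examined);  // 0
```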
So the takeaway is that using the ["date_created", "license"] index is optimal when looking for frequently occurring license values, and very slow when looking for infrequently occurring or even non-existing ones.
Conversely, the index on ["license", "date_created"] is good when looking for infrequently occurring or non-existing licenses, and bad when looking for very frequently occurring ones.
So the optimal plan depends on which lookup values are used in the query. The indexes in ArangoDB do not have cardinality estimates for different lookup values; they only maintain a global "selectivity estimate". Thus it is currently impossible for the ArangoDB query optimizer to tell how frequently different lookup values actually occur. It will assume that all lookup values occur equally often.
In case it is known in advance how frequently a value occurs in the index, the application could use an index hint to push query plan selection in the right direction. In case the frequency is not known, the optimizer will likely prefer the index that avoids the post-sorting, at least it did in my case.
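The limitation of a single global estimate can be illustrated numerically with the counts from the dataset built above (the calculation is a simplified model, not how ArangoDB computes its estimate):

```javascript
// Numeric illustration: a single global selectivity estimate treats every
// lookup value as equally frequent, which is far from true for skewed data.
// Counts follow the dataset from this thread.
const counts = { "cc-by-sa": 2600000, "cc-by-nc": 200000, "piff": 700000 };
const total = Object.values(counts).reduce((a, b) => a + b, 0); // 3500000
const distinct = Object.keys(counts).length;                    // 3

// Under an "all values equally frequent" assumption:
const estimatedPerValue = Math.round(total / distinct); // 1166667 per value

// Reality is heavily skewed -- and "foobar" does not occur at all:
console.log(estimatedPerValue);     // 1166667
console.log(counts["cc-by-sa"]);    // 2600000
console.log(counts["foobar"] ?? 0); // 0
```

The estimate overshoots "foobar" by more than a million entries and undershoots "cc-by-sa" by more than half, so neither index choice can be reliably derived from it.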
Hello @jsteemann ,
First, thank you very much for investigating this case.
We had not thought of the reverse index scan on ["date_created", "license"]; this is an interesting way of executing the query, although it also depends on the frequency of each license value among the docs.
The indexes in ArangoDB do not have cardinality estimates for different lookup values; they only maintain a global "selectivity estimate". Thus it is currently impossible for the ArangoDB query optimizer to tell how frequently different lookup values actually occur. It will assume that all lookup values occur equally often.
This is precious information :+1:
(…) then all their date_created values will be extracted to memory, so that an overall sorted result can be established based on the documents' date_created values. The best case for this query is when no documents match the filter condition. In this case, it will only do the index lookups, no sorts and no document materialization. However, if the filter condition matches documents, then the query will need to do actual work, and it will get slower with more matches.
This seems to be the case indeed, but that was actually the point of my question: why is it so? Why couldn't just the top-N documents be materialized for each matching license (according to the LIMIT statement), so that the overall sort on date_created is faster? If I'm not mistaken, this can be done "by hand" with the query below, which runs very fast (8ms). A diff shows that, as expected, the results are the same and in the same order as the results from Query 2.
Could we imagine the optimizer doing such an optimization by itself, in the case of a SORT / LIMIT statement using the same field(s) as a FILTER … IN … statement in the same query?
LET ccbysa = (
  FOR o IN z_obs
    OPTIONS { "indexHint": "idx_1735704198117326848" }
    FILTER o.license == "cc-by-sa"
    SORT o.date_created DESC
    LIMIT 0, 100
    RETURN {
      _key: o._key,
      date_created: o.date_created
    }
)
LET ccbync = (
  FOR o IN z_obs
    OPTIONS { "indexHint": "idx_1735704198117326848" }
    FILTER o.license == "cc-by-nc"
    SORT o.date_created DESC
    LIMIT 0, 100
    RETURN {
      _key: o._key,
      date_created: o.date_created
    }
)
LET foobar = (
  FOR o IN z_obs
    OPTIONS { "indexHint": "idx_1735704198117326848" }
    FILTER o.license == "foobar"
    SORT o.date_created DESC
    LIMIT 0, 100
    RETURN {
      _key: o._key,
      date_created: o.date_created
    }
)
LET alllicenses = UNION(ccbysa, ccbync, foobar)
FOR doc IN alllicenses
  SORT doc.date_created DESC
  LIMIT 100
  RETURN doc._key
Profile:
Execution plan:
Id NodeType Calls Items Filtered Runtime [s] Comment
1 SingletonNode 1 1 0 0.00001 * ROOT
46 SubqueryStartNode 1 2 0 0.00002 - LET foobar = ( /* subquery begin */
41 IndexNode 1 1 0 0.00021 - FOR o IN z_obs /* reverse persistent index scan, index scan + document lookup (projections: `_key`, `date_created`) */
28 LimitNode 1 1 0 0.00001 - LIMIT 0, 100
29 CalculationNode 1 1 0 0.00001 - LET #30 = { "_key" : o.`_key`, "date_created" : o.`date_created` } /* simple expression */ /* collections used: o : z_obs */
47 SubqueryEndNode 1 1 0 0.00001 - RETURN #30 ) /* subquery end */
44 SubqueryStartNode 1 2 0 0.00001 - LET ccbync = ( /* subquery begin */
40 IndexNode 1 101 0 0.00279 - FOR o IN z_obs /* reverse persistent index scan, index scan + document lookup (projections: `_key`, `date_created`) */
18 LimitNode 1 101 0 0.00001 - LIMIT 0, 100
19 CalculationNode 1 101 0 0.00025 - LET #24 = { "_key" : o.`_key`, "date_created" : o.`date_created` } /* simple expression */ /* collections used: o : z_obs */
45 SubqueryEndNode 1 1 0 0.00005 - RETURN #24 ) /* subquery end */
42 SubqueryStartNode 1 2 0 0.00002 - LET ccbysa = ( /* subquery begin */
39 IndexNode 1 101 0 0.00246 - FOR o IN z_obs /* reverse persistent index scan, index scan + document lookup (projections: `_key`, `date_created`) */
8 LimitNode 1 101 0 0.00001 - LIMIT 0, 100
9 CalculationNode 1 101 0 0.00020 - LET #18 = { "_key" : o.`_key`, "date_created" : o.`date_created` } /* simple expression */ /* collections used: o : z_obs */
43 SubqueryEndNode 1 1 0 0.00008 - RETURN #18 ) /* subquery end */
32 CalculationNode 1 1 0 0.00005 - LET alllicenses = UNION(ccbysa, ccbync, foobar) /* simple expression */
33 EnumerateListNode 1 200 0 0.00013 - FOR doc IN alllicenses /* list iteration */
34 CalculationNode 1 200 0 0.00007 - LET #32 = doc.`date_created` /* attribute expression */
35 SortNode 1 100 0 0.00015 - SORT #32 DESC /* sorting strategy: standard */
36 LimitNode 1 100 0 0.00001 - LIMIT 0, 100
37 CalculationNode 1 100 0 0.00005 - LET #34 = doc.`_key` /* attribute expression */
38 ReturnNode 1 100 0 0.00001 - RETURN #34
Indexes used:
By Name Type Collection Unique Sparse Cache Selectivity Fields Stored values Ranges
41 idx_1735704198117326848 persistent z_obs false false false 49.27 % [ `license`, `date_created` ] [ ] (o.`license` == "foobar")
40 idx_1735704198117326848 persistent z_obs false false false 49.27 % [ `license`, `date_created` ] [ ] (o.`license` == "cc-by-nc")
39 idx_1735704198117326848 persistent z_obs false false false 49.27 % [ `license`, `date_created` ] [ ] (o.`license` == "cc-by-sa")
Optimization rules applied:
Id RuleName
1 move-calculations-up
2 move-filters-up
3 move-calculations-up-2
4 move-filters-up-2
5 use-indexes
6 remove-filter-covered-by-index
7 use-index-for-sort
8 remove-unnecessary-calculations-2
9 move-calculations-down
10 reduce-extraction-to-projection
11 splice-subqueries
Query Statistics:
Writes Exec Writes Ign Scan Full Scan Index Cache Hits/Misses Filtered Peak Mem [b] Exec Time [s]
0 0 0 200 0 / 0 0 98304 0.01713
Query Profile:
Query Stage Duration [s]
initializing 0.00000
parsing 0.00019
optimizing ast 0.00003
loading collections 0.00001
instantiating plan 0.00013
optimizing plan 0.00133
instantiating executors 0.00026
executing 0.01517
finalizing 0.00010
Thank you again for your time, all the precious information about how indexes work, and for considering the suggested optimization.
Mathias
@matcho : your suggestion to split up the query into 3 different lookup subqueries, each with a limit, and then joining their results for getting the data back in total order makes a lot of sense.
However, I wouldn't like the optimizer to transform the query this way automatically, for the following reasons:
- The list of lookup values is not always known at query optimization time, e.g. LET categories = (FOR c IN category RETURN c._key) FOR doc IN collection FILTER doc.category IN categories ...
- The UNION(...) function requires materializing the results from all subqueries first, which may consume a lot of memory if the results of the subqueries are large.
But still the idea is valuable, because the performance improvement is substantial. I tried a different solution, which adds a limit to the index lookup code. This has the same benefit and performance improvement, but does not require transforming the query automatically or manually. Right now I haven't implemented that properly, just did a quick POC to verify the performance benefit. To make this a proper feature, it would need to be implemented and tested properly. I can't say much yet about if and when this could be done, but it looks like a promising optimization.
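The "limit pushed into the index lookup" idea can be sketched roughly as follows. This is an assumed model in plain JavaScript with invented names and toy data; the actual implementation inside ArangoDB's index lookup code will differ:

```javascript
// Rough sketch of the idea: for each lookup value, read at most `limit`
// entries from its (date-sorted) index range, then merge the per-value
// top-N lists and apply the limit once more. Illustrative only.
function perValueLimitedLookup(index, values, limit) {
  let examined = 0;
  const partials = values.map(v => {
    // model one [license, date_created] index range: entries for a single
    // value, newest first, cut off after `limit` entries
    const range = index
      .filter(e => e.license === v)
      .sort((a, b) => b.date_created - a.date_created)
      .slice(0, limit);
    examined += range.length;
    return range;
  });
  // at most values.length * limit candidates survive to the final sort
  const merged = partials.flat()
    .sort((a, b) => b.date_created - a.date_created)
    .slice(0, limit);
  return { results: merged.map(e => e._key), examined };
}

const index = [];
for (let i = 0; i < 100000; i++) {
  index.push({ _key: "test" + i, date_created: i,
               license: i % 2 === 0 ? "cc-by-sa" : "cc-by-nc" });
}
const out = perValueLimitedLookup(index, ["cc-by-sa", "cc-by-nc", "foobar"], 100);
console.log(out.examined);       // 200: two ranges of 100, one empty range
console.log(out.results.length); // 100
```

Because each per-value range is already sorted by date_created in such an index, cutting it off at the limit bounds the final merge to values.length * limit candidates, regardless of how many documents match overall.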
OK I understand, the limitations you mention definitely make sense. Looking forward to hearing news about that, then. Thank you very much @jsteemann !
Hi @matcho, we have a working solution for this problem in the PR https://github.com/arangodb/arangodb/pull/21127. Hopefully it will be merged soon and then it will probably come out in the upcoming release 3.12.2.
Great news, thank you @jbajic !
My Environment
Component, Query & Data
Affected feature: Query optimization and execution
AQL query (if applicable):
Q1. (very fast)
Q2. (slow)
AQL explain and/or profile (if applicable): Q1
Q2
Dataset: 3.5M documents in the z_obs collection. Most of them (2.6M) have a "cc-by-sa" license, 200k have a "cc-by-nc" license, none has a "foobar" license. A 2-field index on [ license, date_created ] (idx_1735704198117326848).
Size of your Dataset on disk: A few dozen GBs.
Steps to reproduce
1. Run Q1. Result is super fast (3ms), the persistent index on [ license, date_created ] is chosen as expected.
2. Run Q2 without indexHint. An index with date_created as the first field is chosen; result speed is OK (1s) because most docs have one of the expected licenses. But if we change the licenses we look for, for less frequent ones, the query gets very slow.
3. Run Q2 with indexHint on [ license, date_created ]. Query is slow (2s). If we change the licenses we look for, for less frequent ones, the query gets faster.
Problem: Speed of Q2 with indexHint seems to get lower when more docs match the FILTER criteria, which suggests that all documents matching each criterion are gathered before computing the top 100 induced by the LIMIT statement, instead of gathering only the top 100 for each criterion.
Expected result:
One could expect the query engine to apply the SORT / LIMIT to each FILTER value before computing the global SORT / LIMIT on the union of the preceding results.
Also, one could then expect the index on [ license, date_created ] to be chosen without the need of an indexHint.
Thank you very much, Greetings,
Mathias