Open lnogol opened 3 years ago
@lnogol Can you share some more details about the edge collections? In arangosh:
db.has_replied.count();
db.has_replied.indexes(true);
db.has_comment.count();
db.has_comment.indexes(true);
It would also be helpful if you could run
db._explain("UPSERT { _from: 'a', _to: 'b' } INSERT {} UPDATE {} IN collection RETURN {}");
for both edge collections, with and without your persistent index, and share the results here (especially the estimates are of interest).
Did you actually measure the performance difference between the best index being used vs. the edge index being chosen?
Did you actually measure the performance difference between the best index being used vs. the edge index being chosen?
not really, but we're experiencing the DB to be slow during load, so I went through all the queries we have to ensure they use the best index possible
the collection where the edge index is used has 1M documents so I would assume it should pick the better index for better performance
either way, UPSERT
reads, just like FOR
, so it should accept an indexHint
, right?
db.has_replied.count();
9
db.has_replied.indexes(true);
[
{
"fields" : [
"_key"
],
"figures" : {
"memory" : 0
},
"id" : "has_replied/0",
"name" : "primary",
"selectivityEstimate" : 1,
"sparse" : false,
"type" : "primary",
"unique" : true
},
{
"fields" : [
"_from",
"_to"
],
"figures" : {
"memory" : 0
},
"id" : "has_replied/2",
"name" : "edge",
"selectivityEstimate" : 0.6666666666666666,
"sparse" : false,
"type" : "edge",
"unique" : false
},
{
"deduplicate" : true,
"fields" : [
"_from",
"_to"
],
"figures" : {
"memory" : 0
},
"id" : "has_replied/50401978",
"inBackground" : true,
"name" : "from-to",
"selectivityEstimate" : 1,
"sparse" : false,
"type" : "persistent",
"unique" : false
}
]
db.has_comment.count();
1099100
db.has_comment.indexes(true);
[
{
"fields" : [
"_key"
],
"figures" : {
"memory" : 0
},
"id" : "has_comment/0",
"name" : "primary",
"selectivityEstimate" : 1,
"sparse" : false,
"type" : "primary",
"unique" : true
},
{
"fields" : [
"_from",
"_to"
],
"figures" : {
"memory" : 0
},
"id" : "has_comment/2",
"name" : "edge",
"selectivityEstimate" : 0.5377086735119772,
"sparse" : false,
"type" : "edge",
"unique" : false
},
{
"deduplicate" : true,
"fields" : [
"_from",
"_to"
],
"figures" : {
"memory" : 0
},
"id" : "has_comment/56250941",
"inBackground" : true,
"name" : "from-to",
"selectivityEstimate" : 0.9998779445868424,
"sparse" : false,
"type" : "persistent",
"unique" : false
}
]
has_replied
with index
Execution plan:
Id NodeType Site Est. Comment
1 SingletonNode COOR 1 * ROOT
24 SubqueryStartNode COOR 1 - LET #3 = ( /* subquery begin */
25 ScatterNode COOR 1 - SCATTER
26 RemoteNode DBS 1 - REMOTE
15 IndexNode DBS 1 - FOR #1 IN has_replied /* persistent index scan, 1 shard(s) */
22 RemoteNode COOR 1 - REMOTE
23 GatherNode COOR 1 - GATHER /* unsorted */
6 LimitNode COOR 1 - LIMIT 0, 1
27 SubqueryEndNode COOR 1 - RETURN #1 ) /* subquery end */
10 CalculationNode COOR 1 - LET #8 = { } /* json expression */ /* const assignment */
9 CalculationNode COOR 1 - LET $OLD = #3[0] /* simple expression */
16 DistributeNode COOR 1 - DISTRIBUTE /* create keys: true, variable: $OLD */
17 RemoteNode DBS 1 - REMOTE
12 UpsertNode DBS 1 - UPSERT $OLD INSERT #8 UPDATE #8 IN has_replied
18 RemoteNode COOR 1 - REMOTE
19 GatherNode COOR 1 - GATHER /* unsorted */
14 ReturnNode COOR 1 - RETURN #8
Indexes used:
By Name Type Collection Unique Sparse Selectivity Fields Ranges
15 from-to persistent has_replied false false 100.00 % [ _from
, _to
] ((#1._from
== "a") && (#1._to
== "b"))
12 primary primary has_replied true false 100.00 % [ _key
] $OLD
Optimization rules applied: Id RuleName 1 move-calculations-up 2 remove-redundant-calculations 3 remove-unnecessary-calculations 4 remove-data-modification-out-variables 5 use-indexes 6 remove-filter-covered-by-index 7 remove-unnecessary-calculations-2 8 move-calculations-down 9 distribute-in-cluster 10 scatter-in-cluster 11 remove-unnecessary-remote-scatter 12 splice-subqueries
Write query options: Option Value waitForSync false skipDocumentValidation false keepNull true mergeObjects true ignoreRevs true isRestore false ignoreErrors false ignoreDocumentNotFound false readCompleteInput false consultAqlWriteFilter false exclusive false nullMeansRemove false
- `has_replied` without index
Execution plan:
Id NodeType Site Est. Comment
1 SingletonNode COOR 1 ROOT
25 SubqueryStartNode COOR 1 - LET #3 = ( / subquery begin /
26 ScatterNode COOR 1 - SCATTER
27 RemoteNode DBS 1 - REMOTE
15 IndexNode DBS 1 - FOR #1 IN has_replied / edge index scan, 1 shard(s) / FILTER (#1._from
== "a") / early pruning /
23 RemoteNode COOR 1 - REMOTE
24 GatherNode COOR 1 - GATHER / unsorted /
6 LimitNode COOR 1 - LIMIT 0, 1
28 SubqueryEndNode COOR 1 - RETURN #1 ) / subquery end /
10 CalculationNode COOR 1 - LET #8 = { } / json expression / / const assignment /
9 CalculationNode COOR 1 - LET $OLD = #3[0] / simple expression /
17 DistributeNode COOR 1 - DISTRIBUTE / create keys: true, variable: $OLD /
18 RemoteNode DBS 1 - REMOTE
12 UpsertNode DBS 1 - UPSERT $OLD INSERT #8 UPDATE #8 IN has_replied
19 RemoteNode COOR 1 - REMOTE
20 GatherNode COOR 1 - GATHER / unsorted */
14 ReturnNode COOR 1 - RETURN #8
Indexes used:
By Name Type Collection Unique Sparse Selectivity Fields Ranges
15 edge edge has_replied false false 100.00 % [ _to
] (#1._to
== "b")
12 primary primary has_replied true false 100.00 % [ _key
] $OLD
Optimization rules applied: Id RuleName 1 move-calculations-up 2 remove-redundant-calculations 3 remove-unnecessary-calculations 4 remove-data-modification-out-variables 5 use-indexes 6 remove-filter-covered-by-index 7 move-calculations-down 8 distribute-in-cluster 9 scatter-in-cluster 10 distribute-filtercalc-to-cluster 11 remove-unnecessary-remote-scatter 12 move-filters-into-enumerate 13 splice-subqueries
Write query options: Option Value waitForSync false skipDocumentValidation false keepNull true mergeObjects true ignoreRevs true isRestore false ignoreErrors false ignoreDocumentNotFound false readCompleteInput false consultAqlWriteFilter false exclusive false nullMeansRemove false
- `has_comment` with index
Execution plan:
Id NodeType Site Est. Comment
1 SingletonNode COOR 1 ROOT
25 SubqueryStartNode COOR 1 - LET #3 = ( / subquery begin /
26 ScatterNode COOR 1 - SCATTER
27 RemoteNode DBS 1 - REMOTE
15 IndexNode DBS 1 - FOR #1 IN has_comment / edge index scan, 1 shard(s) / FILTER (#1._from
== "a") / early pruning /
23 RemoteNode COOR 1 - REMOTE
24 GatherNode COOR 1 - GATHER / unsorted /
6 LimitNode COOR 1 - LIMIT 0, 1
28 SubqueryEndNode COOR 1 - RETURN #1 ) / subquery end /
10 CalculationNode COOR 1 - LET #8 = { } / json expression / / const assignment /
9 CalculationNode COOR 1 - LET $OLD = #3[0] / simple expression /
17 DistributeNode COOR 1 - DISTRIBUTE / create keys: true, variable: $OLD /
18 RemoteNode DBS 1 - REMOTE
12 UpsertNode DBS 1 - UPSERT $OLD INSERT #8 UPDATE #8 IN has_comment
19 RemoteNode COOR 1 - REMOTE
20 GatherNode COOR 1 - GATHER / unsorted */
14 ReturnNode COOR 1 - RETURN #8
Indexes used:
By Name Type Collection Unique Sparse Selectivity Fields Ranges
15 edge edge has_comment false false 99.99 % [ _to
] (#1._to
== "b")
12 primary primary has_comment true false 100.00 % [ _key
] $OLD
Optimization rules applied: Id RuleName 1 move-calculations-up 2 remove-redundant-calculations 3 remove-unnecessary-calculations 4 remove-data-modification-out-variables 5 use-indexes 6 remove-filter-covered-by-index 7 move-calculations-down 8 distribute-in-cluster 9 scatter-in-cluster 10 distribute-filtercalc-to-cluster 11 remove-unnecessary-remote-scatter 12 move-filters-into-enumerate 13 splice-subqueries
Write query options: Option Value waitForSync false skipDocumentValidation false keepNull true mergeObjects true ignoreRevs true isRestore false ignoreErrors false ignoreDocumentNotFound false readCompleteInput false consultAqlWriteFilter false exclusive false nullMeansRemove false
- `has_comment` without index
Execution plan:
Id NodeType Site Est. Comment
1 SingletonNode COOR 1 ROOT
25 SubqueryStartNode COOR 1 - LET #3 = ( / subquery begin /
26 ScatterNode COOR 1 - SCATTER
27 RemoteNode DBS 1 - REMOTE
15 IndexNode DBS 1 - FOR #1 IN has_comment / edge index scan, 1 shard(s) / FILTER (#1._from
== "a") / early pruning /
23 RemoteNode COOR 1 - REMOTE
24 GatherNode COOR 1 - GATHER / unsorted /
6 LimitNode COOR 1 - LIMIT 0, 1
28 SubqueryEndNode COOR 1 - RETURN #1 ) / subquery end /
10 CalculationNode COOR 1 - LET #8 = { } / json expression / / const assignment /
9 CalculationNode COOR 1 - LET $OLD = #3[0] / simple expression /
17 DistributeNode COOR 1 - DISTRIBUTE / create keys: true, variable: $OLD /
18 RemoteNode DBS 1 - REMOTE
12 UpsertNode DBS 1 - UPSERT $OLD INSERT #8 UPDATE #8 IN has_comment
19 RemoteNode COOR 1 - REMOTE
20 GatherNode COOR 1 - GATHER / unsorted */
14 ReturnNode COOR 1 - RETURN #8
Indexes used:
By Name Type Collection Unique Sparse Selectivity Fields Ranges
15 edge edge has_comment false false 99.99 % [ _to
] (#1._to
== "b")
12 primary primary has_comment true false 100.00 % [ _key
] $OLD
Optimization rules applied: Id RuleName 1 move-calculations-up 2 remove-redundant-calculations 3 remove-unnecessary-calculations 4 remove-data-modification-out-variables 5 use-indexes 6 remove-filter-covered-by-index 7 move-calculations-down 8 distribute-in-cluster 9 scatter-in-cluster 10 distribute-filtercalc-to-cluster 11 remove-unnecessary-remote-scatter 12 move-filters-into-enumerate 13 splice-subqueries
Write query options: Option Value waitForSync false skipDocumentValidation false keepNull true mergeObjects true ignoreRevs true isRestore false ignoreErrors false ignoreDocumentNotFound false readCompleteInput false consultAqlWriteFilter false exclusive false nullMeansRemove false
Thanks for sharing all the information! I created an internal ticket APM-121 about adding indexHint support to UPSERT.
The default index choice is quite surprising... has_comment
has a million documents, but the edge index with a lower selectivity estimate is picked. has_replied
has just 9 documents and the combined index is chosen. The opposite would be expected here: if there aren't many documents in a collection, then it shouldn't matter much which index is chosen and the optimizer should prefer the edge index over secondary indexes by design. For example, with an edge collection with <1000 documents and a vertex-centric index, we would expect the optimizer to pick the edge index even if the vertex-centric index would be optimal. With more documents in the collection, the index choice should change to the vertex-centric index.
As a possible workaround, can you try the following?
FOR doc IN collection
FILTER doc._from == 'a' && doc._to == 'b'
UPSERT { _key: doc._key }
INSERT {}
UPDATE {}
IN collection
RETURN {}
If it chooses the edge index, then you can try to add an indexHint
to the FOR loop.
FOR doc IN has_comment OPTIONS { indexHint: 'from-to' }
FILTER doc._from == 'a' && doc._to == 'b'
UPSERT { _key: doc._key }
INSERT {}
UPDATE {}
IN has_comment
RETURN {}
this query picks the secondary index, not without the indexHint though
anyway, correct me if I'm wrong, that's not the same as
UPSERT { _from: 'a', _to: 'b' }
INSERT {}
UPDATE {}
IN has_comment
RETURN {}
because your suggested workaround would never execute INSERT
You are right, the for loop precludes the insertion of new documents and you could simply use a standard UPDATE to do the same.
Regarding the bad index choice, this fix might help: https://github.com/arangodb/arangodb/pull/14422
Support for indexHint was added to UPSERT in 3.9.0:
https://github.com/arangodb/arangodb/pull/14817
APM-121: allow the UPSERT query to have indexHint as an extra parameter for OPTIONS. It will be used as a hint by the inner FOR loop that is performed as part of the UPSERT query, and would help in cases such as UPSERT not picking the best index automatically for lookup.
My Environment
networkAttachedVolumes: true
on theArangoDeployment
,--server.statistics=true
as extra argarangodb/arangodb:3.7.11
)Component, Query & Data
Affected feature:
UPSERT
index usageSize of your Dataset on disk: 20GB
Replication Factor & Number of Shards (Cluster only): all collections have
replicationFactor: 3, writeConcern: 2
Number of shards: 1
Steps to reproduce
Problem:
I have 2 identical edge collections (same setting, same indexes). The same
UPSERT
query:uses the correct (best) index for the first collection but uses the worse one for the second collection (see image above)
I thought maybe I could provide an
indexHint
but that doesn't appear to be possible inUPSERT
and probably should be.Expected result:
either
UPSERT
allows to provide anindexHint