Poor performance on OR filter (multiple values) with SORT statement and appropriate 2-field index

My Environment

ArangoDB Version: 3.9.0
Deployment Mode: Single Server
Deployment Strategy: systemd
Infrastructure: own
Operating System: Ubuntu 20.04
Total RAM in your machine: 128G
Disks in use: HDD
Used Package: Ubuntu .deb

Component, Query & Data

Affected feature: Query optimization and execution

AQL query (if applicable):

Q1. (very fast)

FOR o IN z_obs
    FILTER o.license IN [ "cc-by-sa" ]
    SORT o.date_created DESC
    LIMIT 0, 100
    RETURN o._key

Q2. (slow)

FOR o IN z_obs
    OPTIONS { "indexHint": "idx_1735704198117326848" }
    FILTER o.license IN [ "cc-by-sa", "foobar", "cc-by-nc" ]
    SORT o.date_created DESC
    LIMIT 0, 100
    RETURN o._key

AQL explain and/or profile (if applicable): Q1

Query String (118 chars, cacheable: true):
 FOR o IN z_obs
     FILTER o.license IN [ "cc-by-sa" ]
     SORT o.date_created DESC
     LIMIT 0, 100
     RETURN o._key

Execution plan:
 Id   NodeType            Est.   Comment
  1   SingletonNode          1   * ROOT
 10   IndexNode         156989     - FOR o IN z_obs   /* reverse persistent index scan, projections: `_key` */    
  7   LimitNode            100       - LIMIT 0, 100
  8   CalculationNode      100       - LET #5 = o.`_key`   /* attribute expression */   /* collections used: o : z_obs */
  9   ReturnNode           100       - RETURN #5

Indexes used:
 By   Name                      Type         Collection   Unique   Sparse   Selectivity   Fields                          Ranges
 10   idx_1735704198117326848   persistent   z_obs        false    false        86.60 %   [ `license`, `date_created` ]   (o.`license` == "cc-by-sa")

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   move-filters-up
  3   move-calculations-up-2
  4   move-filters-up-2
  5   use-indexes
  6   remove-filter-covered-by-index
  7   use-index-for-sort
  8   remove-unnecessary-calculations-2
  9   reduce-extraction-to-projection

Optimization rules with highest execution times:
 RuleName                                    Duration [s]
 use-indexes                                      0.00010
 move-filters-up                                  0.00002
 remove-filter-covered-by-index                   0.00002
 move-calculations-up                             0.00002
 use-index-for-sort                               0.00001

41 rule(s) executed, 1 plan(s) created

Query String (195 chars, cacheable: true):
 FOR o IN z_obs
     OPTIONS { "indexHint": "idx_1735704198117326848" }
     FILTER o.license IN [ "cc-by-sa", "foobar", "cc-by-nc" ]
     SORT o.date_created DESC
     LIMIT 0, 100
     RETURN o._key

Execution plan:
 Id   NodeType            Est.   Comment
  1   SingletonNode          1   * ROOT
 10   IndexNode         313979     - FOR o IN z_obs LET #3 = o.`date_created`   /* persistent index scan, projections: `_key`, `date_created` */    /* with late materialization */
  6   SortNode          313979       - SORT #3 DESC   /* sorting strategy: constrained heap */
  7   LimitNode            100       - LIMIT 0, 100
 11   MaterializeNode      100       - MATERIALIZE o
  8   CalculationNode      100       - LET #5 = o.`_key`   /* attribute expression */   /* collections used: o : z_obs */
  9   ReturnNode           100       - RETURN #5

Indexes used:
 By   Name                      Type         Collection   Unique   Sparse   Selectivity   Fields                          Ranges
 10   idx_1735704198117326848   persistent   z_obs        false    false        86.60 %   [ `license`, `date_created` ]   (o.`license` IN [ "cc-by-nc", "cc-by-sa", "foobar" ])

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   move-filters-up
  3   move-calculations-up-2
  4   move-filters-up-2
  5   use-indexes
  6   remove-filter-covered-by-index
  7   remove-unnecessary-calculations-2
  8   sort-limit
  9   reduce-extraction-to-projection
 10   late-document-materialization

Optimization rules with highest execution times:
 RuleName                                    Duration [s]
 use-indexes                                      0.00010
 remove-filter-covered-by-index                   0.00003
 late-document-materialization                    0.00003
 reduce-extraction-to-projection                  0.00002
 move-calculations-up                             0.00002

41 rule(s) executed, 1 plan(s) created

Dataset: 3.5M documents in z_obs collection. Most of them (2.6M) have a "cc-by-sa" license, 200k have a "cc-by-nc" license, none has a "foobar" license.

A 2-field index on [ license, date_created ] (idx_1735704198117326848)

Size of your Dataset on disk: A few dozen GBs.

Steps to reproduce

Run Q1. Result is super fast (3ms), persistent index on [ license, date_created ] is chosen as expected
Run Q2 without indexHint. An index with date_created as the first field is chosen, result speed is OK (1s) because most docs have one of the expected licenses. But if we change the licenses we look for, for less frequent ones, query gets very slow.
Run Q2 with indexHint on [ license, date_created ]. Query is slow (2s). If we change the licenses we look for, for less frequent ones, query gets faster.

Problem: Speed of Q2 with indexHint seems to get lower when more docs match the FILTER criteria, which suggests that all documents matching each criterion are gathered before computing the top 100 induced by the LIMIT statement, instead of gathering only the top 100 for each criterion.

Expected result:

One could expect that the query engine applies the SORT / LIMIT to each FILTER value, before computing the global SORT / LIMIT on the union of the preceding results.

Also, one could then expect that the index on [ license, date_created ] is chosen without the need of an indexHint.

Thank you very much, Greetings,

Mathias

Confirmed in 3.11.6.

Any update on this please ?

@matcho : I will have a look at the issue today

I created the following dataset for the analysis:

c = db._create("z_obs"); 
docs = []; 
for (i = 0; i < 3500000; ++i) { 
  docs.push({ 
    _key: "test" + i, 
    date_created: "20240613" + i, 
    license: (i <= 2600000 ? "cc-by-sa" : (i < 28000000 ? "cc-by-nc" : "piff")) 
  }); 
  if (docs.length === 10000) { 
    c.insert(docs); 
    docs = []; 
  } 
} 
c.ensureIndex({ name: "license-date", type: "persistent", fields: ["license", "date_created"] }); 
c.ensureIndex({ name: "date-license", type: "persistent", fields: ["date_created", "license"] });

Running the query without any index hint produces the following execution plan with 3.12. The query is using the index on ["date_created", "license" ]:

> db._profileQuery(`FOR o IN z_obs FILTER o.license IN [ "cc-by-sa", "foobar", "cc-by-nc" ] SORT o.date_created DESC LIMIT 0, 100 RETURN o._key`);

Execution plan:
 Id   NodeType          Calls   Par   Items   Filtered   Runtime [s]   Comment
  1   SingletonNode         1     -       1          0       0.00001   * ROOT
 10   IndexNode             1     -     100          0       0.00017     - FOR o IN z_obs   /* reverse persistent index scan, index only (filter projections: `license`) */    FILTER (o.`license` IN [ "cc-by-sa", "foobar", "cc-by-nc" ])   /* early pruning */   /* with late materialization */
 11   MaterializeNode       1     -     100          0       0.00019       - MATERIALIZE o /* (projections: `_key`) */
  7   LimitNode             1     -     100          0       0.00001       - LIMIT 0, 100
  8   CalculationNode       1     0     100          0       0.00003       - LET #3 = o.`_key`   /* attribute expression */   /* collections used: o : z_obs */
  9   ReturnNode            1     -     100          0       0.00001       - RETURN #3

Indexes used:
 By   Name           Type         Collection   Unique   Sparse   Cache   Selectivity   Fields                          Stored values   Ranges
 10   date-license   persistent   z_obs        false    false    false      100.00 %   [ `date_created`, `license` ]   [  ]            *

Query Statistics:
 Writes Exec      Writes Ign      Doc. Lookups      Scan Full      Scan Index      Cache Hits/Misses      Filtered      Peak Mem [b]      Exec Time [s]
           0               0               100              0             100                  0 / 0             0                 0            0.00086

Forcing the query to use the index on ["license", "date_created"] also works, but it will make the query run significantly slower with my dataset:

> db._profileQuery(`FOR o IN z_obs OPTIONS {indexHint: "license-date", forceIndexHint: true} FILTER o.license IN [ "cc-by-sa", "foobar", "cc-by-nc" ] SORT o.date_created DESC LIMIT 0, 100 RETURN o._key`);

Execution plan:
 Id   NodeType          Calls   Par     Items   Filtered   Runtime [s]   Comment
  1   SingletonNode         1     -         1          0       0.00001   * ROOT
 10   IndexNode          3501     -   3500000          0       1.00962     - FOR o IN z_obs   /* persistent index scan, index only (projections: `date_created`) */    LET #2 = o.`date_created`   /* with late materialization */
  6   SortNode              1     -       100          0       1.14253       - SORT #2 DESC   /* sorting strategy: constrained heap */
  7   LimitNode             1     -       100          0       0.00001       - LIMIT 0, 100
 11   MaterializeNode       1     -       100          0       0.00016       - MATERIALIZE o
  8   CalculationNode       1     0       100          0       0.00003       - LET #3 = o.`_key`   /* attribute expression */   /* collections used: o : z_obs */
  9   ReturnNode            1     -       100          0       0.00002       - RETURN #3

Indexes used:
 By   Name           Type         Collection   Unique   Sparse   Cache   Selectivity   Fields                          Stored values   Ranges
 10   license-date   persistent   z_obs        false    false    false       99.98 %   [ `license`, `date_created` ]   [  ]            (o.`license` IN [ "cc-by-nc", "cc-by-sa", "foobar" ])

Query Statistics:
 Writes Exec      Writes Ign      Doc. Lookups      Scan Full      Scan Index      Cache Hits/Misses      Filtered      Peak Mem [b]      Exec Time [s]
           0               0           3500000              0         3500000                  0 / 0             0            196608            2.15386

With the dataset I used, doing a reverse index scan on the index on ["date_created", "license"] an then filtering out the index entries with a non-matching license is a good choice. The query can scan the index to get all documents sorted by date_created, so it can avoid sorting all matches in memory. Additionally the query needs to post-filter the index entries using the license attribute, which is also covered by the index. The query can stop processing once it has found 100 matching index entries. So the performance of this query will depend on how frequent the searched-for license values appear in the dataset. If they appear often and are somewhat equally distributed, this query will be very fast. Looking for a non-existing license will mean however that all index entries will be scanned and the post-filter will be applied to all of them, resulting in no matches. This will mean the worst case performance of the query is O(n), where n is the number of index entries.

When using the other index on ["license", "date_created"], the query can use the index to efficiently find all matching entries for the search-for licenses. If the filter condition matches for many documents, then all their date_created values will be extracted to memory, so that an overall sorted result can be established based on the documents' date_created values. The best case for this query is when no documents match the filter condition. In this case, it will only do the index lookups, no sorts and no document materialization. However, if the filter condition matches documents, then the query will need to do actual work, and it will get slower with more matches. The worst case performance of this query is also O(n), when n is the number of index entries. This will only happen however if all documents match the filter condition.

So the takeaway is that using the ["date_created", "license"] index is optimal when looking for frequently-occurring license values. And when looking for infrequently occurring license values or even non-existing ones, it will be very slow.

On the opposite, using the index on ["license", "date_created"] is good when looking for infrequently occurring or non-existing licenses. And using that index is bad when looking for very frequently occurring licenses.

So the optimal plan depends on which lookup values are used for the query. The indexes in ArangoDB do not have cardinality estimates for different lookup values, but they only maintain a global "selectivity estimate". Thus it is impossible for the ArangoDB query optimizer at the moment to tell how frequently occurring different lookup values actually are. It will assume that all lookup values occur equally often.

In case it is known in advance how frequently a value occurs in the index, the application could use an index hint to push query plan selection in the right direction. In case the frequency is not known, the optimizer will likely prefer the index that avoids the post-sorting, at least it did in my case.

Hello @jsteemann ,

First, thank you very much for investigating on this case.

We had not thought of the reverse index scan on ["date_created", "license"], this is an interesting way of achieving the query, although it also depends on the frequency of each license value among the docs.

The indexes in ArangoDB do not have cardinality estimates for different lookup values, but they only maintain a global "selectivity estimate". Thus it is impossible for the ArangoDB query optimizer at the moment to tell how frequently occurring different lookup values actually are. It will assume that all lookup values occur equally often.

This is precious information :+1:

(…) then all their date_created values will be extracted to memory, so that an overall sorted result can be established based on the documents' date_created values. The best case for this query is when no documents match the filter condition. In this case, it will only do the index lookups, no sorts and no document materialization. However, if the filter condition matches documents, then the query will need to do actual work, and it will get slower with more matches.

This seems to be the case indeed, but that was actually the object of my question : why is it so ? Why couldn't just the topN number of documents be materialized for each matching license (according to the LIMIT statement), so that the overall sort on date_created is faster ? If I'm not mistaken, this can be done "by-hand" with the query below, that runs very fast (8ms). A diff shows that, as expected, results are the same and in the same order as results from Query 2.

Could we imagine that the optimizer does such an optimization by itself, in the case of a SORT / LIMIT statement using the same field(s) as a FILTER … IN … statement in the same query ?

LET ccbysa = (
    FOR o IN z_obs
        OPTIONS { "indexHint": "idx_1735704198117326848" }
        FILTER o.license == "cc-by-sa"
        SORT o.date_created DESC
        LIMIT 0, 100
        RETURN {
            _key: o._key,
            date_created: o.date_created
        }
)
LET ccbync = (
    FOR o IN z_obs
        OPTIONS { "indexHint": "idx_1735704198117326848" }
        FILTER o.license == "cc-by-nc"
        SORT o.date_created DESC
        LIMIT 0, 100
        RETURN {
            _key: o._key,
            date_created: o.date_created
        }
)
LET foobar = (
    FOR o IN z_obs
        OPTIONS { "indexHint": "idx_1735704198117326848" }
        FILTER o.license == "foobar"
        SORT o.date_created DESC
        LIMIT 0, 100
        RETURN {
            _key: o._key,
            date_created: o.date_created
        }
)

LET alllicenses = UNION(ccbysa, ccbync, foobar)

FOR doc IN alllicenses
    SORT doc.date_created DESC
    LIMIT 100
    RETURN doc._key

Profile :

Execution plan:
 Id   NodeType            Calls   Items   Filtered   Runtime [s]   Comment
  1   SingletonNode           1       1          0       0.00001   * ROOT
 46   SubqueryStartNode       1       2          0       0.00002     - LET foobar = ( /* subquery begin */
 41   IndexNode               1       1          0       0.00021       - FOR o IN z_obs   /* reverse persistent index scan, index scan + document lookup (projections: `_key`, `date_created`) */    
 28   LimitNode               1       1          0       0.00001         - LIMIT 0, 100
 29   CalculationNode         1       1          0       0.00001         - LET #30 = { "_key" : o.`_key`, "date_created" : o.`date_created` }   /* simple expression */   /* collections used: o : z_obs */
 47   SubqueryEndNode         1       1          0       0.00001         - RETURN  #30 ) /* subquery end */
 44   SubqueryStartNode       1       2          0       0.00001     - LET ccbync = ( /* subquery begin */
 40   IndexNode               1     101          0       0.00279       - FOR o IN z_obs   /* reverse persistent index scan, index scan + document lookup (projections: `_key`, `date_created`) */    
 18   LimitNode               1     101          0       0.00001         - LIMIT 0, 100
 19   CalculationNode         1     101          0       0.00025         - LET #24 = { "_key" : o.`_key`, "date_created" : o.`date_created` }   /* simple expression */   /* collections used: o : z_obs */
 45   SubqueryEndNode         1       1          0       0.00005         - RETURN  #24 ) /* subquery end */
 42   SubqueryStartNode       1       2          0       0.00002     - LET ccbysa = ( /* subquery begin */
 39   IndexNode               1     101          0       0.00246       - FOR o IN z_obs   /* reverse persistent index scan, index scan + document lookup (projections: `_key`, `date_created`) */    
  8   LimitNode               1     101          0       0.00001         - LIMIT 0, 100
  9   CalculationNode         1     101          0       0.00020         - LET #18 = { "_key" : o.`_key`, "date_created" : o.`date_created` }   /* simple expression */   /* collections used: o : z_obs */
 43   SubqueryEndNode         1       1          0       0.00008         - RETURN  #18 ) /* subquery end */
 32   CalculationNode         1       1          0       0.00005     - LET alllicenses = UNION(ccbysa, ccbync, foobar)   /* simple expression */
 33   EnumerateListNode       1     200          0       0.00013     - FOR doc IN alllicenses   /* list iteration */
 34   CalculationNode         1     200          0       0.00007       - LET #32 = doc.`date_created`   /* attribute expression */
 35   SortNode                1     100          0       0.00015       - SORT #32 DESC   /* sorting strategy: standard */
 36   LimitNode               1     100          0       0.00001       - LIMIT 0, 100
 37   CalculationNode         1     100          0       0.00005       - LET #34 = doc.`_key`   /* attribute expression */
 38   ReturnNode              1     100          0       0.00001       - RETURN #34

Indexes used:
 By   Name                      Type         Collection   Unique   Sparse   Cache   Selectivity   Fields                          Stored values   Ranges
 41   idx_1735704198117326848   persistent   z_obs        false    false    false       49.27 %   [ `license`, `date_created` ]   [  ]            (o.`license` == "foobar")
 40   idx_1735704198117326848   persistent   z_obs        false    false    false       49.27 %   [ `license`, `date_created` ]   [  ]            (o.`license` == "cc-by-nc")
 39   idx_1735704198117326848   persistent   z_obs        false    false    false       49.27 %   [ `license`, `date_created` ]   [  ]            (o.`license` == "cc-by-sa")

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   move-filters-up
  3   move-calculations-up-2
  4   move-filters-up-2
  5   use-indexes
  6   remove-filter-covered-by-index
  7   use-index-for-sort
  8   remove-unnecessary-calculations-2
  9   move-calculations-down
 10   reduce-extraction-to-projection
 11   splice-subqueries

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Cache Hits/Misses   Filtered   Peak Mem [b]   Exec Time [s]
           0            0           0          200               0 / 0          0          98304         0.01713

Query Profile:
 Query Stage               Duration [s]
 initializing                   0.00000
 parsing                        0.00019
 optimizing ast                 0.00003
 loading collections            0.00001
 instantiating plan             0.00013
 optimizing plan                0.00133
 instantiating executors        0.00026
 executing                      0.01517
 finalizing                     0.00010

Thank you again for your time, all the precious information about how indexes work, and for considering the suggested optimization.

Mathias

@matcho : your suggestion to split up the query into 3 different lookup subqueries, each with a limit, and then joining their results for getting the data back in total order makes a lot of sense.

However, I wouldn't like the optimizer to transform the query this way automatically, for the following reasons:

There will be significantly more execution nodes in the execution plan when doing this transformation, which may blow up the execution plans for large queries and may make them violate built-in limits for the maximum number of execution nodes in a query. More execution nodes also may larger query plan optimization times, because the larger plans will be visible to all following plan optimizations.
The optimization does not work if the lookup values are not hard-coded but dynamically determined at runtime, e.g. LET categories = (FOR c IN category RETURN c._key) FOR doc IN collection FILTER doc.category IN categories ....
Running multiple subqueries and then joining their results together via UNION(...) function requires materializing the results from all subqueries first, which may consume a lot of memory if the results of the subqueries are large.

But still the idea is valuable, because the performance improvement is substantial. I tried to a different solution, which will add a limit to the index lookup code. This has the same benefit and performance improvement, but does not require to transform the query automatically or manually. Right now I haven't implemented that properly, just did a quick POC to verify the performance benefit. To make this a proper feature, it would need to be implemented and tested properly. I can't say much about yet if and when this could be done, but it looks like a promising optimization.

OK I understand, the limitations you mention definitely make sense. Looking forward to hearing news about that, then. Thank you very much @jsteemann !

Hi @matcho, we have a working solution for this problem in the PR https://github.com/arangodb/arangodb/pull/21127. Hopefully it will be merged soon and then it will probably come out in the upcoming release 3.12.2.

Great news, thank you @jbajic !

arangodb / arangodb