arangodb / arangodb

🥑 ArangoDB is a native multi-model database with flexible data models for documents, graphs, and key-values. Build high performance applications using a convenient SQL-like query language or JavaScript extensions.
https://www.arangodb.com
Other
13.48k stars 832 forks source link

ArangoSearch view: sort after pagination breaks view performance #20983

Open matcho opened 2 months ago

matcho commented 2 months ago

My Environment

Component, Query & Data

Affected feature: AQL query using ArangoSearch view.

We are in a case where we have to paginate results of an ArangoSearch view according to its primary sort field/order, which is very efficient, but also have to apply a secondary sort on _key to ensure results order, as the primary sort may have identical values in the output.

Trying to apply this secondary sort after pagination to keep the excellent performance given by the view's primary sort does not work.

AQL query (if applicable):

common part of all queries: a polygon roughly representing France

LET france = [
[
    [
        2.9451529973202923,
        42.56519369560516
        ],
        [
        3.028384657289564,
        43.16018921577199
        ],
        [
        3.876766215111644,
        43.5786612647569
        ],
        [
        4.398984405007951,
        43.45601573567811
        ],
        [
        5.0093594124186325,
        43.34006816741453
        ],
        [
        5.747557513089987,
        43.03040783742591
        ],
        [
        6.303429422950643,
        43.15940138076181
        ],
        [
        7.986591947510675,
        43.95432691596994
        ],
        [
        7.25079300615576,
        44.13676669088349
        ],
        [
        6.799998138643161,
        44.25008542230151
        ],
        [
        6.990668431316976,
        44.71231367327766
        ],
        [
        6.6223368454342335,
        45.02759185570113
        ],
        [
        7.060450746338972,
        45.15559025543743
        ],
        [
        7.078599998021048,
        45.48050306310927
        ],
        [
        6.766156417033841,
        45.73902899659615
        ],
        [
        6.434484308273198,
        46.288737743712915
        ],
        [
        5.9607513357667585,
        46.21272211846565
        ],
        [
        5.7775084764502935,
        46.33510065897019
        ],
        [
        6.524532055475703,
        46.816445588486886
        ],
        [
        7.4732835380936535,
        47.624328455048214
        ],
        [
        7.658499374053349,
        48.31395798849806
        ],
        [
        8.285916876254191,
        49.00280546272157
        ],
        [
        7.2555762475586505,
        49.0556907660706
        ],
        [
        6.459531657163751,
        49.358114191285296
        ],
        [
        5.649767518068785,
        49.535805941801186
        ],
        [
        5.017250538484518,
        50.178848616854594
        ],
        [
        4.67291387023829,
        49.950987210243056
        ],
        [
        4.016794830237075,
        50.118483239967844
        ],
        [
        3.3686575247311907,
        50.75748303512404
        ],
        [
        2.630261638674483,
        51.36461684652656
        ],
        [
        1.3247228937639193,
        50.98291526880743
        ],
        [
        1.3477230868985828,
        50.29757588063177
        ],
        [
        0.9405327210914152,
        50.004247045507896
        ],
        [
        0.06029608464942271,
        49.567116723682574
        ],
        [
        -0.168797363273967,
        49.29794087502833
        ],
        [
        -1.5118288319860937,
        49.68224213505317
        ],
        [
        -2.3700804644700497,
        49.9352631542921
        ],
        [
        -1.8964209798128877,
        48.67870368052783
        ],
        [
        -2.7999984446888107,
        48.648184926091545
        ],
        [
        -3.4320476339123616,
        48.90181183763346
        ],
        [
        -5.062217890871096,
        48.7829635278093
        ],
        [
        -4.915127395897656,
        48.09550619935575
        ],
        [
        -4.331997011705312,
        47.561089950202074
        ],
        [
        -3.889925905192257,
        47.73183697763244
        ],
        [
        -2.1584912477609066,
        47.127315767140146
        ],
        [
        -2.066621991428633,
        46.73966360988828
        ],
        [
        -1.149460824727072,
        46.37963866929468
        ],
        [
        -1.0367935316345722,
        45.99339942620742
        ],
        [
        -1.358724693721257,
        46.055256556955015
        ],
        [
        -1.327045588243692,
        45.697812990147185
        ],
        [
        -0.6045129606267494,
        45.05492464045966
        ],
        [
        -1.1456356959797347,
        45.36442313653396
        ],
        [
        -1.0553148946762008,
        44.65358946945145
        ],
        [
        -1.5635398211772724,
        43.233232969351945
        ],
        [
        -0.47795042840178326,
        42.780379042015824
        ],
        [
        0.7059118926233623,
        42.694800602268
        ],
        [
        0.8756249148560755,
        42.93203078878375
        ],
        [
        1.6422610084557618,
        42.66599606570023
        ],
        [
        2.9451529973202923,
        42.56519369560516
        ]
]
]

query 1 (quick)

FOR o in obs_geo_view
    SEARCH ANALYZER(GEO_CONTAINS(GEO_POLYGON(france), o.geoloc), "geopoint_pn")
    SORT o.date_obs DESC
    LIMIT 20000, 1000
    RETURN o._key

query 2 (slow)

FOR o in obs_geo_view
    SEARCH ANALYZER(GEO_CONTAINS(GEO_POLYGON(france), o.geoloc), "geopoint_pn")
    SORT o.date_obs DESC
    LIMIT 20000, 1000
    SORT o.date_obs DESC, o._key // secondary sort
    RETURN o._key

query 3 (still slow)

LET data = (
FOR o in obs_geo_view
    SEARCH ANALYZER(GEO_CONTAINS(GEO_POLYGON(france), o.geoloc), "geopoint_pn")
    SORT o.date_obs DESC
    LIMIT 20000, 1000
    RETURN {
        date_obs: o.date_obs,
        _key: o._key
    }
)
// return data // this would be quick, returning `date_obs` along with `_key` does not affect performance
FOR d IN data
    SORT d.date_obs DESC, d._key // secondary sort
    RETURN d._key

query 4 (even way slower but this is expected)

FOR o in obs_geo_view
    SEARCH ANALYZER(GEO_CONTAINS(GEO_POLYGON(france), o.geoloc), "geopoint_pn")
    SORT o.date_obs DESC, o._key // prevents the use of view's primary sort
    LIMIT 20000, 1000
    RETURN o._key

AQL explain and/or profile (if applicable):

profile query 1

Execution plan:
 Id   NodeType            Calls   Items   Filtered   Runtime [s]   Comment
  1   SingletonNode           1       1          0       0.00001   * ROOT
  3   EnumerateViewNode       1   21000          0       0.02472     - FOR o IN obs_geo_view SEARCH ANALYZER(GEO_CONTAINS({ "type" : "Polygon", "coordinates" : [ [ [ 2.9451529973202923, 42.56519369560516 ], [ 3.028384657289564, 43.16018921577199 ], [ 3.876766215111644, 43.5786612647569 ], [ 4.398984405007951, 43.45601573567811 ], [ 5.0093594124186325, 43.34006816741453 ], [ 5.747557513089987, 43.03040783742591 ], [ 6.303429422950643, 43.15940138076181 ], [ 7.986591947510675, 43.95432691596994 ], [ 7.25079300615576, 44.13676669088349 ], [ 6.799998138643161, 44.25008542230151 ], [ 6.990668431316976, 44.71231367327766 ], [ 6.6223368454342335, 45.02759185570113 ], [ 7.060450746338972, 45.15559025543743 ], [ 7.078599998021048, 45.48050306310927 ], [ 6.766156417033841, 45.73902899659615 ], [ 6.434484308273198, 46.288737743712915 ], [ 5.9607513357667585, 46.21272211846565 ], [ 5.7775084764502935, 46.33510065897019 ], [ 6.524532055475703, 46.816445588486886 ], [ 7.4732835380936535, 47.624328455048214 ], ... ] ] }, o.`geoloc`), "geopoint_pn")   /* view query */
  6   LimitNode               1    1000          0       0.00002       - LIMIT 20000, 1000
  7   CalculationNode         1    1000          0       0.00021       - LET #4 = o.`_key`   /* attribute expression */
  8   ReturnNode              1    1000          0       0.00001       - RETURN #4

Indexes used:
 none

Optimization rules applied:
 Id   RuleName
  1   remove-unnecessary-calculations
  2   handle-arangosearch-views
  3   remove-unnecessary-calculations-2

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Cache Hits/Misses   Filtered   Peak Mem [b]   Exec Time [s]
           0            0           0        21000               0 / 0          0        1409024         0.02628

Query Profile:
 Query Stage               Duration [s]
 initializing                   0.00000
 parsing                        0.00038
 optimizing ast                 0.00027
 loading collections            0.00002
 instantiating plan             0.00005
 optimizing plan                0.00049
 instantiating executors        0.00006
 executing                      0.02499
 finalizing                     0.00003

profile query 2

Execution plan:
 Id   NodeType            Calls     Items   Filtered   Runtime [s]   Comment
  1   SingletonNode           1         1          0       0.00001   * ROOT
  3   EnumerateViewNode    3300   3299874          0       2.25569     - FOR o IN obs_geo_view SEARCH ANALYZER(GEO_CONTAINS({ "type" : "Polygon", "coordinates" : [ [ [ 2.9451529973202923, 42.56519369560516 ], [ 3.028384657289564, 43.16018921577199 ], [ 3.876766215111644, 43.5786612647569 ], [ 4.398984405007951, 43.45601573567811 ], [ 5.0093594124186325, 43.34006816741453 ], [ 5.747557513089987, 43.03040783742591 ], [ 6.303429422950643, 43.15940138076181 ], [ 7.986591947510675, 43.95432691596994 ], [ 7.25079300615576, 44.13676669088349 ], [ 6.799998138643161, 44.25008542230151 ], [ 6.990668431316976, 44.71231367327766 ], [ 6.6223368454342335, 45.02759185570113 ], [ 7.060450746338972, 45.15559025543743 ], [ 7.078599998021048, 45.48050306310927 ], [ 6.766156417033841, 45.73902899659615 ], [ 6.434484308273198, 46.288737743712915 ], [ 5.9607513357667585, 46.21272211846565 ], [ 5.7775084764502935, 46.33510065897019 ], [ 6.524532055475703, 46.816445588486886 ], [ 7.4732835380936535, 47.624328455048214 ], ... ] ] }, o.`geoloc`), "geopoint_pn") LET #2 = o.`date_obs`   /* view query with late materialization */
  5   SortNode                1     21000          0       0.07098       - SORT #2 DESC   /* sorting strategy: constrained heap */
  6   LimitNode               1      1000          0       0.00002       - LIMIT 20000, 1000
 12   MaterializeNode         1      1000          0       0.04827       - MATERIALIZE o
  8   CalculationNode         1      1000          0       0.00032       - LET #6 = o.`_key`   /* attribute expression */
  9   SortNode                1      1000          0       0.00030       - SORT #2 DESC, #6 ASC   /* sorting strategy: standard */
 11   ReturnNode              1      1000          0       0.00001       - RETURN #6

Indexes used:
 none

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   remove-redundant-calculations
  3   remove-unnecessary-calculations
  4   handle-arangosearch-views
  5   sort-limit
  6   late-document-materialization-arangosearch

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Cache Hits/Misses   Filtered   Peak Mem [b]   Exec Time [s]
           0            0           0      3299874               0 / 0          0        2359296         2.37682

Query Profile:
 Query Stage               Duration [s]
 initializing                   0.00000
 parsing                        0.00028
 optimizing ast                 0.00019
 loading collections            0.00001
 instantiating plan             0.00005
 optimizing plan                0.00052
 instantiating executors        0.00013
 executing                      2.37563
 finalizing                     0.00006

profile query 3

Execution plan:
 Id   NodeType            Calls     Items   Filtered   Runtime [s]   Comment
  1   SingletonNode           1         1          0       0.00001   * ROOT
  4   EnumerateViewNode    3300   3299874          0       2.14131     - FOR #18 IN obs_geo_view SEARCH ANALYZER(GEO_CONTAINS({ "type" : "Polygon", "coordinates" : [ [ [ 2.9451529973202923, 42.56519369560516 ], [ 3.028384657289564, 43.16018921577199 ], [ 3.876766215111644, 43.5786612647569 ], [ 4.398984405007951, 43.45601573567811 ], [ 5.0093594124186325, 43.34006816741453 ], [ 5.747557513089987, 43.03040783742591 ], [ 6.303429422950643, 43.15940138076181 ], [ 7.986591947510675, 43.95432691596994 ], [ 7.25079300615576, 44.13676669088349 ], [ 6.799998138643161, 44.25008542230151 ], [ 6.990668431316976, 44.71231367327766 ], [ 6.6223368454342335, 45.02759185570113 ], [ 7.060450746338972, 45.15559025543743 ], [ 7.078599998021048, 45.48050306310927 ], [ 6.766156417033841, 45.73902899659615 ], [ 6.434484308273198, 46.288737743712915 ], [ 5.9607513357667585, 46.21272211846565 ], [ 5.7775084764502935, 46.33510065897019 ], [ 6.524532055475703, 46.816445588486886 ], [ 7.4732835380936535, 47.624328455048214 ], ... ] ] }, #18.`geoloc`), "geopoint_pn") LET #17 = #18.`date_obs`   /* view query with late materialization */
  6   SortNode                1     21000          0       0.06780       - SORT #17 DESC   /* sorting strategy: constrained heap */
  7   LimitNode               1      1000          0       0.00002       - LIMIT 20000, 1000
 17   MaterializeNode         1      1000          0       0.00963       - MATERIALIZE #18
 13   CalculationNode         1      1000          0       0.00021       - LET #12 = #18.`_key`   /* attribute expression */
 14   SortNode                1      1000          0       0.00020       - SORT #17 DESC, #12 ASC   /* sorting strategy: standard */
 16   ReturnNode              1      1000          0       0.00001       - RETURN #12

Indexes used:
 none

Optimization rules applied:
 Id   RuleName
  1   inline-subqueries
  2   simplify-conditions
  3   move-calculations-up
  4   remove-redundant-calculations
  5   remove-unnecessary-calculations
  6   handle-arangosearch-views
  7   sort-limit
  8   late-document-materialization-arangosearch

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Cache Hits/Misses   Filtered   Peak Mem [b]   Exec Time [s]
           0            0           0      3299874               0 / 0          0        2359296         2.22176

Query Profile:
 Query Stage               Duration [s]
 initializing                   0.00000
 parsing                        0.00059
 optimizing ast                 0.00039
 loading collections            0.00002
 instantiating plan             0.00010
 optimizing plan                0.00117
 instantiating executors        0.00025
 executing                      2.21922
 finalizing                     0.00005

Dataset: ~22M in collection observations feeding the view

obs_geo_view ArangoSearch view description :

{
  "writebufferSizeMax": 33554432,
  "id": "42361429501",
  "storedValues": [],
  "name": "obs_geo_view",
  "type": "arangosearch",
  "consolidationPolicy": {
    "type": "tier",
    "segmentsBytesFloor": 2097152,
    "segmentsBytesMax": 5368709120,
    "segmentsMax": 10,
    "segmentsMin": 1,
    "minScore": 0
  },
  "writebufferActive": 0,
  "links": {
    "observations": {
      "analyzers": [
        "identity"
      ],
      "fields": {
        "author": {
          "fields": {
            "id": {}
          }
        },
        "project_id": {},
        "computed": {
          "fields": {
            "current_name": {},
            "valid": {}
          }
        },
        "geoloc": {
          "analyzers": [
            "geopoint_pn"
          ]
        },
        "images": {
          "fields": {
            "computed": {
              "fields": {
                "valid": {},
                "current_organ": {}
              }
            }
          }
        },
        "import": {},
        "date_obs": {},
        "date_created": {}
      },
      "includeAllFields": false,
      "storeValues": "none",
      "trackListPositions": false
    }
  },
  "commitIntervalMsec": 1000,
  "consolidationIntervalMsec": 1000,
  "globallyUniqueId": "h5CA042DAE734/42361429501",
  "cleanupIntervalStep": 2,
  "primarySort": [
    {
      "field": "date_obs",
      "asc": false
    }
  ],
  "primarySortCompression": "lz4",
  "writebufferIdle": 64
}

Size of your Dataset on disk: 15GB for collection observations feeding the view

Steps to reproduce

  1. Run query 1: result is very quick as expected (~30 ms), as sort statement used corresponds to the view's primary sort
  2. Run query 2: an additional sort is added after pagination, to ensure results order while trying to keep the benefit of the primary sort; result is slow (> 2s) and profile shows that EnumerateViewNode has a lot more items than it should, and remove-redundant-calculations rules is applied (maybe removes the first view-optimized sort statement ?)
  3. Run query 3 : try to force regular pagination before sorting again, by using a subquery; does not work, result is slow as in query 2 (> 2s) and profile shows the same behaviour
  4. try query 4 just for fun, but as expected this is even slower (38s), as the SORT statement prevents taking advantage of the view's primary sort

Problem: Extra sort after pagination should not break performance of steps before

Expected result: Pagination is executed quickly as expected thanks to the view primary sort, and extra sort is applied only on returned elements

Thank you

alexbakharew commented 2 months ago

Hi @matcho !

This behavior is expected since your index doesn't contain field _key: Therefore it should be materialized directly from the storage. As a result, it is way slower as expected.

Could you please try to add _key field to primarySort as well? Here is the documentation about how such view with 2 fields in primarySort could be created: https://docs.arangodb.com/3.11/index-and-search/arangosearch/performance/#primary-sort-order

matcho commented 2 months ago

Hi @alexbakharew

Thank you for your answer.

I understand that using a 2-field primarySort would certainly work − we'll try that as soon as possible, although it seems that primarySort cannot be changed on an existing view : const error: /primarySort must be equal to constant. Schema: {"allowedValue":[{"field":"date_obs","asc":false}]}, and view has to be re-created from scratch (no big deal).

What still seems strange is that returning a projection that includes _key is very fast, although _key is not an indexed field in the view : the following query runs in 165ms !

FOR o in obs_geo_view
    SEARCH ANALYZER(GEO_CONTAINS(GEO_POLYGON(france), o.geoloc), "geopoint_pn")
    SORT o.date_obs DESC
    LIMIT 20000, 1000
    RETURN {
        date_obs: o.date_obs,
        _key: o._key
    }

Once the 1000 resulting items, including date_obs and _key fields, are returned so fast, what keeps AQL from re-sorting them in-memory ? Sorting 1000 items should take almost no time. In queries mentioned in the original post above, it seems that the optimizer forces the second SORT statement to be executed before pagination, which is not what is expressed in the AQL query, especially query 3 with the sub-query.

So it still looks like an optimizer / AQL issue to me. We should be able to apply a secondary sort after pagination without modifying the view's primarySort. Besides, what if we want to sort after pagination on different fields, in different cases ? We would then have to build multiple views with a different primarySort in each of them, which doesn't seem reasonable.

Thank you

alexbakharew commented 2 months ago

Hi @matcho!

This indeed looks a little bit strange. I will investigate it and then get back to you.