arangodb / arangodb

🥑 ArangoDB is a native multi-model database with flexible data models for documents, graphs, and key-values. Build high performance applications using a convenient SQL-like query language or JavaScript extensions.
https://www.arangodb.com
Other
13.55k stars 838 forks source link

Correlated subquery in Arangosearch View is processed incorrectly. #10583

Open edwardflo opened 4 years ago

edwardflo commented 4 years ago

My Environment

The AQL statements below should result in the same result. It looks for movie records where startYear = 1990 and genre = "Comedy". For each one found, it uses a subqiery to fetch records (from title-principals-view, an ArangoSearch view) for the same movie that have a category = "composer". When subquery is in the SEARCh block, it ignores p.tconst == root._key (as you can see in the profile) and the subquery is actually only executing p.category == "composer" instead of p.tconst == root._key AND p.category == "composer". It ends up returning all records in title-principals-view where category = "composer" (which is 1180041 records).

When subquery is in the FILTER block, it works as expected.

Correlated subquery inside a SEARCH block

FOR root IN `title-basics-view`
    SEARCH root.startYear == 1990
    AND root.genres == "Comedy"
    AND LENGTH(
    FOR p IN `title-principals-view`
        SEARCH p.tconst == root._key
        AND p.category == "composer"

    RETURN p
    ) > 0

    COLLECT WITH COUNT INTO length
    RETURN length

Result

[7914]

Profile

Query String (303 chars, cacheable: false):
 FOR root IN `title-basics-view`
     SEARCH root.startYear == 1990
     AND root.genres == "Comedy"
     AND LENGTH(
     FOR p IN `title-principals-view`
         SEARCH p.tconst == root._key
         AND p.category == "composer"
     RETURN p
     ) > 0
     COLLECT WITH COUNT INTO length
     RETURN length

Execution plan:
 Id   NodeType            Calls     Items   Runtime [s]   Comment
  1   SingletonNode           1         1       0.00000   * ROOT
  5   SubqueryNode            1         1       5.02540     - LET #2 = ...   /* const subquery */
  2   SingletonNode           1         1       0.00000       * ROOT
  3   EnumerateViewNode    1181   1180041       4.72990         - FOR p IN title-principals-view SEARCH (p.`category` == "composer")   /* view query */
  4   ReturnNode           1181   1180041       0.29411           - RETURN p
  6   EnumerateViewNode       2      7914       0.00494     - FOR root IN title-basics-view SEARCH ((LENGTH(#2) > 0) && (root.`startYear` == 1990) && (root.`genres` == "Comedy"))   /* view query */
  7   CollectNode             1         1       0.00001       - COLLECT  WITH COUNT INTO length   /* count */
  8   ReturnNode              1         1       0.00000       - RETURN length

Indexes used:
 none

Optimization rules applied:
 Id   RuleName
  1   handle-arangosearch-views

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Filtered   Exec Time [s]
           0            0           0      1187955          0         5.22913

Query Profile:
 Query Stage           Duration [s]
 initializing               0.00001
 parsing                    0.00037
 optimizing ast             0.00004
 loading collections        0.00007
 instantiating plan         0.00009
 optimizing plan            0.00037
 executing                  5.03038
 finalizing                 0.19776

Correlated subquery inside a FILTER block

FOR root IN `title-basics-view`
    SEARCH root.startYear == 1990
    AND root.genres == "Comedy"
    FILTER LENGTH(
    FOR p IN `title-principals-view`
        SEARCH p.tconst == root._key
        AND p.category == "composer"

    RETURN p
    ) > 0

    COLLECT WITH COUNT INTO length
    RETURN length

Result

[2028]

Profile

Query String (305 chars, cacheable: false):
 FOR root IN `title-basics-view`
     SEARCH root.startYear == 1990
     AND root.genres == "Comedy"
     FILTER LENGTH(
     FOR p IN `title-principals-view`
         SEARCH p.tconst == root._key
         AND p.category == "composer"
     RETURN p
     ) > 0
     COLLECT WITH COUNT INTO length
     RETURN length

Execution plan:
 Id   NodeType            Calls   Items   Runtime [s]   Comment
  1   SingletonNode           1       1       0.00001   * ROOT
  2   EnumerateViewNode       8    7914       0.04331     - FOR root IN title-basics-view SEARCH ((root.`startYear` == 1990) && (root.`genres` == "Comedy"))   /* view query */
  6   SubqueryNode            8    7914       0.32824       - LET #2 = ...   /* subquery */
  3   SingletonNode        7914    7914       0.00899         * ROOT
 11   CalculationNode      7914    7914       0.01632           - LET #7 = true   /* json expression */   /* const assignment */
  4   EnumerateViewNode    7914    2325       0.24814           - FOR p IN title-principals-view SEARCH ((p.`tconst` == root.`_key`) && (p.`category` == "composer"))   /* view query */
  5   ReturnNode           7914    2325       0.01759             - RETURN #7
  7   CalculationNode         8    7914       0.00331       - LET #5 = (LENGTH(#2) > 0)   /* simple expression */
  8   FilterNode              3    2028       0.00032       - FILTER #5
  9   CollectNode             1       1       0.00003       - COLLECT  WITH COUNT INTO length   /* count */
 10   ReturnNode              1       1       0.00000       - RETURN length

Indexes used:
 none

Optimization rules applied:
 Id   RuleName
  1   optimize-subqueries
  2   move-calculations-up-2
  3   handle-arangosearch-views

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Filtered   Exec Time [s]
           0            0           0        10239       5886         0.37634

Query Profile:
 Query Stage           Duration [s]
 initializing               0.00001
 parsing                    0.00028
 optimizing ast             0.00003
 loading collections        0.00004
 instantiating plan         0.00007
 optimizing plan            0.00045
 executing                  0.37524
 finalizing                 0.00019
edwardflo commented 4 years ago

@Simran-B Any observations or insights so far on this issue? Thx

edwardflo commented 4 years ago

Hi @Simran-B Any luck with this?

edwardflo commented 4 years ago

@Simran-B Sorry to bug you. Has anyone had a chance to look at this at all?

dremekie commented 4 years ago

@Simran-B @jsteemann Could someone please give an update on this? This is something that is impacting our team as well. Is it still in analysis phase?

Simran-B commented 4 years ago

As explained on Slack by @gnusi, it is considered a bug that the query does not raise an error for such an unsupported query.

It is basically impossible to correctly process a query with a dependent sub-query. It's unclear what should be executed first: sub-query or main query. So the only possibility for now is to use FILTER.

FILTER always means post-processing, that's why it's perfectly valid to use dependent sub-queries. The SEARCH condition is a part of the EnumerateViewNode in the execution plan, therefore it's unclear what should be done first (like the chicken or egg problem).

The underlying search engine already supports indexing the city and country fields with positions, so that you can ask it to find all documents where the positions of city and country match. It's not exposed in AQL yet, but this is how it could look like:

FOR d IN myView
  SEARCH IS_SAME_POSITION(d.visits.country == 'Canada', d.visits.city == 'Paris')
  RETURN d  

The query would match:

{
  visits : [ { country : "Canada", city: "Paris" }, { country: "Germany", city: "Cologne" } ]
}

but not:

{
  visits : [ { country: "France", city: "Paris" }, { country: "Canada", city: "Toronto" } ]
}