EnterpriseDB / mysql_fdw

PostgreSQL foreign data wrapper for MySQL
Other
521 stars 160 forks source link

[Feature request] Support Subquery pushdown #297

Open smilingthax opened 3 months ago

smilingthax commented 3 months ago

Motivation

Examples:

  1. Simple case:

    SELECT 
    id,
    (SELECT COUNT(*) FROM tbl2 WHERE tbl2.parent_id = tbl1.id) 
    FROM tbl1;  -- + WHERE, ...

    The query plain currently looks somewhat like this:

    Foreign Scan on tbl1
    Output: tbl1.id, (SubPlan 1)
    Remote query: SELECT `id` FROM `s`.`tbl1`   -- + WHERE, ORDER BY, ...
    SubPlan 1
     ->  Foreign Scan
           Output: (count(tbl2.parent_id))
           Relations: Aggregate on (s.tbl2)
           Remote query: SELECT count(`id`) FROM `s`.`tbl2` WHERE ((`parent_id` = ?))

    I.e. for each row of tbl1 a separate query is send to the remote database.

  2. Similar case (worse performance):

    SELECT 
    id,
    EXISTS (SELECT FROM tbl2 WHERE tbl2.parent_id = tbl1.id) 
    FROM tbl1;

    Query Plan:

    Foreign Scan on tbl1
    Output: tbl1.id, (hashed SubPlan 2)
    Remote query: SELECT `id` FROM `s`.`tbl1`   -- + WHERE, ORDER BY, ...
    SubPlan 2
     ->  Foreign Scan on s.tbl2
           Output: tbl2.parent_id
           Remote query: SELECT `parent_id` FROM `s`.`tbl2`

    This unfortunately tries to download the complete tbl2!

  3. Another version (might be more difficult to implement?):

    SELECT 
    id,
    r1.*
    FROM tbl1,
    LATERAL (SELECT COUNT(*) FROM tbl2 WHERE tbl2.parent_id = tbl1.id) r1;

    Query Plan:

    Nested Loop
    Output: tbl1.id, (count(tbl2.parent_id))
    ->  Foreign Scan on s.tbl1
         Output: tbl1.id, ... more fields...
         Remote query: SELECT `id` FROM `s`.`tbl1`   -- + WHERE, ORDER BY, ...
    ->  Foreign Scan
         Output: (count(*))
         Relations: Aggregate on ("s".tbl2)
         Remote query: SELECT count(*) FROM `s`.`tbl2` WHERE ((`parent_id` = ?))

    This is basically the same as 1.

Partial workaround:

SELECT
  tbl1.id,
  COUNT(tbl2.id)
FROM tbl1
LEFT JOIN tbl2 ON tbl2.parent_id = tbl1.id
GROUP BY 1

Query Plan: Single Foreign Scan, everything is pushed down

However, this no longer works when additional fields, e.g. tbl1.*, should be retrieved.

One might try this:

WITH t0 AS ( -- NOT MATERIALIZED does not help
  SELECT
    tbl1.id,
    COUNT(tbl2.id)
  FROM table
  LEFT JOIN tbl2 ON tbl2.parent_id = tbl1.id
  GROUP BY 1
  ORDER BY 1
)
SELECT
  tbl1.*,
  count
FROM t0
JOIN tbl1 ON tbl1.id = t0.id

But the resulting query plan (Merge Join between Foreign Scan of tbl1 and Materialize of Foreign Scan from t0) is not feasible for bigger tables, although SELECT * FROM t0 and SELECT * FROM tbl1 are quite fast by themselves...

Expected Result

The aggregation is evaluated on the remote mysql server without requiring a separate query for each row of tbl1.
Also, EXISTS should probably keep the WHERE, instead of downloading the whole table.