Hybrid Execution execs & `EXPLAIN`

universalmind303 commented 1 year ago

Explain Execs

TL;DR: we need to design the plans so that they are easy to read when running `EXPLAIN`. They should be self-describing & contain all relevant information about remote/local execution.

I was initially working on `EXPLAIN` for RPC & it raised some questions about how we can properly display this information when running an explain such as:

> explain select * from csv_scan('https://raw.githubusercontent.com/GlareDB/glaredb/main/testdata/sqllogictests_datasources_common/data/bikeshare_stations.csv');
┌───────────────┬─────────────────────────────────────────────────────────────────────┐
│ plan_type     │ plan                                                                │
│ ──            │ ──                                                                  │
│ Utf8          │ Utf8                                                                │
╞═══════════════╪═════════════════════════════════════════════════════════════════════╡
│ logical_plan  │ TableScan: csv_scan projection=[station_id, name, status, address,  │
│               │ alternate_name, city_asset_number, property_type, number_of_docks,  │
│               │ power_type, footprint_length, footprint_width, notes,               │
│               │ council_district, modified_date]                                    │
│ physical_plan │ CsvExec: file_groups={1 group: [[]]}, projection=[station_id, name, │
│               │ status, address, alternate_name, city_asset_number, property_type,  │
│               │ number_of_docks, power_type, footprint_length, footprint_width,     │
│               │ notes, council_district, modified_date], has_header=true            │
│               │                                                                     │
└───────────────┴─────────────────────────────────────────────────────────────────────┘

I think the logical plan explain makes sense (for the most part). We could maybe clarify it by wrapping this in something like `RemoteTableScan`.

I think we need further clarity on the physical plan execs, such as:

│ physical_plan │ RemoteExec:       |
│               │     CsvExec {...} |

which could be flattened to:

│ physical_plan │ RemoteCsvExec {...} |
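
As a rough illustration of the flattening idea, here is a minimal Python sketch. It is not GlareDB code: `ExecNode`, `render`, and `flatten_remote` are hypothetical names, and the only assumption is that a `RemoteExec` wrapper with exactly one child can be collapsed into a single `Remote`-prefixed node.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExecNode:
    """Hypothetical stand-in for a physical plan node (name + children)."""
    name: str
    children: List["ExecNode"] = field(default_factory=list)


def render(node: ExecNode, indent: int = 0) -> List[str]:
    """Render a plan as indented lines, one node per line."""
    lines = [" " * indent + node.name]
    for child in node.children:
        lines.extend(render(child, indent + 4))
    return lines


def flatten_remote(node: ExecNode) -> ExecNode:
    """Collapse `RemoteExec -> X` into a single `RemoteX` node."""
    children = [flatten_remote(c) for c in node.children]
    if node.name == "RemoteExec" and len(children) == 1:
        child = children[0]
        return ExecNode("Remote" + child.name, child.children)
    return ExecNode(node.name, children)


if __name__ == "__main__":
    plan = ExecNode("RemoteExec", [ExecNode("CsvExec")])
    print("\n".join(render(plan)))
    print("\n".join(render(flatten_remote(plan))))
```

The nested form keeps the wrapper reusable across all child execs, while the flattened form is shorter to read; either way the remote/local information ends up in the exec name.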

Hybrid execution Execs

In order to properly EXPLAIN all of these paths, we need a very clear understanding of the hybrid execution model & the resulting execs.

I know we have the following:

- `SendRecvJoinExec`
- `ClientExchangeSendExec`
- `ClientExchangeRecvExec`
- `RemoteExecutionExec`
- `RemoteScanExec`

but I'm not sure I fully understand how each of them is used.

My understanding of the ideal execution for various execs

🏠 = local, 📡 = remote

Single source execs

| Data Source | Optimal Run Location |
| --- | --- |
| Local | 🏠 |
| Remote | 📡 |
| External (s3, http, ...) | 📡 |

Binary join execs (we can skip this section & just rely on the multi join logic for now)

We could be slightly more sophisticated when doing a non-nested join.

Some assumptions:

- Anything reading from a remote data source should always run on remote, with local data broadcast to the remote server (we can fine-tune the join/broadcast execs later).
- Any query referencing only local data should always run locally.
- Joins on external data should run in the environment of the other half of the join.
  - If both sides are external, we run it on the remote instance.

|  | Local | Remote | External |
| --- | --- | --- | --- |
| Local | 🏠 | 📡 | 🏠 |
| Remote | 📡 | 📡 | 📡 |
| External | 🏠 | 📡 | 📡 |
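
The binary join matrix boils down to two checks, which can be sketched as follows. This is only an illustration in Python with hypothetical names (`Source`, `Location`, `join_location`), not anything from the GlareDB codebase.

```python
from enum import Enum


class Source(Enum):
    LOCAL = "local"
    REMOTE = "remote"
    EXTERNAL = "external"


class Location(Enum):
    LOCAL = "🏠"
    REMOTE = "📡"


def join_location(left: Source, right: Source) -> Location:
    """Pick where a binary join runs, following the assumptions above."""
    sides = {left, right}
    # Any remote side pulls the whole join to remote.
    if Source.REMOTE in sides:
        return Location.REMOTE
    # Two external sides default to the remote instance.
    if sides == {Source.EXTERNAL}:
        return Location.REMOTE
    # Local x local, or external joined with local (runs where the
    # other half of the join lives).
    return Location.LOCAL


if __name__ == "__main__":
    for left in Source:
        row = " ".join(join_location(left, right).value for right in Source)
        print(f"{left.value:<8} {row}")
```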

Multi join execs

This is where things get really tricky. It's probably best to keep this portion simple for now & add optimization rules later on:

- Any plan with all local nodes executes locally.
- Anything else runs on remote (with local nodes running locally & broadcast to remote).
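
A minimal sketch of that two-rule placement, in Python with hypothetical names (`PlanNode`, `leaf_scans`, `place`). It assumes each leaf scan already knows whether it is local, and leaves open how external sources should be classified.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PlanNode:
    """Hypothetical plan node; `is_local` only matters for leaf scans."""
    name: str
    is_local: bool = False
    children: List["PlanNode"] = field(default_factory=list)


def leaf_scans(node: PlanNode) -> List[PlanNode]:
    """Collect the leaf (scan) nodes of a plan, left to right."""
    if not node.children:
        return [node]
    out: List[PlanNode] = []
    for child in node.children:
        out.extend(leaf_scans(child))
    return out


def place(plan: PlanNode) -> Tuple[str, List[str]]:
    """Return (run location, names of local scans to broadcast to remote)."""
    scans = leaf_scans(plan)
    if all(s.is_local for s in scans):
        return "local", []
    return "remote", [s.name for s in scans if s.is_local]


if __name__ == "__main__":
    mixed = PlanNode("join", children=[
        PlanNode("join", children=[
            PlanNode("local_a", is_local=True),
            PlanNode("remote_b"),
        ]),
        PlanNode("local_c", is_local=True),
    ])
    print(place(mixed))
```

For the mixed plan, the local scans still run on the client and their output is what gets broadcast; everything above them runs on remote.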

scsmithr commented 1 year ago

Quick table on where we should be doing certain actions:

Resolve: How we resolve tables (check if they exist or not)
Dispatch: Loading the table provider

| Table/Function | Resolve | Dispatch | Execute |
| --- | --- | --- | --- |
| `read_*` (read databases) | remote | remote | remote |
| `*_scan` (s3/gcs/http) | remote | remote | remote |
| `*_scan` (local) | local | local | local |
| Native tables | local | remote | remote |
| External tables | local | remote | remote |
| Tables using an external database | remote | remote | remote |
| Temp tables | local | local | local |

External tables: Tables created with CREATE EXTERNAL TABLE ...
Tables using an external database: Tables referenced through a database created with CREATE EXTERNAL DATABASE ...

Notes:

Temp tables can be moved entirely to remote if that's easier.
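
For reference, the table above can be written down as a plain lookup, which makes it easy to see where a given table kind first crosses from local to remote. This is a Python sketch only; the dictionary keys, `Placement`, and `first_remote_phase` are hypothetical labels chosen for illustration, not identifiers from the codebase.

```python
from typing import NamedTuple, Optional


class Placement(NamedTuple):
    resolve: str
    dispatch: str
    execute: str


# Keys are illustrative labels for the rows of the table above.
PLACEMENTS = {
    "read_*": Placement("remote", "remote", "remote"),
    "*_scan (s3/gcs/http)": Placement("remote", "remote", "remote"),
    "*_scan (local)": Placement("local", "local", "local"),
    "native_table": Placement("local", "remote", "remote"),
    "external_table": Placement("local", "remote", "remote"),
    "external_database_table": Placement("remote", "remote", "remote"),
    "temp_table": Placement("local", "local", "local"),
}


def first_remote_phase(kind: str) -> Optional[str]:
    """Return the first phase that runs remotely, or None if all are local."""
    placement = PLACEMENTS[kind]
    for phase in Placement._fields:
        if getattr(placement, phase) == "remote":
            return phase
    return None


if __name__ == "__main__":
    for kind in PLACEMENTS:
        print(f"{kind:<26} first remote phase: {first_remote_phase(kind)}")
```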

greyscaled commented 1 year ago

Readable explains for user + debugging.

We can move it to next to improve upon the initial RPC impl.