GlareDB / glaredb

GlareDB: An analytics DBMS for distributed data
https://glaredb.com
MIT License

Hybrid Execution execs & `EXPLAIN` #1645

universalmind303 opened 1 year ago

universalmind303 commented 1 year ago

Explain Execs

TL;DR: we need to think about the plans in a way that makes them easy to explain when running EXPLAIN. They should be self-describing & contain all relevant information about remote/local execution.

I was initially working on EXPLAIN for RPC & it brought up some questions about how we can properly display this information when running an explain, such as:

> explain select * from csv_scan('https://raw.githubusercontent.com/GlareDB/glaredb/main/testdata/sqllogictests_datasources_common/data/bikeshare_stations.csv');
┌───────────────┬─────────────────────────────────────────────────────────────────────┐
│ plan_type     │ plan                                                                │
│ ──            │ ──                                                                  │
│ Utf8          │ Utf8                                                                │
╞═══════════════╪═════════════════════════════════════════════════════════════════════╡
│ logical_plan  │ TableScan: csv_scan projection=[station_id, name, status, address,  │
│               │ alternate_name, city_asset_number, property_type, number_of_docks,  │
│               │ power_type, footprint_length, footprint_width, notes,               │
│               │ council_district, modified_date]                                    │
│ physical_plan │ CsvExec: file_groups={1 group: [[]]}, projection=[station_id, name, │
│               │ status, address, alternate_name, city_asset_number, property_type,  │
│               │ number_of_docks, power_type, footprint_length, footprint_width,     │
│               │ notes, council_district, modified_date], has_header=true            │
└───────────────┴─────────────────────────────────────────────────────────────────────┘

I think the logical plan explain makes sense (for the most part). We could maybe clarify by wrapping this in something like RemoteTableScan.

I think we need further clarity on the physical plan execs, such as:

│ physical_plan │ RemoteExec:       |
│               │     CsvExec {...} |              

which could be flattened to:

│ physical_plan │ RemoteCsvExec {...} |   
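The nested vs. flattened rendering above could be sketched roughly like this (the `ExplainNode` type and both render functions are hypothetical, not GlareDB's actual plan types):

```rust
// Hypothetical sketch: a physical plan node that may be wrapped in a
// Remote marker, and the two candidate ways of rendering it in EXPLAIN.

#[derive(Debug)]
enum ExplainNode {
    /// A concrete exec, e.g. "CsvExec {...}".
    Exec(String),
    /// Marker indicating the child runs on the remote node.
    Remote(Box<ExplainNode>),
}

/// Nested rendering: RemoteExec on its own line, child indented below it.
fn render_nested(node: &ExplainNode, indent: usize) -> String {
    let pad = "  ".repeat(indent);
    match node {
        ExplainNode::Exec(name) => format!("{pad}{name}"),
        ExplainNode::Remote(child) => {
            format!("{pad}RemoteExec:\n{}", render_nested(child, indent + 1))
        }
    }
}

/// Flattened rendering: fold the Remote marker into the child's name.
fn render_flat(node: &ExplainNode) -> String {
    match node {
        ExplainNode::Exec(name) => name.clone(),
        ExplainNode::Remote(child) => format!("Remote{}", render_flat(child)),
    }
}
```

Either way, the remote/local split stays visible in the output; flattening just trades a level of nesting for a longer exec name.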

Hybrid execution Execs

In order to properly EXPLAIN all of these paths, we need a very clear understanding of the hybrid execution model & the resulting execs.

I know we have the following

but I'm not sure I fully understand how all of those are being used.


My understanding of the ideal execution for various execs

🏠 = local 📡 = remote

Single source execs

| Data Source | Optimal Run Location |
| --- | --- |
| Local | 🏠 |
| Remote | 📡 |
| External (s3, http, ...) | 📡 |
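Encoded as a function, the single-source table might look like this (enum and function names are illustrative, not GlareDB's actual types):

```rust
// Hypothetical sketch of the single-source placement rule.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataSource {
    Local,
    Remote,
    External, // s3, http, ...
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RunLocation {
    Local,  // 🏠
    Remote, // 📡
}

/// Optimal run location for a single-source exec: only plans over
/// purely local data stay local; everything else runs remotely.
fn optimal_location(source: DataSource) -> RunLocation {
    match source {
        DataSource::Local => RunLocation::Local,
        DataSource::Remote | DataSource::External => RunLocation::Remote,
    }
}
```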

binary join execs (we can skip this section & just rely on the multi join logic for now)

We could be slightly more sophisticated when doing a non-nested join.

Some assumptions

| | Local | Remote | External |
| --- | --- | --- | --- |
| Local | 🏠 | 📡 | 🏠 |
| Remote | 📡 | 📡 | 📡 |
| External | 🏠 | 📡 | 📡 |
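The binary join matrix above reduces to two rules: any remote input pulls the join remote, and external-external joins also run remotely (both sides need fetching anyway). A hypothetical encoding (names are illustrative):

```rust
// Hypothetical sketch of the binary join placement matrix.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataSource {
    Local,
    Remote,
    External,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RunLocation {
    Local,  // 🏠
    Remote, // 📡
}

/// Optimal run location for a binary join over two sources.
fn optimal_join_location(left: DataSource, right: DataSource) -> RunLocation {
    use DataSource::*;
    match (left, right) {
        // Any remote input pulls the join to the remote node.
        (Remote, _) | (_, Remote) => RunLocation::Remote,
        // Both sides external: fetch and join remotely.
        (External, External) => RunLocation::Remote,
        // Local x Local, Local x External, External x Local: run locally.
        _ => RunLocation::Local,
    }
}
```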

Multi join execs

This is where things get really tricky. It's probably best to keep this portion rather simple for now & create some optimization rules later on.

scsmithr commented 1 year ago

Quick table on where we should be doing certain actions:

| Table/Function | Resolve | Dispatch | Execute |
| --- | --- | --- | --- |
| read_* (read databases) | remote | remote | remote |
| *_scan (s3/gcs/http) | remote | remote | remote |
| *_scan (local) | local | local | local |
| Native tables | local | remote | remote |
| External tables | local | remote | remote |
| Tables using an external database | remote | remote | remote |
| Temp tables | local | local | local |
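The placement table above could be sketched as a single lookup function (the enum and function names here are hypothetical, not GlareDB's actual types):

```rust
// Hypothetical encoding of the resolve/dispatch/execute placement table.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Loc {
    Local,
    Remote,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TableKind {
    ReadDatabase,     // read_* (read databases)
    ExternalScan,     // *_scan over s3/gcs/http
    LocalScan,        // *_scan over local files
    Native,           // native tables
    External,         // external tables
    ExternalDatabase, // tables using an external database
    Temp,             // temp tables
}

/// Returns (resolve, dispatch, execute) locations for a table/function kind.
fn placement(kind: TableKind) -> (Loc, Loc, Loc) {
    use Loc::*;
    use TableKind::*;
    match kind {
        ReadDatabase | ExternalScan | ExternalDatabase => (Remote, Remote, Remote),
        LocalScan | Temp => (Local, Local, Local),
        // Resolved locally (catalog lives client-side), but dispatched
        // and executed remotely.
        Native | External => (Local, Remote, Remote),
    }
}
```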

Notes:

greyscaled commented 1 year ago

Readable explains for user + debugging.

We can move it to next to improve upon the initial RPC impl.

universalmind303 commented 1 year ago

So I think this is blocked as long as we are using references for our remote execs. Without (de)serializing the inner plan, the remote exec nodes are just black boxes.
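The black-box problem could be illustrated like this (types are hypothetical, meant only to show why a plan reference can't be explained on the client):

```rust
// Hypothetical sketch: a remote exec that holds either an opaque
// reference to a plan living on the server, or the inner plan shipped
// alongside it in serialized/renderable form.

#[derive(Debug)]
enum RemotePlan {
    /// Only a reference id; the actual plan lives on the remote node.
    Reference(u64),
    /// The inner plan's description shipped with the exec,
    /// e.g. "CsvExec {...}".
    Inlined(String),
}

fn explain(plan: &RemotePlan) -> String {
    match plan {
        // Black box: the client has nothing to show but the handle.
        RemotePlan::Reference(id) => format!("RemoteExec: <opaque plan ref {id}>"),
        // With the plan inlined, it renders like any local exec.
        RemotePlan::Inlined(desc) => format!("RemoteExec: {desc}"),
    }
}
```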