Eventual-Inc / Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0
2.21k stars 150 forks source link

parquet metadata only queries perform full scans #2298

Open universalmind303 opened 4 months ago

universalmind303 commented 4 months ago

Describe the bug queries such as read_parquet().count_rows() should not do a full scan, and instead be should able to be fulfilled by the metadata only.

The need for the full scan should be optimized away during physical planning.

To Reproduce Steps to reproduce the behavior: daft.read_parquet().count_rows() **Expected behavior** Metadata only operations such ascount(*)orcount_rows()` usually can be fulfilled without needing to perform a full scan.

Additional context dump from daft.read_parquet('lineitem.parquet').explain(show_all=True)

== Unoptimized Logical Plan ==

* GlobScanOperator
|   Glob paths = [../Daft/lineitem.parquet]
|   Coerce int96 timestamp unit = Nanoseconds
|   IO config = S3 config = { Max connections = 8, Retry initial backoff ms = 1000, Connect timeout ms = 30000, Read timeout ms = 30000, Max retries = 25, Retry mode = adaptive, Anonymous = false, Use SSL = true, Verify SSL = true, Check hostname SSL = true, Requester pays = false, Force Virtual Addressing = false }, Azure config = { Anoynmous = false, Use SSL = true }, GCS config = { Anoynmous = false }
|   Use multithreading = true
|   File schema = l_orderkey#Int64, l_partkey#Int64, l_suppkey#Int64, l_linenumber#Int64, l_quantity#Int64, l_extendedprice#Float64, l_discount#Float64, l_tax#Float64, l_returnflag#Utf8, l_linestatus#Utf8, l_shipdate#Date, l_commitdate#Date, l_receiptdate#Date, l_shipinstruct#Utf8, l_shipmode#Utf8, comments#Utf8
|   Partitioning keys = []
|   Output schema = l_orderkey#Int64, l_partkey#Int64, l_suppkey#Int64, l_linenumber#Int64, l_quantity#Int64, l_extendedprice#Float64, l_discount#Float64, l_tax#Float64, l_returnflag#Utf8, l_linestatus#Utf8, l_shipdate#Date, l_commitdate#Date, l_receiptdate#Date, l_shipinstruct#Utf8, l_shipmode#Utf8, comments#Utf8

== Optimized Logical Plan ==

* GlobScanOperator
|   Glob paths = [../Daft/lineitem.parquet]
|   Coerce int96 timestamp unit = Nanoseconds
|   IO config = S3 config = { Max connections = 8, Retry initial backoff ms = 1000, Connect timeout ms = 30000, Read timeout ms = 30000, Max retries = 25, Retry mode = adaptive, Anonymous = false, Use SSL = true, Verify SSL = true, Check hostname SSL = true, Requester pays = false, Force Virtual Addressing = false }, Azure config = { Anoynmous = false, Use SSL = true }, GCS config = { Anoynmous = false }
|   Use multithreading = true
|   File schema = l_orderkey#Int64, l_partkey#Int64, l_suppkey#Int64, l_linenumber#Int64, l_quantity#Int64, l_extendedprice#Float64, l_discount#Float64, l_tax#Float64, l_returnflag#Utf8, l_linestatus#Utf8, l_shipdate#Date, l_commitdate#Date, l_receiptdate#Date, l_shipinstruct#Utf8, l_shipmode#Utf8, comments#Utf8
|   Partitioning keys = []
|   Output schema = l_orderkey#Int64, l_partkey#Int64, l_suppkey#Int64, l_linenumber#Int64, l_quantity#Int64, l_extendedprice#Float64, l_discount#Float64, l_tax#Float64, l_returnflag#Utf8, l_linestatus#Utf8, l_shipdate#Date, l_commitdate#Date, l_receiptdate#Date, l_shipinstruct#Utf8, l_shipmode#Utf8, comments#Utf8

== Physical Plan ==

* TabularScan:
|   Num Scan Tasks = 21
|   Estimated Scan Bytes = 2102749617
|   Clustering spec = { Num partitions = 21 }
jaychia commented 4 months ago

Indeed. We need a mechanism for Daft to represent tables with "no columns", but still have valid metadata.

Currently the scan will naively read the first column I believe.