apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.67k stars 1.06k forks source link

Datafusion v19.rc1 scan parquet 20x slower than DuckDB v0.6.1 on 15GB ClickBench data #5404

Open jychen7 opened 1 year ago

jychen7 commented 1 year ago

Describe the problem This is NOT a bug, but an potential improvement goal

Datafusion v19.rc1 by default turn on repartition_file_scans at https://github.com/apache/arrow-datafusion/pull/5295

with my local Macbook Pro (2.6 GHz 6-Core Intel Core i7, 32 GB 2667 MHz DDR4), for following query on clickbench 14GB hits.parquet:

To Reproduce Download data file

wget --continue https://datasets.clickhouse.com/hits_compatible/hits.parquet

Prepare SQL create a file called create.sql

CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'hits.parquet';

create a file called q23_no_order_limit_1.sql

SELECT * FROM hits WHERE "URL" LIKE '%google%' limit 1;

Datafusion

git clone https://github.com/apache/arrow-datafusion.git
git checkout 19.0.0-rc1
cd datafusion-cli
cargo build --release

target/release/datafusion-cli -f create.sql q23_no_order_limit_1.sql
// output: 1 row in set. Query took 12.343 seconds

DuckDB

brew install duckdb
duckdb
> .timer on
> SELECT * FROM read_parquet('hits.parquet') WHERE URL LIKE '%google%' LIMIT 1;
// output: Run Time (s): real 0.566 user 1.876031 sys 0.357483

Expected behavior

  1. with single core, datafusion-cli tooks 2s (like cpu time of DuckDB)
  2. with multi cores, datafusion-cli tooks 0.6s (like real time of DuckDB)
jychen7 commented 1 year ago

Additional context Datafusion Plan

> explain ANALYZE SELECT * FROM hits WHERE "URL" LIKE '%google%' limit 1;
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type         | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | ProjectionExec: expr=[WatchID@0 as WatchID, JavaEnable@1 as JavaEnable, Title@2 as Title, GoodEvent@3 as GoodEvent, EventTime@4 as EventTime, EventDate@5 as EventDate, CounterID@6 as CounterID, ClientIP@7 as ClientIP, RegionID@8 as RegionID, UserID@9 as UserID, CounterClass@10 as CounterClass, OS@11 as OS, UserAgent@12 as UserAgent, URL@13 as URL, Referer@14 as Referer, IsRefresh@15 as IsRefresh, RefererCategoryID@16 as RefererCategoryID, RefererRegionID@17 as RefererRegionID, URLCategoryID@18 as URLCategoryID, URLRegionID@19 as URLRegionID, ResolutionWidth@20 as ResolutionWidth, ResolutionHeight@21 as ResolutionHeight, ResolutionDepth@22 as ResolutionDepth, FlashMajor@23 as FlashMajor, FlashMinor@24 as FlashMinor, FlashMinor2@25 as FlashMinor2, NetMajor@26 as NetMajor, NetMinor@27 as NetMinor, UserAgentMajor@28 as UserAgentMajor, UserAgentMinor@29 as UserAgentMinor, CookieEnable@30 as CookieEnable, JavascriptEnable@31 as JavascriptEnable, IsMobile@32 as IsMobile, MobilePhone@33 as MobilePhone, MobilePhoneModel@34 as MobilePhoneModel, Params@35 as Params, IPNetworkID@36 as IPNetworkID, TraficSourceID@37 as TraficSourceID, SearchEngineID@38 as SearchEngineID, SearchPhrase@39 as SearchPhrase, AdvEngineID@40 as AdvEngineID, IsArtifical@41 as IsArtifical, WindowClientWidth@42 as WindowClientWidth, WindowClientHeight@43 as WindowClientHeight, ClientTimeZone@44 as ClientTimeZone, ClientEventTime@45 as ClientEventTime, SilverlightVersion1@46 as SilverlightVersion1, SilverlightVersion2@47 as SilverlightVersion2, SilverlightVersion3@48 as SilverlightVersion3, SilverlightVersion4@49 as SilverlightVersion4, PageCharset@50 as PageCharset, CodeVersion@51 as CodeVersion, IsLink@52 as IsLink, IsDownload@53 as IsDownload, IsNotBounce@54 as IsNotBounce, FUniqID@55 as FUniqID, OriginalURL@56 as OriginalURL, HID@57 as HID, IsOldCounter@58 as IsOldCounter, IsEvent@59 as IsEvent, IsParameter@60 as IsParameter, DontCountHits@61 as DontCountHits, WithHash@62 as WithHash, HitColor@63 as HitColor, LocalEventTime@64 as LocalEventTime, Age@65 as Age, Sex@66 as Sex, Income@67 as Income, Interests@68 as Interests, Robotness@69 as Robotness, RemoteIP@70 as RemoteIP, WindowName@71 as WindowName, OpenerName@72 as OpenerName, HistoryLength@73 as HistoryLength, BrowserLanguage@74 as BrowserLanguage, BrowserCountry@75 as BrowserCountry, SocialNetwork@76 as SocialNetwork, SocialAction@77 as SocialAction, HTTPError@78 as HTTPError, SendTiming@79 as SendTiming, DNSTiming@80 as DNSTiming, ConnectTiming@81 as ConnectTiming, ResponseStartTiming@82 as ResponseStartTiming, ResponseEndTiming@83 as ResponseEndTiming, FetchTiming@84 as FetchTiming, SocialSourceNetworkID@85 as SocialSourceNetworkID, SocialSourcePage@86 as SocialSourcePage, ParamPrice@87 as ParamPrice, ParamOrderID@88 as ParamOrderID, ParamCurrency@89 as ParamCurrency, ParamCurrencyID@90 as ParamCurrencyID, OpenstatServiceName@91 as OpenstatServiceName, OpenstatCampaignID@92 as OpenstatCampaignID, OpenstatAdID@93 as OpenstatAdID, OpenstatSourceID@94 as OpenstatSourceID, UTMSource@95 as UTMSource, UTMMedium@96 as UTMMedium, UTMCampaign@97 as UTMCampaign, UTMContent@98 as UTMContent, UTMTerm@99 as UTMTerm, FromTag@100 as FromTag, HasGCLID@101 as HasGCLID, RefererHash@102 as RefererHash, URLHash@103 as URLHash, CLID@104 as CLID], metrics=[output_rows=1, elapsed_compute=21.342µs, spill_count=0, spilled_bytes=0, mem_used=0] |
|                   |   GlobalLimitExec: skip=0, fetch=1, metrics=[output_rows=1, elapsed_compute=55.749µs, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
|                   |     CoalescePartitionsExec, metrics=[output_rows=410, elapsed_compute=19.372µs, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|                   |       CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=410, elapsed_compute=25.377822ms, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
|                   |         FilterExec: URL@13 LIKE %google%, metrics=[output_rows=10767, elapsed_compute=12.106422195s, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
|                   |           ParquetExec: limit=None, partitions={12 groups: [[hits.parquet:0..1231664704], [hits.parquet:1231664704..2463329408], [hits.parquet:2463329408..3694994112], [hits.parquet:3694994112..4926658816], [hits.parquet:4926658816..6158323520], [hits.parquet:6158323520..7389988224], [hits.parquet:7389988224..8621652928], [hits.parquet:8621652928..9853317632], [hits.parquet:9853317632..11084982336], [hits.parquet:11084982336..12316647040], [hits.parquet:12316647040..13548311744], [hits.parquet:13548311744..14779976446]]}, predicate=URL LIKE Utf8("%google%"), projection=[WatchID, JavaEnable, Title, GoodEvent, EventTime, EventDate, CounterID, ClientIP, RegionID, UserID, CounterClass, OS, UserAgent, URL, Referer, IsRefresh, RefererCategoryID, RefererRegionID, URLCategoryID, URLRegionID, ResolutionWidth, ResolutionHeight, ResolutionDepth, FlashMajor, FlashMinor, FlashMinor2, NetMajor, NetMinor, UserAgentMajor, UserAgentMinor, CookieEnable, JavascriptEnable, IsMobile, MobilePhone, MobilePhoneModel, Params, IPNetworkID, TraficSourceID, SearchEngineID, SearchPhrase, AdvEngineID, IsArtifical, WindowClientWidth, WindowClientHeight, ClientTimeZone, ClientEventTime, SilverlightVersion1, SilverlightVersion2, SilverlightVersion3, SilverlightVersion4, PageCharset, CodeVersion, IsLink, IsDownload, IsNotBounce, FUniqID, OriginalURL, HID, IsOldCounter, IsEvent, IsParameter, DontCountHits, WithHash, HitColor, LocalEventTime, Age, Sex, Income, Interests, Robotness, RemoteIP, WindowName, OpenerName, HistoryLength, BrowserLanguage, BrowserCountry, SocialNetwork, SocialAction, HTTPError, SendTiming, DNSTiming, ConnectTiming, ResponseStartTiming, ResponseEndTiming, FetchTiming, SocialSourceNetworkID, SocialSourcePage, ParamPrice, ParamOrderID, ParamCurrency, ParamCurrencyID, OpenstatServiceName, OpenstatCampaignID, OpenstatAdID, OpenstatSourceID, UTMSource, UTMMedium, UTMCampaign, UTMContent, UTMTerm, FromTag, HasGCLID, RefererHash, URLHash, CLID], metrics=[output_rows=73363236, elapsed_compute=12ns, spill_count=0, spilled_bytes=0, mem_used=0, predicate_evaluation_errors=0, pushdown_rows_filtered=0, row_groups_pruned=0, bytes_scanned=11496382994, page_index_rows_filtered=0, num_predicate_creation_errors=0, time_elapsed_scanning_total=151.775198747s, pushdown_eval_time=24ns, time_elapsed_processing=133.355923578s, time_elapsed_opening=1.135712999s, page_index_eval_time=24ns, time_elapsed_scanning_until_data=2.005873968s]                                                                                                                                                                                      |
|                   |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set. Query took 12.817 seconds.

DuckDB Plan

> explain ANALYZE SELECT * FROM read_parquet('hits.parquet') WHERE URL LIKE '%google%' LIMIT 1;

┌─────────────────────────────┐
│┌───────────────────────────┐│
│└───────────────────────────┘│
└─────────────────────────────┘
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
explain ANALYZE SELECT * FROM read_parquet('hits.parquet') WHERE URL LIKE '%google%' LIMIT 1;
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││         Total Time: 0.475s        ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│      EXPLAIN_ANALYZE      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           LIMIT           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             1             │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          WatchID          │
│         JavaEnable        │
│           Title           │
│         GoodEvent         │
│         EventTime         │
│         EventDate         │
│         CounterID         │
│          ClientIP         │
│          RegionID         │
│           UserID          │
│        CounterClass       │
│             OS            │
│         UserAgent         │
│            URL            │
│          Referer          │
│         IsRefresh         │
│     RefererCategoryID     │
│      RefererRegionID      │
│       URLCategoryID       │
│        URLRegionID        │
│      ResolutionWidth      │
│      ResolutionHeight     │
│      ResolutionDepth      │
│         FlashMajor        │
│         FlashMinor        │
│        FlashMinor2        │
│          NetMajor         │
│          NetMinor         │
│       UserAgentMajor      │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           FILTER          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│  contains(URL, 'google')  │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             29            │
│          (0.07s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│        PARQUET_SCAN       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            URL            │
│          WatchID          │
│         JavaEnable        │
│           Title           │
│         GoodEvent         │
│         EventTime         │
│         EventDate         │
│         CounterID         │
│          ClientIP         │
│          RegionID         │
│           UserID          │
│        CounterClass       │
│             OS            │
│         UserAgent         │
│          Referer          │
│         IsRefresh         │
│     RefererCategoryID     │
│      RefererRegionID      │
│       URLCategoryID       │
│        URLRegionID        │
│      ResolutionWidth      │
│      ResolutionHeight     │
│      ResolutionDepth      │
│         FlashMajor        │
│         FlashMinor        │
│        FlashMinor2        │
│          NetMajor         │
│          NetMinor         │
│       UserAgentMajor      │
└───────────────────────────┘
Run Time (s): real 0.590 user 1.888174 sys 0.381627
jychen7 commented 1 year ago

FilterExec: URL@13 LIKE %google%, metrics=[output_rows=10767

looks like Datafusion does no push down limit to parquet scanner?

update: seems duplicate with

sundy-li commented 1 year ago

TopK is a partial factor but it is not the best.

  1. Lazy projection(aka Later projection) can improve this case mostly, with this we just fetch URL column at the first query and apply the order limit then projection other columns by rowids.
  2. URL is a large binary column in the hits dataset, duckdb optimized reading parquet to it's memory model (reused the original buffer). You can prove that by select max(URL) from table
jychen7 commented 1 year ago

Lazy projection(aka Later projection) can improve this case mostly, with this we just fetch URL column at the first query and apply the order limit then projection other columns by rowids.

@sundy-li , you are right, select one URL column makes datafusion and duckdb closer, 1.75x.

# datafusion v19.rc1
> SELECT "URL" FROM hits WHERE "URL" LIKE '%google%';
// 15911 rows in set. Query took 5.626 seconds

# duckdb v0.6.1
> SELECT URL FROM read_parquet('hits.parquet') WHERE URL LIKE '%google%';
// 15911 rows. Run Time (s): real 3.207 user 34.520757 sys 1.555060

URL is a large binary column in the hits dataset, duckdb optimized reading parquet to it's memory model (reused the original buffer). You can prove that by select max(URL) from table

@sundy-li thanks for the info. Do you happen to know the code/blog link to "reused the original buffer"? I did a test and found datafusion and duckdb performs basically same in SELECT max("URL") FROM hits

# datafusion v19.rc1
SELECT max("URL") FROM hits;
+-----------------------------------------+
| MAX(hits.URL)                           |
+-----------------------------------------+
| https://yugra-advert2792270][to]=&input |
+-----------------------------------------+
1 row in set. Query took 2.726 seconds.

# duckdb v0.6.1
┌─────────────────────────────────────────┐
│               max("URL")                │
│                 varchar                 │
├─────────────────────────────────────────┤
│ https://yugra-advert2792270][to]=&input │
└─────────────────────────────────────────┘
Run Time (s): real 2.746 user 28.837000 sys 2.205152
sundy-li commented 1 year ago

Do you happen to know the code/blog link to "reused the original buffer"

I'm sorry, arrow-rs already implement it, arrow2 had this issue.

jychen7 commented 1 year ago

I'm sorry, arrow-rs already implement it, arrow2 had this issue.

@sundy-li TIL, thank you

sundy-li commented 1 year ago

@jychen7 I checked in my 16-core linux with SSD, duckdb read parquet still faster.

duckdb v0.6.0

D CREATE VIEW hits AS
> SELECT *
> REPLACE
> (epoch_ms(EventTime * 1000) AS EventTime,
>  DATE '1970-01-01' + INTERVAL (EventDate) DAYS AS EventDate)
> FROM read_parquet('hits.parquet', binary_as_string=True);
D
D .timer on
D select count(1),  max(URL) from hits;
┌──────────┬─────────────────────────────────────────┐
│ count(1) │               max("URL")                │
│  int64   │                 varchar                 │
├──────────┼─────────────────────────────────────────┤
│ 99997497 │ https://yugra-advert2792270][to]=&input │
└──────────┴─────────────────────────────────────────┘
Run Time (s): real 0.957 user 21.641735 sys 4.245200

datafusion:

❯ select max("URL") from hits;
+-----------------------------------------+
| MAX(hits.URL)                           |
+-----------------------------------------+
| https://yugra-advert2792270][to]=&input |
+-----------------------------------------+
1 row in set. Query took 2.849 seconds.

Do you happen to know the code/blog link to "reused the original buffer

I used to look at these codes, duckdb's memory model for String column did not strictly follow as Arrow.

Not sure but may related to: https://github.com/duckdb/duckdb/blob/master/src/common/types/vector.cpp#L1428-L1436

If the original parquet column data is plain encoding, it could directly use the io buffer by reference without extra allocation and clone.

tustvold commented 1 year ago

Lazy projection(aka Later projection) can improve this case mostly

Support for this was added to parquet by @thinkharderdev not too long ago, it may just be a case of hooking it up - https://github.com/apache/arrow-rs/pull/3633

Edit: perhaps you may be able to try with https://github.com/apache/arrow-datafusion/pull/5416 and the various predicate pushdown features enabled on ParquetOptions

jychen7 commented 1 year ago

I checked in my 16-core linux with SSD, duckdb read parquet still faster. DuckDB v0.6.0: 0.957s Datafusion 2.849s

@sundy-li my macOS only have 6 Core, but inspiring by your message, I find out github build version is faster than brew one (Brew is macOS package manager). More details here: https://github.com/duckdb/duckdb/issues/6495

jychen7 commented 1 year ago

perhaps you may be able to try with https://github.com/apache/arrow-datafusion/pull/5416 and the various predicate pushdown features enabled on ParquetOptions

@tustvold thank you, nice work! let me try main (as of now, c676d1026f8fa25c1495ec139b8a379ce1f2f86b) which includes #5416


update: yes, 2x faster

before after %
GlobalLimitExec 55 µs 86 µs
CoalescePartitionsExec 19 µs 20 µs
CoalesceBatchesExec 25 ms 2 ms
FilterExec 12000 ms 18 ms
ParquetExec pushdown_rows_filtered 0 90983466 rows are filtered in parquet reader
ParquetExec time_elapsed_scanning_total 151 s 73 s 200% faster

I locally turn on pushdown_filters: bool, default = true, the config was https://github.com/apache/arrow-datafusion/blob/c676d1026f8fa25c1495ec139b8a379ce1f2f86b/datafusion/common/src/config.rs#L248-L250

plan result

❯ explain SELECT * FROM hits WHERE "URL" LIKE '%google%' limit 1;
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Limit: skip=0, fetch=1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
|               |   Filter: hits.URL LIKE Utf8("%google%")                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
|               |     TableScan: hits projection=[WatchID, JavaEnable, Title, GoodEvent, EventTime, EventDate, CounterID, ClientIP, RegionID, UserID, CounterClass, OS, UserAgent, URL, Referer, IsRefresh, RefererCategoryID, RefererRegionID, URLCategoryID, URLRegionID, ResolutionWidth, ResolutionHeight, ResolutionDepth, FlashMajor, FlashMinor, FlashMinor2, NetMajor, NetMinor, UserAgentMajor, UserAgentMinor, CookieEnable, JavascriptEnable, IsMobile, MobilePhone, MobilePhoneModel, Params, IPNetworkID, TraficSourceID, SearchEngineID, SearchPhrase, AdvEngineID, IsArtifical, WindowClientWidth, WindowClientHeight, ClientTimeZone, ClientEventTime, SilverlightVersion1, SilverlightVersion2, SilverlightVersion3, SilverlightVersion4, PageCharset, CodeVersion, IsLink, IsDownload, IsNotBounce, FUniqID, OriginalURL, HID, IsOldCounter, IsEvent, IsParameter, DontCountHits, WithHash, HitColor, LocalEventTime, Age, Sex, Income, Interests, Robotness, RemoteIP, WindowName, OpenerName, HistoryLength, BrowserLanguage, BrowserCountry, SocialNetwork, SocialAction, HTTPError, SendTiming, DNSTiming, ConnectTiming, ResponseStartTiming, ResponseEndTiming, FetchTiming, SocialSourceNetworkID, SocialSourcePage, ParamPrice, ParamOrderID, ParamCurrency, ParamCurrencyID, OpenstatServiceName, OpenstatCampaignID, OpenstatAdID, OpenstatSourceID, UTMSource, UTMMedium, UTMCampaign, UTMContent, UTMTerm, FromTag, HasGCLID, RefererHash, URLHash, CLID], partial_filters=[hits.URL LIKE Utf8("%google%")]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| physical_plan | GlobalLimitExec: skip=0, fetch=1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
|               |   CoalescePartitionsExec                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
|               |     CoalesceBatchesExec: target_batch_size=8192                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
|               |       FilterExec: URL@13 LIKE %google%                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
|               |         ParquetExec: limit=None, partitions={12 groups: [[hits.parquet:0..1231664704], [hits.parquet:1231664704..2463329408], [hits.parquet:2463329408..3694994112], [hits.parquet:3694994112..4926658816], [hits.parquet:4926658816..6158323520], [hits.parquet:6158323520..7389988224], [hits.parquet:7389988224..8621652928], [hits.parquet:8621652928..9853317632], [hits.parquet:9853317632..11084982336], [hits.parquet:11084982336..12316647040], [hits.parquet:12316647040..13548311744], [hits.parquet:13548311744..14779976446]]}, predicate=URL@13 LIKE %google%, projection=[WatchID, JavaEnable, Title, GoodEvent, EventTime, EventDate, CounterID, ClientIP, RegionID, UserID, CounterClass, OS, UserAgent, URL, Referer, IsRefresh, RefererCategoryID, RefererRegionID, URLCategoryID, URLRegionID, ResolutionWidth, ResolutionHeight, ResolutionDepth, FlashMajor, FlashMinor, FlashMinor2, NetMajor, NetMinor, UserAgentMajor, UserAgentMinor, CookieEnable, JavascriptEnable, IsMobile, MobilePhone, MobilePhoneModel, Params, IPNetworkID, TraficSourceID, SearchEngineID, SearchPhrase, AdvEngineID, IsArtifical, WindowClientWidth, WindowClientHeight, ClientTimeZone, ClientEventTime, SilverlightVersion1, SilverlightVersion2, SilverlightVersion3, SilverlightVersion4, PageCharset, CodeVersion, IsLink, IsDownload, IsNotBounce, FUniqID, OriginalURL, HID, IsOldCounter, IsEvent, IsParameter, DontCountHits, WithHash, HitColor, LocalEventTime, Age, Sex, Income, Interests, Robotness, RemoteIP, WindowName, OpenerName, HistoryLength, BrowserLanguage, BrowserCountry, SocialNetwork, SocialAction, HTTPError, SendTiming, DNSTiming, ConnectTiming, ResponseStartTiming, ResponseEndTiming, FetchTiming, SocialSourceNetworkID, SocialSourcePage, ParamPrice, ParamOrderID, ParamCurrency, ParamCurrencyID, OpenstatServiceName, OpenstatCampaignID, OpenstatAdID, OpenstatSourceID, UTMSource, UTMMedium, UTMCampaign, UTMContent, UTMTerm, FromTag, HasGCLID, RefererHash, URLHash, CLID] |
|               |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.055 seconds.
❯ explain analyze SELECT * FROM hits WHERE "URL" LIKE '%google%' limit 1;
+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type         | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | GlobalLimitExec: skip=0, fetch=1, metrics=[output_rows=1, elapsed_compute=86.566µs, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|                   |   CoalescePartitionsExec, metrics=[output_rows=410, elapsed_compute=20.659µs, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
|                   |     CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=410, elapsed_compute=2.098602ms, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
|                   |       FilterExec: URL@13 LIKE %google%, metrics=[output_rows=13080, elapsed_compute=18.604972ms, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
|                   |         ParquetExec: limit=None, partitions={12 groups: [[hits.parquet:0..1231664704], [hits.parquet:1231664704..2463329408], [hits.parquet:2463329408..3694994112], [hits.parquet:3694994112..4926658816], [hits.parquet:4926658816..6158323520], [hits.parquet:6158323520..7389988224], [hits.parquet:7389988224..8621652928], [hits.parquet:8621652928..9853317632], [hits.parquet:9853317632..11084982336], [hits.parquet:11084982336..12316647040], [hits.parquet:12316647040..13548311744], [hits.parquet:13548311744..14779976446]]}, predicate=URL@13 LIKE %google%, projection=[WatchID, JavaEnable, Title, GoodEvent, EventTime, EventDate, CounterID, ClientIP, RegionID, UserID, CounterClass, OS, UserAgent, URL, Referer, IsRefresh, RefererCategoryID, RefererRegionID, URLCategoryID, URLRegionID, ResolutionWidth, ResolutionHeight, ResolutionDepth, FlashMajor, FlashMinor, FlashMinor2, NetMajor, NetMinor, UserAgentMajor, UserAgentMinor, CookieEnable, JavascriptEnable, IsMobile, MobilePhone, MobilePhoneModel, Params, IPNetworkID, TraficSourceID, SearchEngineID, SearchPhrase, AdvEngineID, IsArtifical, WindowClientWidth, WindowClientHeight, ClientTimeZone, ClientEventTime, SilverlightVersion1, SilverlightVersion2, SilverlightVersion3, SilverlightVersion4, PageCharset, CodeVersion, IsLink, IsDownload, IsNotBounce, FUniqID, OriginalURL, HID, IsOldCounter, IsEvent, IsParameter, DontCountHits, WithHash, HitColor, LocalEventTime, Age, Sex, Income, Interests, Robotness, RemoteIP, WindowName, OpenerName, HistoryLength, BrowserLanguage, BrowserCountry, SocialNetwork, SocialAction, HTTPError, SendTiming, DNSTiming, ConnectTiming, ResponseStartTiming, ResponseEndTiming, FetchTiming, SocialSourceNetworkID, SocialSourcePage, ParamPrice, ParamOrderID, ParamCurrency, ParamCurrencyID, OpenstatServiceName, OpenstatCampaignID, OpenstatAdID, OpenstatSourceID, UTMSource, UTMMedium, UTMCampaign, UTMContent, UTMTerm, FromTag, HasGCLID, RefererHash, URLHash, CLID], metrics=[output_rows=13080, elapsed_compute=12ns, spill_count=0, spilled_bytes=0, mem_used=0, bytes_scanned=13133641179, row_groups_pruned=0, predicate_evaluation_errors=0, num_predicate_creation_errors=0, pushdown_rows_filtered=90983466, page_index_rows_filtered=0, time_elapsed_opening=1.13293624s, time_elapsed_scanning_until_data=5.14554734s, time_elapsed_processing=66.078927867s, pushdown_eval_time=16.006566268s, time_elapsed_scanning_total=73.795071527s, page_index_eval_time=24ns] |
|                   |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set. Query took 6.489 seconds.
jychen7 commented 1 year ago

Perhaps you may be able to try with https://github.com/apache/arrow-datafusion/pull/5416 and the various predicate pushdown features enabled on ParquetOptions

@tustvold based on the explain result in https://github.com/apache/arrow-datafusion/issues/5404#issuecomment-1447472667, Expect: predicate pushdown, limit pushdown Actual: predicate pushdown, limit NOT pushdown

I understand that your PR https://github.com/apache/arrow-datafusion/pull/5416 support limit pushdown in the physical plan, but looks like the above query does not have a limit pushdown in the logical plan or physical plan.

Do you think this is something to improve ❓ I haven't checked the related code in the logical plan, so I will try to take a look tomorrow. (I feel like when ALL predicates are pushed down, we can push down the limit as well. Does this sound cool?)

tustvold commented 1 year ago

It may be https://github.com/apache/arrow-datafusion/issues/4028 is required for this

jychen7 commented 1 year ago

It may be https://github.com/apache/arrow-datafusion/issues/4028 is required for this

yes, I think https://github.com/apache/arrow-datafusion/issues/4028 should help this one 👍 (I try hardcode TableProviderFilterPushDown::Exact for the above query and the limit is pushed down. In this case, it improves from 6s to 0.2s)

alamb commented 1 month ago

I ran this again today

Datafusion

datafusion-cli -c "SELECT * FROM 'hits.parquet' WHERE \"URL\" LIKE '%google%' LIMIT 1;"
...
real    0m3.734s
user    0m49.389s
sys 0m2.288s

DuckDB

time duckdb -c "SELECT * FROM read_parquet('hits.parquet') WHERE URL LIKE '%google%' LIMIT 1;"
...
real    0m0.273s
user    0m0.863s
sys 0m0.884s

I will attempt to get https://github.com/apache/datafusion/issues/4028 done soon