Open jychen7 opened 1 year ago
Additional context Datafusion Plan
> explain ANALYZE SELECT * FROM hits WHERE "URL" LIKE '%google%' limit 1;
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | ProjectionExec: expr=[WatchID@0 as WatchID, JavaEnable@1 as JavaEnable, Title@2 as Title, GoodEvent@3 as GoodEvent, EventTime@4 as EventTime, EventDate@5 as EventDate, CounterID@6 as CounterID, ClientIP@7 as ClientIP, RegionID@8 as RegionID, UserID@9 as UserID, CounterClass@10 as CounterClass, OS@11 as OS, UserAgent@12 as UserAgent, URL@13 as URL, Referer@14 as Referer, IsRefresh@15 as IsRefresh, RefererCategoryID@16 as RefererCategoryID, RefererRegionID@17 as RefererRegionID, URLCategoryID@18 as URLCategoryID, URLRegionID@19 as URLRegionID, ResolutionWidth@20 as ResolutionWidth, ResolutionHeight@21 as ResolutionHeight, ResolutionDepth@22 as ResolutionDepth, FlashMajor@23 as FlashMajor, FlashMinor@24 as FlashMinor, FlashMinor2@25 as FlashMinor2, NetMajor@26 as NetMajor, NetMinor@27 as NetMinor, UserAgentMajor@28 as UserAgentMajor, UserAgentMinor@29 as UserAgentMinor, CookieEnable@30 as CookieEnable, JavascriptEnable@31 as JavascriptEnable, IsMobile@32 as IsMobile, MobilePhone@33 as MobilePhone, MobilePhoneModel@34 as MobilePhoneModel, Params@35 as Params, IPNetworkID@36 as IPNetworkID, TraficSourceID@37 as TraficSourceID, SearchEngineID@38 as SearchEngineID, SearchPhrase@39 as SearchPhrase, AdvEngineID@40 as AdvEngineID, IsArtifical@41 as IsArtifical, WindowClientWidth@42 as WindowClientWidth, WindowClientHeight@43 as WindowClientHeight, ClientTimeZone@44 as ClientTimeZone, ClientEventTime@45 as ClientEventTime, SilverlightVersion1@46 as SilverlightVersion1, SilverlightVersion2@47 as SilverlightVersion2, SilverlightVersion3@48 as SilverlightVersion3, SilverlightVersion4@49 as SilverlightVersion4, PageCharset@50 as PageCharset, CodeVersion@51 as CodeVersion, IsLink@52 as IsLink, IsDownload@53 as IsDownload, IsNotBounce@54 as IsNotBounce, FUniqID@55 as FUniqID, OriginalURL@56 as OriginalURL, HID@57 as HID, IsOldCounter@58 as IsOldCounter, IsEvent@59 as IsEvent, IsParameter@60 as IsParameter, DontCountHits@61 as DontCountHits, WithHash@62 as WithHash, HitColor@63 as HitColor, LocalEventTime@64 as LocalEventTime, Age@65 as Age, Sex@66 as Sex, Income@67 as Income, Interests@68 as Interests, Robotness@69 as Robotness, RemoteIP@70 as RemoteIP, WindowName@71 as WindowName, OpenerName@72 as OpenerName, HistoryLength@73 as HistoryLength, BrowserLanguage@74 as BrowserLanguage, BrowserCountry@75 as BrowserCountry, SocialNetwork@76 as SocialNetwork, SocialAction@77 as SocialAction, HTTPError@78 as HTTPError, SendTiming@79 as SendTiming, DNSTiming@80 as DNSTiming, ConnectTiming@81 as ConnectTiming, ResponseStartTiming@82 as ResponseStartTiming, ResponseEndTiming@83 as ResponseEndTiming, FetchTiming@84 as FetchTiming, SocialSourceNetworkID@85 as SocialSourceNetworkID, SocialSourcePage@86 as SocialSourcePage, ParamPrice@87 as ParamPrice, ParamOrderID@88 as ParamOrderID, ParamCurrency@89 as ParamCurrency, ParamCurrencyID@90 as ParamCurrencyID, OpenstatServiceName@91 as OpenstatServiceName, OpenstatCampaignID@92 as OpenstatCampaignID, OpenstatAdID@93 as OpenstatAdID, OpenstatSourceID@94 as OpenstatSourceID, UTMSource@95 as UTMSource, UTMMedium@96 as UTMMedium, UTMCampaign@97 as UTMCampaign, UTMContent@98 as UTMContent, UTMTerm@99 as UTMTerm, FromTag@100 as FromTag, HasGCLID@101 as HasGCLID, RefererHash@102 as RefererHash, URLHash@103 as URLHash, CLID@104 as CLID], metrics=[output_rows=1, elapsed_compute=21.342µs, spill_count=0, spilled_bytes=0, mem_used=0] |
| | GlobalLimitExec: skip=0, fetch=1, metrics=[output_rows=1, elapsed_compute=55.749µs, spill_count=0, spilled_bytes=0, mem_used=0] |
| | CoalescePartitionsExec, metrics=[output_rows=410, elapsed_compute=19.372µs, spill_count=0, spilled_bytes=0, mem_used=0] |
| | CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=410, elapsed_compute=25.377822ms, spill_count=0, spilled_bytes=0, mem_used=0] |
| | FilterExec: URL@13 LIKE %google%, metrics=[output_rows=10767, elapsed_compute=12.106422195s, spill_count=0, spilled_bytes=0, mem_used=0] |
| | ParquetExec: limit=None, partitions={12 groups: [[hits.parquet:0..1231664704], [hits.parquet:1231664704..2463329408], [hits.parquet:2463329408..3694994112], [hits.parquet:3694994112..4926658816], [hits.parquet:4926658816..6158323520], [hits.parquet:6158323520..7389988224], [hits.parquet:7389988224..8621652928], [hits.parquet:8621652928..9853317632], [hits.parquet:9853317632..11084982336], [hits.parquet:11084982336..12316647040], [hits.parquet:12316647040..13548311744], [hits.parquet:13548311744..14779976446]]}, predicate=URL LIKE Utf8("%google%"), projection=[WatchID, JavaEnable, Title, GoodEvent, EventTime, EventDate, CounterID, ClientIP, RegionID, UserID, CounterClass, OS, UserAgent, URL, Referer, IsRefresh, RefererCategoryID, RefererRegionID, URLCategoryID, URLRegionID, ResolutionWidth, ResolutionHeight, ResolutionDepth, FlashMajor, FlashMinor, FlashMinor2, NetMajor, NetMinor, UserAgentMajor, UserAgentMinor, CookieEnable, JavascriptEnable, IsMobile, MobilePhone, MobilePhoneModel, Params, IPNetworkID, TraficSourceID, SearchEngineID, SearchPhrase, AdvEngineID, IsArtifical, WindowClientWidth, WindowClientHeight, ClientTimeZone, ClientEventTime, SilverlightVersion1, SilverlightVersion2, SilverlightVersion3, SilverlightVersion4, PageCharset, CodeVersion, IsLink, IsDownload, IsNotBounce, FUniqID, OriginalURL, HID, IsOldCounter, IsEvent, IsParameter, DontCountHits, WithHash, HitColor, LocalEventTime, Age, Sex, Income, Interests, Robotness, RemoteIP, WindowName, OpenerName, HistoryLength, BrowserLanguage, BrowserCountry, SocialNetwork, SocialAction, HTTPError, SendTiming, DNSTiming, ConnectTiming, ResponseStartTiming, ResponseEndTiming, FetchTiming, SocialSourceNetworkID, SocialSourcePage, ParamPrice, ParamOrderID, ParamCurrency, ParamCurrencyID, OpenstatServiceName, OpenstatCampaignID, OpenstatAdID, OpenstatSourceID, UTMSource, UTMMedium, UTMCampaign, UTMContent, UTMTerm, FromTag, HasGCLID, RefererHash, URLHash, CLID], metrics=[output_rows=73363236, elapsed_compute=12ns, spill_count=0, spilled_bytes=0, mem_used=0, predicate_evaluation_errors=0, pushdown_rows_filtered=0, row_groups_pruned=0, bytes_scanned=11496382994, page_index_rows_filtered=0, num_predicate_creation_errors=0, time_elapsed_scanning_total=151.775198747s, pushdown_eval_time=24ns, time_elapsed_processing=133.355923578s, time_elapsed_opening=1.135712999s, page_index_eval_time=24ns, time_elapsed_scanning_until_data=2.005873968s] |
| | |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set. Query took 12.817 seconds.
DuckDB Plan
> explain ANALYZE SELECT * FROM read_parquet('hits.parquet') WHERE URL LIKE '%google%' LIMIT 1;
┌─────────────────────────────┐
│┌───────────────────────────┐│
│└───────────────────────────┘│
└─────────────────────────────┘
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Query Profiling Information ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
explain ANALYZE SELECT * FROM read_parquet('hits.parquet') WHERE URL LIKE '%google%' LIMIT 1;
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Total Time: 0.475s ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│ EXPLAIN_ANALYZE │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 0 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ LIMIT │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 1 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ WatchID │
│ JavaEnable │
│ Title │
│ GoodEvent │
│ EventTime │
│ EventDate │
│ CounterID │
│ ClientIP │
│ RegionID │
│ UserID │
│ CounterClass │
│ OS │
│ UserAgent │
│ URL │
│ Referer │
│ IsRefresh │
│ RefererCategoryID │
│ RefererRegionID │
│ URLCategoryID │
│ URLRegionID │
│ ResolutionWidth │
│ ResolutionHeight │
│ ResolutionDepth │
│ FlashMajor │
│ FlashMinor │
│ FlashMinor2 │
│ NetMajor │
│ NetMinor │
│ UserAgentMajor │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ FILTER │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ contains(URL, 'google') │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 29 │
│ (0.07s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PARQUET_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ URL │
│ WatchID │
│ JavaEnable │
│ Title │
│ GoodEvent │
│ EventTime │
│ EventDate │
│ CounterID │
│ ClientIP │
│ RegionID │
│ UserID │
│ CounterClass │
│ OS │
│ UserAgent │
│ Referer │
│ IsRefresh │
│ RefererCategoryID │
│ RefererRegionID │
│ URLCategoryID │
│ URLRegionID │
│ ResolutionWidth │
│ ResolutionHeight │
│ ResolutionDepth │
│ FlashMajor │
│ FlashMinor │
│ FlashMinor2 │
│ NetMajor │
│ NetMinor │
│ UserAgentMajor │
└───────────────────────────┘
Run Time (s): real 0.590 user 1.888174 sys 0.381627
FilterExec: URL@13 LIKE %google%, metrics=[output_rows=10767
looks like Datafusion does no push down limit
to parquet scanner?
update: seems duplicate with
TopK is a partial factor but it is not the best.
URL
column at the first query and apply the order limit then projection other columns by rowids.URL
is a large binary column in the hits dataset, duckdb optimized reading parquet to it's memory model (reused the original buffer). You can prove that by select max(URL) from table
Lazy projection(aka Later projection) can improve this case mostly, with this we just fetch URL column at the first query and apply the order limit then projection other columns by rowids.
@sundy-li , you are right, select one URL
column makes datafusion and duckdb closer, 1.75x.
# datafusion v19.rc1
> SELECT "URL" FROM hits WHERE "URL" LIKE '%google%';
// 15911 rows in set. Query took 5.626 seconds
# duckdb v0.6.1
> SELECT URL FROM read_parquet('hits.parquet') WHERE URL LIKE '%google%';
// 15911 rows. Run Time (s): real 3.207 user 34.520757 sys 1.555060
URL is a large binary column in the hits dataset, duckdb optimized reading parquet to it's memory model (reused the original buffer). You can prove that by select max(URL) from table
@sundy-li thanks for the info. Do you happen to know the code/blog link to "reused the original buffer"?
I did a test and found datafusion and duckdb performs basically same in SELECT max("URL") FROM hits
# datafusion v19.rc1
SELECT max("URL") FROM hits;
+-----------------------------------------+
| MAX(hits.URL) |
+-----------------------------------------+
| https://yugra-advert2792270][to]=&input |
+-----------------------------------------+
1 row in set. Query took 2.726 seconds.
# duckdb v0.6.1
┌─────────────────────────────────────────┐
│ max("URL") │
│ varchar │
├─────────────────────────────────────────┤
│ https://yugra-advert2792270][to]=&input │
└─────────────────────────────────────────┘
Run Time (s): real 2.746 user 28.837000 sys 2.205152
Do you happen to know the code/blog link to "reused the original buffer"
I'm sorry, arrow-rs
already implement it, arrow2 had this issue.
I'm sorry, arrow-rs already implement it, arrow2 had this issue.
@sundy-li TIL, thank you
@jychen7 I checked in my 16-core linux with SSD, duckdb read parquet still faster.
duckdb v0.6.0
D CREATE VIEW hits AS
> SELECT *
> REPLACE
> (epoch_ms(EventTime * 1000) AS EventTime,
> DATE '1970-01-01' + INTERVAL (EventDate) DAYS AS EventDate)
> FROM read_parquet('hits.parquet', binary_as_string=True);
D
D .timer on
D select count(1), max(URL) from hits;
┌──────────┬─────────────────────────────────────────┐
│ count(1) │ max("URL") │
│ int64 │ varchar │
├──────────┼─────────────────────────────────────────┤
│ 99997497 │ https://yugra-advert2792270][to]=&input │
└──────────┴─────────────────────────────────────────┘
Run Time (s): real 0.957 user 21.641735 sys 4.245200
datafusion:
❯ select max("URL") from hits;
+-----------------------------------------+
| MAX(hits.URL) |
+-----------------------------------------+
| https://yugra-advert2792270][to]=&input |
+-----------------------------------------+
1 row in set. Query took 2.849 seconds.
Do you happen to know the code/blog link to "reused the original buffer
I used to look at these codes, duckdb's memory model for String column
did not strictly follow as Arrow.
Not sure but may related to: https://github.com/duckdb/duckdb/blob/master/src/common/types/vector.cpp#L1428-L1436
If the original parquet column data is plain encoding
, it could directly use the io buffer by reference without extra allocation and clone.
Lazy projection(aka Later projection) can improve this case mostly
Support for this was added to parquet by @thinkharderdev not too long ago, it may just be a case of hooking it up - https://github.com/apache/arrow-rs/pull/3633
Edit: perhaps you may be able to try with https://github.com/apache/arrow-datafusion/pull/5416 and the various predicate pushdown features enabled on ParquetOptions
I checked in my 16-core linux with SSD, duckdb read parquet still faster. DuckDB v0.6.0: 0.957s Datafusion 2.849s
@sundy-li my macOS only have 6 Core, but inspiring by your message, I find out github build version is faster than brew one (Brew is macOS package manager). More details here: https://github.com/duckdb/duckdb/issues/6495
perhaps you may be able to try with https://github.com/apache/arrow-datafusion/pull/5416 and the various predicate pushdown features enabled on ParquetOptions
@tustvold thank you, nice work! let me try main (as of now, c676d1026f8fa25c1495ec139b8a379ce1f2f86b) which includes #5416
update: yes, 2x faster
before | after | % | |
---|---|---|---|
GlobalLimitExec | 55 µs | 86 µs | |
CoalescePartitionsExec | 19 µs | 20 µs | |
CoalesceBatchesExec | 25 ms | 2 ms | |
FilterExec | 12000 ms | 18 ms | |
ParquetExec pushdown_rows_filtered | 0 | 90983466 | rows are filtered in parquet reader |
ParquetExec time_elapsed_scanning_total | 151 s | 73 s | 200% faster |
I locally turn on pushdown_filters: bool, default = true
, the config was
https://github.com/apache/arrow-datafusion/blob/c676d1026f8fa25c1495ec139b8a379ce1f2f86b/datafusion/common/src/config.rs#L248-L250
plan result
❯ explain SELECT * FROM hits WHERE "URL" LIKE '%google%' limit 1;
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Limit: skip=0, fetch=1 |
| | Filter: hits.URL LIKE Utf8("%google%") |
| | TableScan: hits projection=[WatchID, JavaEnable, Title, GoodEvent, EventTime, EventDate, CounterID, ClientIP, RegionID, UserID, CounterClass, OS, UserAgent, URL, Referer, IsRefresh, RefererCategoryID, RefererRegionID, URLCategoryID, URLRegionID, ResolutionWidth, ResolutionHeight, ResolutionDepth, FlashMajor, FlashMinor, FlashMinor2, NetMajor, NetMinor, UserAgentMajor, UserAgentMinor, CookieEnable, JavascriptEnable, IsMobile, MobilePhone, MobilePhoneModel, Params, IPNetworkID, TraficSourceID, SearchEngineID, SearchPhrase, AdvEngineID, IsArtifical, WindowClientWidth, WindowClientHeight, ClientTimeZone, ClientEventTime, SilverlightVersion1, SilverlightVersion2, SilverlightVersion3, SilverlightVersion4, PageCharset, CodeVersion, IsLink, IsDownload, IsNotBounce, FUniqID, OriginalURL, HID, IsOldCounter, IsEvent, IsParameter, DontCountHits, WithHash, HitColor, LocalEventTime, Age, Sex, Income, Interests, Robotness, RemoteIP, WindowName, OpenerName, HistoryLength, BrowserLanguage, BrowserCountry, SocialNetwork, SocialAction, HTTPError, SendTiming, DNSTiming, ConnectTiming, ResponseStartTiming, ResponseEndTiming, FetchTiming, SocialSourceNetworkID, SocialSourcePage, ParamPrice, ParamOrderID, ParamCurrency, ParamCurrencyID, OpenstatServiceName, OpenstatCampaignID, OpenstatAdID, OpenstatSourceID, UTMSource, UTMMedium, UTMCampaign, UTMContent, UTMTerm, FromTag, HasGCLID, RefererHash, URLHash, CLID], partial_filters=[hits.URL LIKE Utf8("%google%")] |
| physical_plan | GlobalLimitExec: skip=0, fetch=1 |
| | CoalescePartitionsExec |
| | CoalesceBatchesExec: target_batch_size=8192 |
| | FilterExec: URL@13 LIKE %google% |
| | ParquetExec: limit=None, partitions={12 groups: [[hits.parquet:0..1231664704], [hits.parquet:1231664704..2463329408], [hits.parquet:2463329408..3694994112], [hits.parquet:3694994112..4926658816], [hits.parquet:4926658816..6158323520], [hits.parquet:6158323520..7389988224], [hits.parquet:7389988224..8621652928], [hits.parquet:8621652928..9853317632], [hits.parquet:9853317632..11084982336], [hits.parquet:11084982336..12316647040], [hits.parquet:12316647040..13548311744], [hits.parquet:13548311744..14779976446]]}, predicate=URL@13 LIKE %google%, projection=[WatchID, JavaEnable, Title, GoodEvent, EventTime, EventDate, CounterID, ClientIP, RegionID, UserID, CounterClass, OS, UserAgent, URL, Referer, IsRefresh, RefererCategoryID, RefererRegionID, URLCategoryID, URLRegionID, ResolutionWidth, ResolutionHeight, ResolutionDepth, FlashMajor, FlashMinor, FlashMinor2, NetMajor, NetMinor, UserAgentMajor, UserAgentMinor, CookieEnable, JavascriptEnable, IsMobile, MobilePhone, MobilePhoneModel, Params, IPNetworkID, TraficSourceID, SearchEngineID, SearchPhrase, AdvEngineID, IsArtifical, WindowClientWidth, WindowClientHeight, ClientTimeZone, ClientEventTime, SilverlightVersion1, SilverlightVersion2, SilverlightVersion3, SilverlightVersion4, PageCharset, CodeVersion, IsLink, IsDownload, IsNotBounce, FUniqID, OriginalURL, HID, IsOldCounter, IsEvent, IsParameter, DontCountHits, WithHash, HitColor, LocalEventTime, Age, Sex, Income, Interests, Robotness, RemoteIP, WindowName, OpenerName, HistoryLength, BrowserLanguage, BrowserCountry, SocialNetwork, SocialAction, HTTPError, SendTiming, DNSTiming, ConnectTiming, ResponseStartTiming, ResponseEndTiming, FetchTiming, SocialSourceNetworkID, SocialSourcePage, ParamPrice, ParamOrderID, ParamCurrency, ParamCurrencyID, OpenstatServiceName, OpenstatCampaignID, OpenstatAdID, OpenstatSourceID, UTMSource, UTMMedium, UTMCampaign, UTMContent, UTMTerm, FromTag, HasGCLID, RefererHash, URLHash, CLID] |
| | |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.055 seconds.
❯ explain analyze SELECT * FROM hits WHERE "URL" LIKE '%google%' limit 1;
+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | GlobalLimitExec: skip=0, fetch=1, metrics=[output_rows=1, elapsed_compute=86.566µs, spill_count=0, spilled_bytes=0, mem_used=0] |
| | CoalescePartitionsExec, metrics=[output_rows=410, elapsed_compute=20.659µs, spill_count=0, spilled_bytes=0, mem_used=0] |
| | CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=410, elapsed_compute=2.098602ms, spill_count=0, spilled_bytes=0, mem_used=0] |
| | FilterExec: URL@13 LIKE %google%, metrics=[output_rows=13080, elapsed_compute=18.604972ms, spill_count=0, spilled_bytes=0, mem_used=0] |
| | ParquetExec: limit=None, partitions={12 groups: [[hits.parquet:0..1231664704], [hits.parquet:1231664704..2463329408], [hits.parquet:2463329408..3694994112], [hits.parquet:3694994112..4926658816], [hits.parquet:4926658816..6158323520], [hits.parquet:6158323520..7389988224], [hits.parquet:7389988224..8621652928], [hits.parquet:8621652928..9853317632], [hits.parquet:9853317632..11084982336], [hits.parquet:11084982336..12316647040], [hits.parquet:12316647040..13548311744], [hits.parquet:13548311744..14779976446]]}, predicate=URL@13 LIKE %google%, projection=[WatchID, JavaEnable, Title, GoodEvent, EventTime, EventDate, CounterID, ClientIP, RegionID, UserID, CounterClass, OS, UserAgent, URL, Referer, IsRefresh, RefererCategoryID, RefererRegionID, URLCategoryID, URLRegionID, ResolutionWidth, ResolutionHeight, ResolutionDepth, FlashMajor, FlashMinor, FlashMinor2, NetMajor, NetMinor, UserAgentMajor, UserAgentMinor, CookieEnable, JavascriptEnable, IsMobile, MobilePhone, MobilePhoneModel, Params, IPNetworkID, TraficSourceID, SearchEngineID, SearchPhrase, AdvEngineID, IsArtifical, WindowClientWidth, WindowClientHeight, ClientTimeZone, ClientEventTime, SilverlightVersion1, SilverlightVersion2, SilverlightVersion3, SilverlightVersion4, PageCharset, CodeVersion, IsLink, IsDownload, IsNotBounce, FUniqID, OriginalURL, HID, IsOldCounter, IsEvent, IsParameter, DontCountHits, WithHash, HitColor, LocalEventTime, Age, Sex, Income, Interests, Robotness, RemoteIP, WindowName, OpenerName, HistoryLength, BrowserLanguage, BrowserCountry, SocialNetwork, SocialAction, HTTPError, SendTiming, DNSTiming, ConnectTiming, ResponseStartTiming, ResponseEndTiming, FetchTiming, SocialSourceNetworkID, SocialSourcePage, ParamPrice, ParamOrderID, ParamCurrency, ParamCurrencyID, OpenstatServiceName, OpenstatCampaignID, OpenstatAdID, OpenstatSourceID, UTMSource, UTMMedium, UTMCampaign, UTMContent, UTMTerm, FromTag, HasGCLID, RefererHash, URLHash, CLID], metrics=[output_rows=13080, elapsed_compute=12ns, spill_count=0, spilled_bytes=0, mem_used=0, bytes_scanned=13133641179, row_groups_pruned=0, predicate_evaluation_errors=0, num_predicate_creation_errors=0, pushdown_rows_filtered=90983466, page_index_rows_filtered=0, time_elapsed_opening=1.13293624s, time_elapsed_scanning_until_data=5.14554734s, time_elapsed_processing=66.078927867s, pushdown_eval_time=16.006566268s, time_elapsed_scanning_total=73.795071527s, page_index_eval_time=24ns] |
| | |
+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set. Query took 6.489 seconds.
Perhaps you may be able to try with https://github.com/apache/arrow-datafusion/pull/5416 and the various predicate pushdown features enabled on ParquetOptions
@tustvold based on the explain
result in https://github.com/apache/arrow-datafusion/issues/5404#issuecomment-1447472667,
Expect: predicate pushdown, limit pushdown
Actual: predicate pushdown, limit NOT pushdown
I understand that your PR https://github.com/apache/arrow-datafusion/pull/5416 support limit pushdown
in the physical plan, but looks like the above query does not have a limit pushdown in the logical plan or physical plan.
Do you think this is something to improve ❓ I haven't checked the related code in the logical plan, so I will try to take a look tomorrow. (I feel like when ALL predicates are pushed down, we can push down the limit as well. Does this sound cool?)
It may be https://github.com/apache/arrow-datafusion/issues/4028 is required for this
It may be https://github.com/apache/arrow-datafusion/issues/4028 is required for this
yes, I think https://github.com/apache/arrow-datafusion/issues/4028 should help this one 👍
(I try hardcode TableProviderFilterPushDown::Exact
for the above query and the limit is pushed down. In this case, it improves from 6s
to 0.2s
)
I ran this again today
datafusion-cli -c "SELECT * FROM 'hits.parquet' WHERE \"URL\" LIKE '%google%' LIMIT 1;"
...
real 0m3.734s
user 0m49.389s
sys 0m2.288s
time duckdb -c "SELECT * FROM read_parquet('hits.parquet') WHERE URL LIKE '%google%' LIMIT 1;"
...
real 0m0.273s
user 0m0.863s
sys 0m0.884s
I will attempt to get https://github.com/apache/datafusion/issues/4028 done soon
Describe the problem This is NOT a bug, but an potential improvement goal
Datafusion v19.rc1 by default turn on
repartition_file_scans
at https://github.com/apache/arrow-datafusion/pull/5295with my local Macbook Pro (2.6 GHz 6-Core Intel Core i7, 32 GB 2667 MHz DDR4), for following query on clickbench 14GB
hits.parquet
:real 0.566 user 1.876031 sys 0.357483
To Reproduce Download data file
Prepare SQL create a file called
create.sql
create a file called
q23_no_order_limit_1.sql
Datafusion
DuckDB
Expected behavior