Describe the bug
Queries against tables registered with register_dataset() perform around 80x slower than those registered with register_parquet().
To Reproduce
import datafusion
import pyarrow.dataset as ds
from pathlib import Path
ctx = datafusion.SessionContext()
ctx.register_parquet("mytable", "*.parquet")
ctx.register_dataset("mytable2", ds.dataset(list(Path(".").glob("*.parquet"))))
Fast:
%time ctx.sql('select file_date, sum("Price" * "Volume") from mytable group by file_date order by file_date').to_arrow_table()
CPU times: user 2min 41s, sys: 3.35 s, total: 2min 45s
Wall time: 2.49 s
Slow:
%time ctx.sql('select file_date, sum("Price" * "Volume") from mytable2 group by file_date order by file_date').to_arrow_table()
CPU times: user 10min 51s, sys: 5min 40s, total: 16min 31s
Wall time: 3min 18s
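As a rough sanity check on the "80x" figure, the timings above can be turned into ratios. This is only arithmetic on the numbers reported by %time. The "effective parallelism" value (total CPU time divided by wall time) is a crude estimate of how many cores were kept busy, not a measurement.

```python
def to_seconds(minutes: float, seconds: float) -> float:
    """Convert a 'Xmin Ys' reading from %time into seconds."""
    return minutes * 60 + seconds

# Values copied from the %time output above.
fast_wall = 2.49                 # register_parquet: Wall time 2.49 s
slow_wall = to_seconds(3, 18)    # register_dataset: Wall time 3min 18s
fast_cpu = to_seconds(2, 45)     # register_parquet: total 2min 45s
slow_cpu = to_seconds(16, 31)    # register_dataset: total 16min 31s

# Wall-clock slowdown of the dataset-backed table.
slowdown = slow_wall / fast_wall

# Crude estimate of cores kept busy: total CPU time / wall time.
fast_parallelism = fast_cpu / fast_wall
slow_parallelism = slow_cpu / slow_wall

print(f"wall-clock slowdown: {slowdown:.1f}x")
print(f"effective parallelism (parquet): {fast_parallelism:.1f}")
print(f"effective parallelism (dataset): {slow_parallelism:.1f}")
```

If this estimate is in the right ballpark, the dataset-backed query is not only slower but also keeps far fewer cores busy than the register_parquet() one.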
Expected behavior
I'd expect these to have similar performance.
Additional context
I'm using ds.dataset because the actual files I'm interested in accessing are not conveniently globbable (they're spread across multiple directories). So ideally I'd be able to pass a list of files to ctx.register_parquet() instead of a single glob.
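To illustrate the multi-directory case, here is a minimal sketch of how the file list might be built. The helper name collect_parquet_files and the directory names are hypothetical. The commented-out registration mirrors the register_dataset() call from the reproduction above.

```python
from pathlib import Path
from typing import Iterable, List


def collect_parquet_files(directories: Iterable[str]) -> List[str]:
    """Return a sorted list of *.parquet paths found directly in each directory.

    Hypothetical helper for illustration; it does not recurse into subdirectories.
    """
    files: List[str] = []
    for directory in directories:
        files.extend(str(p) for p in Path(directory).glob("*.parquet"))
    return sorted(files)


# Example usage (directory names are placeholders):
#
#   import datafusion
#   import pyarrow.dataset as ds
#
#   ctx = datafusion.SessionContext()
#   files = collect_parquet_files(["data/2023", "data/2024"])
#   ctx.register_dataset("mytable2", ds.dataset(files, format="parquet"))
```

The paths are converted to strings before being handed to ds.dataset, which avoids relying on how pathlib.Path objects are handled.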
I'm having trouble reproducing this locally. Can you tell me which versions of datafusion and pyarrow you are using? Also, roughly how large is the dataset, and how many files is it split into?
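One way to gather that information is a short script like the one below. It is a sketch, not an official diagnostic: the helper names are made up for illustration, the package versions are read through the standard library's importlib.metadata, and the "*.parquet" pattern assumes the files sit in the current directory as in the reproduction.

```python
from importlib import metadata
from pathlib import Path


def package_version(name: str) -> str:
    """Return the installed version of a package, or 'not installed'."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def dataset_summary(root: str = ".", pattern: str = "*.parquet") -> dict:
    """Count matching files under root and sum their on-disk size in MiB."""
    files = list(Path(root).glob(pattern))
    total_bytes = sum(f.stat().st_size for f in files)
    return {"file_count": len(files), "total_mib": total_bytes / 2**20}


print("datafusion:", package_version("datafusion"))
print("pyarrow:", package_version("pyarrow"))
print(dataset_summary())
```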