CaselIT opened this issue 11 months ago (status: Open)
You are rescanning the whole subdirectory on every invocation of pyarrow.parquet.read_table.
Scanning the filesystem only once for Parquet files with pa.dataset.dataset, and then filtering the data on the key column with a dataset scanner, is much faster:
In [14]: with timectx("scan for parquet files using dataset"):
    ...:     write_to_dataset_dataset = pa.dataset.dataset("write_to_dataset/", partitioning=pa.dataset.partitioning(flavor="hive"))
    ...:
scan for parquet files using dataset 40.98821000661701 ms
In [143]: with timectx("load partitions using dataset"):
     ...:     for key in keys:
     ...:         write_to_dataset_dataset.scanner(filter=(pc.field("key") == key)).to_table()
     ...:
load partitions using dataset 558.3812920376658 ms
In [144]: with timectx("load partitions using read_table"):
     ...:     for key in keys:
     ...:         pyarrow.parquet.read_table("write_to_dataset", filters=[("key", "==", key)])
     ...:
load partitions using read_table 5385.372799937613 ms
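The timectx helper used in these snippets is not defined anywhere in the thread; it is presumably a small timing context manager. A minimal sketch of what it could look like:

import time
from contextlib import contextmanager

@contextmanager
def timectx(label):
    # Hypothetical stand-in for the helper used above: print the wall-clock
    # time of the enclosed block in milliseconds.
    start = time.perf_counter()
    try:
        yield
    finally:
        print(label, (time.perf_counter() - start) * 1000, "ms")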
Thanks for the reply!
Sure, but my use case is more:
I think something similar is a very common use case, where it's not feasible to read the dataset only once and reuse it many times.
This is also how Parquet files are usually read with pandas, polars, and most other dataframe libraries: df_lib.read_parquet('file', filters=[...]).
If that is possible, opening the Parquet file only once with the row group partitioning also gives much better results:
with timectx("load partitions from row group"):
    for key in keys:
        index = key_to_index[key]
        with pyarrow.parquet.ParquetFile("row_group") as pf:
            pf.read_row_group(index)
load partitions from row group 4809.91979999817 ms
Move the file open outside the for loop:
with timectx("load partitions from row group - single file open"):
    with pyarrow.parquet.ParquetFile("row_group") as pf:
        for key in keys:
            index = key_to_index[key]
            pf.read_row_group(index)
load partitions from row group - single file open 196.3233999995282 ms
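The key_to_index mapping used in the two snippets above is never shown in the thread; presumably it maps each partition key to the row group that holds it. A minimal sketch of how it could be rebuilt from the file metadata, assuming a flat schema, one row group per key, and that column statistics were written:

import pyarrow.parquet as pq

with pq.ParquetFile("row_group") as pf:
    meta = pf.metadata
    # Position of the "key" column; valid for a flat (non-nested) schema.
    key_col = meta.schema.to_arrow_schema().get_field_index("key")
    # Each row group holds a single key, so its min statistic identifies it.
    key_to_index = {
        meta.row_group(i).column(key_col).statistics.min: i
        for i in range(meta.num_row_groups)
    }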
For reference, on my PC your suggestion is a little better than re-opening the row group file each time inside the for loop, but it is still ~20 times slower than doing the same with the proposed partitioning scheme.
This gives me the following times:
import pyarrow.compute as pc
import pyarrow.dataset

with timectx("load partitions using read_table - read dataset once"):
    write_to_dataset_dataset = pyarrow.dataset.dataset(
        "write_to_dataset/", partitioning=pyarrow.dataset.partitioning(flavor="hive")
    )
    for key in keys:
        write_to_dataset_dataset.scanner(filter=(pc.field("key") == key)).to_table()
load partitions using read_table - read dataset once 3847.4653999874135 ms
So I still think this scheme has significant advantages compared to hive partitioning.
Describe the enhancement requested
Hi,
I'm testing how best to save a table partitioned over one or more columns, with the objective of reading individual partitions later. While trying this I noticed that the current logic to save, and especially to load, a partitioned table is very slow.
While experimenting with ways of speeding up this use case, I noticed that the row groups of a Parquet file can have arbitrary sizes, and a single file can contain row groups of many different sizes. I think there could be a partitioning scheme that makes use of this feature of Parquet files.
From some initial tests I've done using only the public Python API, the serialization time is generally comparable, while the read time is up to 10x faster as the number of partitions increases. It's likely that the results would be even better if this scheme were natively implemented.
The example code I tried is the following. I'm using pandas to partition the table, but that should not matter much for the timings (if anything it should penalize the row group partitioning).
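The original example snippet did not survive the extraction of this issue. The following is only a hypothetical sketch of how the two layouts being compared could be produced; the data and sizes are made up, and only the paths write_to_dataset and row_group and the names keys and key_to_index come from the timings above:

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical data: a "key" partition column plus some payload values.
keys = list(range(100))
df = pd.DataFrame(
    {"key": np.repeat(keys, 1_000), "value": np.random.rand(100 * 1_000)}
)
table = pa.Table.from_pandas(df, preserve_index=False)

# Current approach: hive-style directory partitioning, one subdir per key.
pq.write_to_dataset(table, "write_to_dataset", partition_cols=["key"])

# Proposed scheme: a single file with one row group per key. Each
# write_table() call starts a new row group, so the i-th key written
# ends up in row group i.
with pq.ParquetWriter("row_group", table.schema) as writer:
    for _, group in df.groupby("key", sort=False):
        writer.write_table(pa.Table.from_pandas(group, preserve_index=False))
key_to_index = {key: i for i, key in enumerate(keys)}

A single partition is then read back either with pyarrow.parquet.read_table("write_to_dataset", filters=[("key", "==", key)]) for the hive layout, or with pf.read_row_group(key_to_index[key]) for the row group layout, as timed in the comments above.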
The results on Windows on my PC are the following:
Using Docker the results are similar.
In both cases I'm using Python 3.10 with the following libraries:
A couple of considerations:
Thanks for reading.
Component(s)
Parquet