apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.6k stars 3.54k forks source link

[C++] get_fragments filter argument not filtering on statistics #35841

Open Matthieusalor opened 1 year ago

Matthieusalor commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

Based on the documentation https://arrow.apache.org/docs/6.0/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.get_fragments _Return fragments matching the optional filter, either using the partitionexpression or internal information like Parquet’s statistics.

I would have assumed that the following code would return only one fragment. However, the expression seems to be applied only to the partitions as both fragments are being returned even though only one matches the predicate if you look at the statitics

import pyarrow.dataset as pds
import pandas as pd

ds_path = './my_dataset'
df = pd.DataFrame({
    'A': [1,2],
    'B': [1,2]
})
df.to_parquet(ds_path, partition_cols=['A'])

pdataset = pds.dataset(ds_path, format='parquet', partitioning='hive')

fragments = pdataset.get_fragments(
    filter=(pds.field('B') == 1)
)

for f in fragments:
    print('-------------')
    print(f.path)
    print(f.to_table())
    for row_group in f.row_groups:
        print(row_group.statistics)

Version

'11.0.0'

Component(s)

Python

westonpace commented 1 year ago

I've labeled this C++ since the eventual fix will probably need to be there. You are correct that row group filtering is not currently happening in get_fragments. It may not be the simplest thing to fix. I suspect that comment is from the legacy parquet dataset which may have operated in this fashion.

Unfortunately, we do not load the parquet metadata for every single fragment when a dataset is created. In fact, if you specify a list of files and a schema at dataset creation, we won't load any data at all from disk. So we don't have the statistics at this point.