Open Matthieusalor opened 1 year ago
I've labeled this C++ since the eventual fix will probably need to be there. You are correct that row group filtering is not currently happening in get_fragments. It may not be the simplest thing to fix. I suspect that comment is from the legacy parquet dataset, which may have operated in this fashion.
Unfortunately, we do not load the parquet metadata for every single fragment when a dataset is created. In fact, if you specify a list of files and a schema at dataset creation, we won't load any data at all from disk. So we don't have the statistics at this point.
Describe the bug, including details regarding any error messages, version, and platform.
Based on the documentation (https://arrow.apache.org/docs/6.0/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.get_fragments): "Return fragments matching the optional filter, either using the partition_expression or internal information like Parquet's statistics."
I would have assumed that the following code would return only one fragment. However, the expression seems to be applied only to the partitions: both fragments are returned even though, judging by the statistics, only one matches the predicate.
Version
'11.0.0'
Component(s)
Python