Closed jscottcronin closed 3 months ago
Hello @jscottcronin
The filters
feature in fastparquet acts on row_groups
. It will then load any row group that has compliant values.
In your case, you have a single row group, with compliant values, so it is fully loaded.
If you want to achieve filtering at row level, you should also set row_filter=True
(default is False
).
df = pd.DataFrame({
'label': ['a', 'b'],
'value': [25, 30],
})
df.to_parquet('test.parquet', engine='fastparquet')
df = pd.read_parquet('test.parquet', filters=[('label', '==', 'a')], row_filter=True, engine='fastparquet')
print(df)
label value
0 a 25
If this answers your question, please, close the ticket.
It's worth saying that row_filter is False by default for two reasons:
Thank you for sharing. I wasn't aware of the row partitions in the parquet format and how row filtering is applied. Using row_filter=True
did resolve this issue.
I'm curious if you all think pandas should be applying row_filter=True
if filters are submitted in the fastparquet to_pandas calls. This would streamline the fastparquet implementation in pandas to be more consistent with other pandas io implementations. I can see the positives for making the pandas api more consistent (this issue caught me by surprise), but I'd also like to understand if there are reasons that we might not want to implement this change in pandas.
One might suggest to pandas that (engine=="fastparquet" && filters) should imply row_filter=True; however, as I explained, there are reasons for it not to be the default from our point of view, so I tend to think that things should stay the same.
Describe the issue: When I use pandas to load a parquet file with filters, it seems the filters are not applied. If I use engine='pyarrow' in pd.read_parquet, the filters are applied.
What I'm expecting with the filter is for there to be only one row in the df with label: a, value: 25.
Minimal Complete Verifiable Example:
Results:
I then tried to use native fastparquet to convert to pandas and filtering still does not work (this is what pandas is calling in their library):
Results:
Anything else we need to know?:
I've tried this with various formats of filters: list of list of tuples, list of tuples
I've tried with different types of filters: <, >, =, ==, >=, all failing Environment:
Dask version: no dask
Python version: 3.12.2 via pyenv
Operating System: Apple M3 Max, macOS Sonoma 14.4.1
Install method (conda, pip, source): poetry install