dask / fastparquet

python implementation of the parquet columnar file format.
Apache License 2.0

Issues with filtering when using to_pandas #926

Closed: jscottcronin closed this issue 3 months ago

jscottcronin commented 3 months ago

Describe the issue: When I use pandas to load a parquet file with filters, it seems the filters are not applied. If I use engine='pyarrow' in pd.read_parquet, the filters are applied.

With the filter applied, I expect the df to contain only one row: label 'a', value 25.

Minimal Complete Verifiable Example:

import pandas as pd

df = pd.DataFrame({
    'label': ['a', 'b'],
    'value': [25, 30],
})

df.to_parquet('test.parquet', engine='fastparquet')
df = pd.read_parquet('test.parquet', filters=[('label', '==', 'a')], engine='fastparquet')
print(df)

Results:

  label  value
0     a     25
1     b     30

I then tried using fastparquet directly to convert to pandas (this is what pandas calls under the hood), and filtering still does not work:

from fastparquet import ParquetFile
pf = ParquetFile('test.parquet')
df = pf.to_pandas(filters=[('label', '==', 'a')])
print(df)

Results:

  label  value
0     a     25
1     b     30

Anything else we need to know?:

yohplala commented 3 months ago

Hello @jscottcronin. The filters feature in fastparquet acts on row groups: it loads any row group containing values that satisfy the filter. In your case there is a single row group, and it contains compliant values, so it is loaded in full.
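As an illustration (a minimal sketch, not from this thread: the file name and the row_group_offsets values are assumptions, and it relies on column statistics being written, which is fastparquet's default), row-group filtering does drop data once the file actually contains more than one row group, because row groups whose statistics cannot match the filter are skipped:

import pandas as pd
from fastparquet import write, ParquetFile

df = pd.DataFrame({
    'label': ['a', 'b'],
    'value': [25, 30],
})

# Force one row per row group so each row group has its own statistics.
write('test_rg.parquet', df, row_group_offsets=[0, 1])

pf = ParquetFile('test_rg.parquet')
# The row group holding only 'b' is skipped based on its min/max statistics.
print(pf.to_pandas(filters=[('label', '==', 'a')]))
# Expected:
#   label  value
# 0     a     25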

If you want filtering at the row level, you should also set row_filter=True (the default is False).

import pandas as pd

df = pd.DataFrame({
    'label': ['a', 'b'],
    'value': [25, 30],
})

df.to_parquet('test.parquet', engine='fastparquet')
df = pd.read_parquet('test.parquet', filters=[('label', '==', 'a')], row_filter=True, engine='fastparquet')
print(df)
  label  value
0     a     25

If this answers your question, please close the ticket.

martindurant commented 3 months ago

It's worth saying that row_filter is False by default for two reasons:

jscottcronin commented 3 months ago

Thank you for sharing. I wasn't aware of row groups in the Parquet format or of how row filtering is applied. Using row_filter=True resolved this issue.

I'm curious whether you think pandas should apply row_filter=True whenever filters are passed to the fastparquet to_pandas call. That would make the fastparquet path more consistent with the other pandas IO implementations. I can see the upside of a more consistent pandas API (this issue caught me by surprise), but I'd also like to understand whether there are reasons not to make this change in pandas.

martindurant commented 3 months ago

One might suggest to pandas that (engine=="fastparquet" && filters) should imply row_filter=True; however, as I explained, there are reasons for it not to be the default from our point of view, so I tend to think that things should stay the same.
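For anyone wanting that behaviour today without a pandas change, a hypothetical wrapper along those lines (the helper name read_parquet_filtered is not part of pandas or fastparquet) could set the flag itself; extra keyword arguments to pd.read_parquet are forwarded to the engine:

import pandas as pd

def read_parquet_filtered(path, filters=None, engine='fastparquet', **kwargs):
    # Hypothetical convenience wrapper: when using the fastparquet engine with
    # filters, also request row-level filtering instead of row-group filtering.
    if engine == 'fastparquet' and filters:
        kwargs.setdefault('row_filter', True)
    return pd.read_parquet(path, filters=filters, engine=engine, **kwargs)

df = read_parquet_filtered('test.parquet', filters=[('label', '==', 'a')])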