dask / fastparquet

python implementation of the parquet columnar file format.
Apache License 2.0

Issues with filtering when using to_pandas #926

Closed: jscottcronin closed this issue 3 months ago

jscottcronin commented 3 months ago

Describe the issue: When I use pandas to load a parquet file with filters, it seems the filters are not applied. If I use engine='pyarrow' in pd.read_parquet, the filters are applied.

With the filter applied, I expect the df to contain only one row: label 'a', value 25.

Minimal Complete Verifiable Example:

import pandas as pd

df = pd.DataFrame({
    'label': ['a', 'b'],
    'value': [25, 30],
})

df.to_parquet('test.parquet', engine='fastparquet')
df = pd.read_parquet('test.parquet', filters=[('label', '==', 'a')], engine='fastparquet')
print(df)

Results:

  label  value
0     a     25
1     b     30

I then tried using fastparquet directly to convert to pandas (this is what pandas calls under the hood), and filtering still does not work:

from fastparquet import ParquetFile
pf = ParquetFile('test.parquet')
df = pf.to_pandas(filters=[('label', '==', 'a')])
print(df)

Results:

  label  value
0     a     25
1     b     30

Anything else we need to know?:

yohplala commented 3 months ago

Hello @jscottcronin. The filters feature in fastparquet acts on row groups: it loads any row group containing values that satisfy the filter. In your case there is a single row group, and it contains compliant values, so it is loaded in full.
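As an illustration (a minimal sketch, not from this thread: the file name and the row_group_offsets values are assumptions, and it relies on column statistics being written, which is fastparquet's default), row-group filtering does drop data once the file actually contains more than one row group, because row groups whose statistics cannot match the filter are skipped:

import pandas as pd
from fastparquet import write, ParquetFile

df = pd.DataFrame({
    'label': ['a', 'b'],
    'value': [25, 30],
})

# Force one row per row group so each row group has its own statistics.
write('test_rg.parquet', df, row_group_offsets=[0, 1])

pf = ParquetFile('test_rg.parquet')
# The row group holding only 'b' is skipped based on its min/max statistics.
print(pf.to_pandas(filters=[('label', '==', 'a')]))
# Expected:
#   label  value
# 0     a     25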

If you want filtering at the row level, you should also set row_filter=True (the default is False).

import pandas as pd

df = pd.DataFrame({
    'label': ['a', 'b'],
    'value': [25, 30],
})

df.to_parquet('test.parquet', engine='fastparquet')
df = pd.read_parquet('test.parquet', filters=[('label', '==', 'a')], row_filter=True, engine='fastparquet')
print(df)
  label  value
0     a     25

If this answers your question, please close the ticket.

martindurant commented 3 months ago

It's worth saying that row_filter is False by default for two reasons:

jscottcronin commented 3 months ago

Thank you for sharing. I wasn't aware of row groups in the Parquet format or of how row filtering is applied. Using row_filter=True resolved this issue.

I'm curious whether you think pandas should apply row_filter=True whenever filters are passed to the fastparquet to_pandas call. That would make the fastparquet path more consistent with the other pandas IO implementations. I can see the upside of a more consistent pandas API (this issue caught me by surprise), but I'd also like to understand whether there are reasons not to make this change in pandas.

martindurant commented 3 months ago

One might suggest to pandas that (engine=="fastparquet" && filters) should imply row_filter=True; however, as I explained, there are reasons for it not to be the default from our point of view, so I tend to think that things should stay the same.
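For anyone wanting that behaviour today without a pandas change, a hypothetical wrapper along those lines (the helper name read_parquet_filtered is not part of pandas or fastparquet) could set the flag itself; extra keyword arguments to pd.read_parquet are forwarded to the engine:

import pandas as pd

def read_parquet_filtered(path, filters=None, engine='fastparquet', **kwargs):
    # Hypothetical convenience wrapper: when using the fastparquet engine with
    # filters, also request row-level filtering instead of row-group filtering.
    if engine == 'fastparquet' and filters:
        kwargs.setdefault('row_filter', True)
    return pd.read_parquet(path, filters=filters, engine=engine, **kwargs)

df = read_parquet_filtered('test.parquet', filters=[('label', '==', 'a')])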