dask / fastparquet

Python implementation of the Parquet columnar file format.
Apache License 2.0

filter= only filtering on partitions, not on other fields #153

Closed: ryanh-ai closed this issue 7 years ago

ryanh-ai commented 7 years ago

Possibly related to #148

When I do filter= on a field that wasn't included in the partitions, it only filters down to the nearest partition key.

To provide more information, my structure is:

options/
    _metadata
    symbol_letter='S'/
        year=2017/
            quarter=1/
            quarter=2/

The symbol_letter is the first letter of the stock symbol; inside these parquet files there are many symbols. I am trying to slice out 'SPY' using:

df2 = pf2.to_pandas(filters=[('underlyingsymbol', '==', 'SPY'), ('year', '==', 2017)])

But when I test whether the filter worked, I get the below (truncated):

In [8]: df2.underlyingsymbol.unique()
Out[8]:
array([u'S', u'SA', u'SAA', u'SABR', u'SAFM', u'SAFT', u'SAGE', u'SAH',
       u'SAIA', u'SAIC', u'SALE', u'SALM', u'SALT', u'SAM', u'SAN',
       u'SAND', u'SANM', u'SANW', u'SAP', u'SASR', u'SATS', u'SAVE', u'SB',
       u'SBAC', u'SBCF', u'SBGI', u'SBGL', u'SBH', u'SBIO', u'SBLK',
       u'SBNY', u'SBRA', u'SBS', u'SBSI', u'SBUX', u'SBY', u'SC', u'SCAI',
       u'SCCO', u'SCG', u'SCHA', u'SCHB', u'SCHC', u'SCHD', u'SCHE',
       u'SCHF', u'SCHG', u'SCHH', u'SCHL', u'SCHM', u'SCHN', u'SCHP',
       u'SCHV', u'SCHW', u'SCHX', u'SCI', u'SCIF', u'SCJ', u'SCL', u'SCLN',
       u'SCM', u'SCMP', u'SCNB', u'SCO', u'SCOR', u'SCS', u'SCSC', u'SCSS',
       u'SCVL', u'SCWX', u'SCYX', u'SCZ', u'SD', u'SDIV', u'SDLP', u'SDOG',
       u'SDOW', u'SDR', u'SDRL', u'SDS', u'SDT', u'SDY', u'SE', u'SEA',
       u'SEAC', u'SEAS', u'SEDG', u'SEE', u'SEED', u'SEF', u'SEIC', u'SEM',
       u'SEMG', u'SEP', u'SERV', u'SF', u'SFBS', u'SFC', u'SFE', u'SFL',
       u'SFLY', u'SFM', u'SFNC', u'SFR', u'SFS', u'SFUN', u'SGBK', u'SGC',
       u'SGEN', u'SGG', u'SGMO', u'SGMS', u'SGOL', u'SGRY', u'SGU',
       u'SGYP', u'SH', u'SHAK', u'SHEN', u'SHLD', u'SHLM', u'SHLO',
       u'SHLX', u'SHO', u'SHOO', u'SHOP', u'SHOR', u'SHOS', u'SHPG',
       u'SHW', u'SHY', u'SID', u'SIEN', u'SIFY', u'SIG', u'SIGI', u'SIGM',

Note that when I tested the year filter, it clearly worked:

In [11]: df2.year.unique()
Out[11]:
[2017]
Categories (1, int64): [2017]

I am using a clone of the fastparquet repo directly, so I have all the latest commits included.

Ideas?

ryanh-ai commented 7 years ago

I went through the API code for ParquetFile; it seems it is designed to read the entire row-group if any of the filter conditions are met by any items within the row-group.

Is this understanding correct, and is this the intended design?

I suspect that 'slicing' row-wise within a parquet file may be a challenge, so fully implementing the filter may not work. I don't know the parquet format well enough to know whether a full row-level filter is feasible.

Thoughts?

martindurant commented 7 years ago

That's exactly right. Filtering can happen in two ways:

- on the values of directory-based partitions (e.g., your year= directories);
- on the max/min statistics recorded for each row-group, where available.

In both cases, whole row-groups are kept or skipped; rows are never filtered individually within a row-group. A sketch of the resulting usage pattern follows below.

This has come up as a point of confusion before; if the documentation can be improved, I'd be glad to see it.
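For illustration, a minimal sketch of the usual pattern under these semantics (the path and column names are taken from your example above):

import fastparquet

pf = fastparquet.ParquetFile('options/_metadata')
# filters= only prunes whole partitions/row-groups
df = pf.to_pandas(filters=[('year', '==', 2017)])
# exact row-level selection then happens in pandas
df = df[df.underlyingsymbol == 'SPY']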

You may further be interested in dask's dataframe.read_parquet(), which calls fastparquet behind the scenes. You can provide directory-partition-wise filters in the same way, but the resulting dataframe (with pandas syntax) also handles filters on the index intelligently, avoiding unnecessary reads, and automatically filters rows within partitions.
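A rough sketch of the equivalent with dask (assuming dask is installed and the same hypothetical layout as above):

import dask.dataframe as dd

# partition-wise filters prune files up front; the row-wise
# selection is recorded lazily and applied per partition
ddf = dd.read_parquet('options', filters=[('year', '==', 2017)])
spy = ddf[ddf.underlyingsymbol == 'SPY'].compute()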

ryanh-ai commented 7 years ago

Thanks for confirming!

With dataframe.read_parquet(), is it lazily reading the parquet files? I.e., is it pulling the metadata in and then, as you access the dask dataframe, determining which segments it needs to pull off of disk?

Thanks!

martindurant commented 7 years ago

Exactly right.
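A rough sketch of that lazy flow (directory name hypothetical):

import dask.dataframe as dd

ddf = dd.read_parquet('options')   # only the metadata is read here
subset = ddf[ddf.year == 2017]     # still lazy; no data loaded
result = subset.compute()          # row-group data is pulled from disk now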