I went through the API code for ParquetFile, and it seems to be designed to read the entire row-group if any of the filter conditions are met by any items within that row-group.
Is my understanding of the design correct?
I suspect 'slicing' row-wise within a parquet file may be a challenge, so fully implementing a filter may not work - I don't know the parquet format well enough to tell whether that is feasible.
Thoughts?
That's exactly right - filtering can happen two ways:

- by the values encoded in the directory-partitioning structure (the field=value path segments), and
- by the min/max statistics stored in each row-group's metadata.

In both cases a row-group is either read in full or skipped entirely; rows are never dropped within a row-group, so a final row-level selection should be applied to the resulting dataframe in pandas, as sketched below.
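A minimal sketch of that usage pattern (the dataset path and column name here are assumptions, not from this thread):

```python
from fastparquet import ParquetFile

pf = ParquetFile('mydata')  # assumed path to a partitioned dataset

# A row-group is read in full if *any* of its rows could satisfy the
# predicate, judged from partition values and row-group min/max statistics.
df = pf.to_pandas(filters=[('price', '>', 100)])

# Rows inside a loaded row-group are not removed by the filter, so
# finish with an ordinary pandas selection.
df = df[df.price > 100]
```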
This has come up as a point of confusion; if the documentation can be improved, I'd be glad to see it.
You may also be interested in dask's dataframe.read_parquet(), which calls fastparquet behind the scenes. You can provide directory-partition-wise filters in the same way, but the resulting dask dataframe (with pandas syntax) also handles filters on the index intelligently, avoiding unnecessary reads, and automatically filters rows within partitions.
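For example (a sketch; the path, column, and datetime index here are assumed):

```python
import dask.dataframe as dd

# Partition-wise filters are passed the same way as in fastparquet;
# row-groups whose partition values or statistics rule out a match
# are never read at all.
ddf = dd.read_parquet('mydata', engine='fastparquet',
                      filters=[('year', '==', 2017)])

# Selecting on a sorted index only touches the pieces that can
# contain the requested range.
subset = ddf.loc['2017-01-01':'2017-03-31'].compute()
```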
Thanks for confirming!
With dataframe.read_parquet(), is it lazily reading the parquet files? I.e., does it pull in the metadata and then, as you access the dask dataframe, determine which segments it needs to pull off disk?
Thanks!
exactly right
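A small illustration of that laziness (path and column name are assumed):

```python
import dask.dataframe as dd

# Only the metadata is read at this point; no row-group data yet.
ddf = dd.read_parquet('mydata', engine='fastparquet')

# Operations just build up a task graph...
mean_price = ddf[ddf.price > 100].price.mean()

# ...and the needed segments are only pulled off disk on compute().
print(mean_price.compute())
```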
Possibly related to #148
When I use filter= on a field that wasn't included in the partitions, it only filters down to the nearest partition key.
To provide more information, my structure is:
```
options/
├── _metadata
└── symbol_letter='S'/
    └── year=2017/
        ├── quarter=1/
        └── quarter=2/
```
The symbol_letter is the first letter of the stock symbol; inside these parquet files there are many symbols. I am trying to slice out 'SPY' using:
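(The original snippet is missing from the thread; it was presumably something along these lines - a hypothetical reconstruction, with the column name assumed:)

```python
from fastparquet import ParquetFile

# Hypothetical reconstruction; 'symbol' is a regular column inside
# the files, not one of the directory partitions.
pf = ParquetFile('options')
df = pf.to_pandas(filters=[('symbol', '==', 'SPY')])
```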
But when I test whether the filter worked, I get the below (truncated):
Note that when I test the year filter, it is clearly working:
I am using a clone of the fastparquet repo directly, so I have all the latest commits included.
Ideas?