Closed data-steve closed 7 years ago
I am in the process of releasing fastparquet today. Would you mind trying your code with the latest master version of fastparquet?
with:

```
pip install git+https://github.com/dask/fastparquet/
```
Here's my output from the same script above. It works. In this one I've included the `_metadata` files in the folder.
Weirdly, though, it seems faster by about half when I don't include the metadata files.
thanks
OK, good to know it works. I would probably ascribe the faster run in your second round to caching, but it would be good to confirm that.
In general, the performance of fastparquet versus arrow will depend on a number of factors, such as the type of data stored and which framework wrote it.
Here are the times for them with no metadata. Each round they ran in a different order, with an interpreter restart between each one. The dataframe comes from 5 snappy.parquet files written out by `pyspark.parquet()` and has shape (126722, 16), with dtypes mostly object plus some int32.
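To keep a comparison like this fair (fresh interpreter, best of several runs so one-off costs don't dominate), a minimal timing harness could be used; this is only a sketch, and the helper name is my own:

```python
import time

def time_call(fn, *args, repeats=3, **kwargs):
    """Run fn several times and return the best wall-clock time in seconds.

    Taking the minimum over repeats reduces the influence of one-off costs
    such as import time, disk caching, or JIT compilation on the first call.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best

# Usage would be e.g. time_call(read_parquet, path, engine="fastparquet"),
# run once per engine, in separate interpreter sessions.
```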
Note that there is a startup cost when calling fastparquet for the first time to compile the numba-decorated functions. You might also be interested in https://github.com/andrix/python-snappy/pull/38 , which will improve performance for you if you were running the above with threads as opposed to processes. Also, I invite you to write the data out with dask/fastparquet and then time the reading - you will likely find a great improvement, especially if you convert as many object columns as appropriate to categorical.
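As a sketch of that last suggestion, here is one way (using pandas, with a made-up dataframe standing in for the Spark output) to convert low-cardinality object columns to categorical before writing:

```python
import pandas as pd

# Hypothetical dataframe standing in for the Spark output:
# mostly object columns plus an int32 column.
df = pd.DataFrame({
    "city": ["NYC", "LA", "NYC", "LA"],  # low-cardinality object column
    "count": [1, 2, 3, 4],
})
df["count"] = df["count"].astype("int32")

# Convert object columns with fewer distinct values than rows to categorical;
# categoricals are much cheaper to round-trip through parquet than objects.
for col in df.select_dtypes(include="object"):
    if df[col].nunique() < len(df):
        df[col] = df[col].astype("category")
```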
Thanks, those are good tips. The typical workflow our boss wants is filtering/munging/joins in Spark, then Python later, because it's assumed to be faster on database dumps of tens of millions of rows each. So that's why the parquet files are the way they are.
I'm trying as much as possible to move everything to Python, and dask is a great option for this. I just have to find time to solve all the logistics that my team already solved with Spark in the pipeline. I'd really love a way to see this all end to end in dask.
~ Steve
After closing out issue #142, I updated pyarrow, fastparquet, and dask from conda-forge:
```
conda install fastparquet pyarrow dask -c conda-forge
```
I refreshed my IDE's Python interpreter and did some pyarrow stuff on just one file and on a whole glob of files. All of that works great.
With arrow as the engine I can read the whole file glob or just one file, even without the metadata files, which I previously had to generate with `writer.merge(filelist[1:])` to get fastparquet to work. I think pyarrow auto-generates the metadata itself off the first parquet file in the glob, as of a couple of months ago.

When I switch the engine to `auto` and try to use the same syntax as with arrow as the engine, I get type errors because of how the path has to be specified. If I pass just one file to it, I get a new error, which showed up in #127 also.
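For reference, that metadata-generation step looks roughly like the following; the directory and part-file names are made up, and the actual `fastparquet.writer.merge` call is shown commented out rather than executed here:

```python
import glob
import os
import tempfile

# Hypothetical directory of part files, as Spark writes them.
data_path = tempfile.mkdtemp()
for i in range(3):
    open(os.path.join(data_path, f"part-{i:05d}.snappy.parquet"), "w").close()

# Collect the part files in a deterministic order; fastparquet's
# writer.merge(filelist) would then consolidate their footers into a
# single _metadata file alongside them.
filelist = sorted(glob.glob(os.path.join(data_path, "*.parquet")))
# import fastparquet
# fastparquet.writer.merge(filelist)
```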
I get the same errors with 'fastparquet' as I do with 'auto', which leads me to assume 'auto' means 'fastparquet' at the moment, which the docs confirm.
I also tried changing how I declare where to look for the data, e.g.

`df = read_parquet(data_path+"/", engine='fastparquet')`

because of what I read in issues like #137.

Aside from the errors I'm getting, from a user-experience perspective (which is all I'm qualified to give), if `dask.read_parquet` is going to allow 2 engines, it would be nice to have them both work with the same top-level API in terms of what gets fed into the `path` param. Otherwise, maybe 'arrow' should be the default for 'auto', since it just works on a single file or on many files, without the extra step of `writer.merge(filelist[1:])` when there is no metadata.
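In the meantime, the path difference could even be papered over in user code with a small shim. This is only a sketch of the idea (the helper is hypothetical, not part of dask, and reflects the behavior described above rather than any documented contract):

```python
import os
import tempfile

def normalize_parquet_path(path, engine):
    """Hypothetical helper for the mismatch described above: pyarrow reads a
    directory of part files directly, while fastparquet (at the time of this
    issue) wanted a _metadata file or an explicit glob."""
    if engine == "fastparquet" and os.path.isdir(path):
        meta = os.path.join(path, "_metadata")
        return meta if os.path.exists(meta) else os.path.join(path, "*.parquet")
    return path

# Demo on an empty temporary directory (no _metadata present).
demo_dir = tempfile.mkdtemp()
print(normalize_parquet_path(demo_dir, "pyarrow"))      # the directory itself
print(normalize_parquet_path(demo_dir, "fastparquet"))  # falls back to a glob
```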