Re-Enable Dask, add Modin and Dask on Ray

EthanRosenthal / medium-data-bakeoff

A python library bakeoff for medium sized datasets

MIT License

This project is great!

I was able to get the dask version to run relatively well by making a small tweak to the way the data is read in. But that might violate the spirit of your bakeoff.

I changed:

df = dd.read_parquet(dataset, index=False)

to:

df = dd.read_parquet(dataset, index=False, columns=["station_id", "num_bikes_available"])

This lets Dask read only the two columns needed for the groupby().mean() instead of the whole dataset, so it runs much quicker. But again, that's probably not obvious to typical pandas users and therefore maybe a bit of a cheat.

I also added Modin and Dask on Ray as experiments. I can submit a PR if you would like these to be included. I have not yet tried Modin on Dask.

Here is the rough benchmark I ended up with:

[Image: benchmark_50]

Let me know your feedback!

Re-Enable Dask, add Modin and Dask on Ray #12