EthanRosenthal / medium-data-bakeoff

A Python library bakeoff for medium-sized datasets
MIT License

Re-Enable Dask, add Modin and Dask on Ray #12

Closed sullivancolin closed 1 year ago

sullivancolin commented 1 year ago

This project is great!

I was able to get the dask version to run relatively well by making a small tweak to the way the data is read in. But that might violate the spirit of your bakeoff.

I changed:

df = dd.read_parquet(dataset, index=False)

to

df = dd.read_parquet(dataset, index=False, columns=["station_id", "num_bikes_available"])

This allows dask to only read in the columns needed for the groupby().mean() and run much quicker. But again, that's probably not obvious to typical pandas users and therefore maybe a bit of a cheat.

I also added modin and dask on ray as experiments. I can submit a PR if you would like these to be included. I have not yet tried modin on dask.

Here is the rough benchmark I ended up with:

[Image: benchmark_50]

Let me know your feedback!

EthanRosenthal commented 1 year ago

Awesome!

I do agree that the dask optimization maaaayyyy be a bit cheating. That said, I do like being able to once again see its performance on the plot. I'll add it back into the bakeoff but mark it as dask* with an explanation in the README about this cheat.

In terms of modin and dask on ray, a PR would be very welcome! I tried messing around with modin on ray last night but ran into some issues running it on my machine. We'll see if this is a "my machine" thing or if your implementation fixes my issues.