fugue-project / fugue

A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.
https://fugue-tutorials.readthedocs.io/
Apache License 2.0

What do we do about Modin? #9

Closed goodwanghan closed 2 years ago

goodwanghan commented 4 years ago

Remove or make it work and testable?

goodwanghan commented 4 years ago

It is temporarily removed

talebzeghmi commented 2 years ago

May I ask, why was Modin removed?

goodwanghan commented 2 years ago

> May I ask, why was Modin removed?

That was over a year ago, when the Fugue framework was not mature. Actually, for now we are not sure whether we should add Modin back, because we may be able to support Ray natively. Modin was buggy and slow at that time (compared to native Dask), and I don't know whether it has improved since. @talebzeghmi, we would like to understand what you think about this. Are you a Modin user?

talebzeghmi commented 2 years ago

I'm not a Modin user, but like the idea of Modin and Pandas as true statistical DataFrames (as opposed to Spark/Dask), where Modin can be run on a cluster.

I have a few questions:

  1. Does Fugue use Spark Pandas now when running on Spark?
  2. Can Fugue run on Pandas with a sub-millisecond overhead (e.g. a serving scenario)?
goodwanghan commented 2 years ago

> I'm not a Modin user, but like the idea of Modin and Pandas as true statistical DataFrames (as opposed to Spark/Dask), where Modin can be run on a cluster.
>
> I have a few questions:
>
>   1. Does Fugue use Spark Pandas now when running on Spark?
>   2. Can Fugue run on Pandas with a sub-millisecond overhead (e.g. a serving scenario)?

They are very good questions.

Fugue can easily help users leverage Spark's Pandas UDFs for distributed computing without code changes, and this works on Spark >= 2.4. Read this; it is just a single configuration. And yes, when using Pandas UDFs, execution can be significantly faster.
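As a minimal sketch of what "a single configuration" looks like: the key name `fugue.spark.use_pandas_udf` below is taken from the Fugue docs and should be verified against your Fugue version, and the commented usage is illustrative, not a complete program.

```python
# Sketch: an engine configuration dict asking Fugue's Spark backend
# to execute transformations via Pandas UDFs where possible.
# The exact key name should be checked against the Fugue docs.
engine_conf = {"fugue.spark.use_pandas_udf": True}

# Hypothetical usage with Fugue's transform() on a SparkSession `spark`:
# from fugue import transform
# result = transform(df, my_fn, schema="*", engine=spark, engine_conf=engine_conf)
```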

Philosophically, Fugue doesn't believe a Pandas-like interface is a good way to do distributed computing. So to be clear, we are not another Pandas-like framework.

And we don't have Spark Pandas (Koalas) as a backend. Read this article from Dask, which was once regarded as the distributed Pandas; now they want to differentiate themselves from Pandas. I think this is because Pandas was designed with the assumption that computation happens in a single machine's memory, so its interface lacks many of the important distributed concepts, for example partitioning and persisting.
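To make the partitioning point concrete: in a distributed setting, "partition by key, then apply a function to each partition" is a first-class operation that users should state explicitly. The sketch below emulates that model with plain pandas `groupby` on a toy DataFrame (all names here are illustrative, not Fugue's API).

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "v": [1, 2, 3]})

def per_partition(part: pd.DataFrame) -> pd.DataFrame:
    # Runs once per logical partition; in a distributed engine each
    # partition could be processed on a different worker.
    return part.assign(total=part["v"].sum())

# groupby stands in for an explicit "partition by key" declaration.
out = df.groupby("key", group_keys=False).apply(per_partition)
```

In pandas this is just one of many ways to group; in a distributed framework, making the partitioning explicit is what lets the engine place and persist data sensibly.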

Being Pandas-like, you have to invent the missing parts, but once you invent them, you are no longer 100% consistent with Pandas, interface-wise or behavior-wise. Without a strong guarantee of consistency, what is the point of using the Pandas interface? To keep the mindset? The mindset is problematic too. For example, Pandas always does schema inference for users, but this can cause major performance issues, or even correctness issues, on a distributed framework. While Dask and Koalas have interfaces to specify the schema, how many Pandas users would even care? They just treat it as another Pandas.
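A small illustration of why schema inference is risky when data is split up: if each partition infers its own dtype, two partitions of the same logical column can disagree, and combining them silently degrades the type.

```python
import pandas as pd

# Two "partitions" of the same logical column, with dtypes inferred
# independently, as would happen if distributed workers each inferred
# their own schema.
part1 = pd.DataFrame({"a": [1, 2]})        # inferred as int64
part2 = pd.DataFrame({"a": [None, None]})  # inferred as object

print(part1["a"].dtype, part2["a"].dtype)  # int64 vs object

# Combining the partitions silently loses the integer type.
combined = pd.concat([part1, part2])
print(combined["a"].dtype)  # object
```

Declaring the schema up front, as Dask and Koalas allow, avoids this class of bug entirely.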

In my opinion, neither the interface nor the mindset of Pandas is good for distributed computing. But of course, this is just what we believe, and you can disagree :) In this video, I talked about these points in detail.

Regarding the second question: no, we haven't had a chance to optimize latency for serving scenarios. Our overhead is at the sub-second level for now. But in the future we may consider ultra-low-latency scenarios; there are definitely things we can do about latency. If you are interested, you are very welcome to join us and work on this problem (or even other interesting problems).