apache / datafusion-python

Apache DataFusion Python Bindings
https://datafusion.apache.org/python
Apache License 2.0
351 stars 70 forks source link

Proposal to donate Ray SQL to the DataFusion Project (not into the Python subproject) #872

Open austin362667 opened 5 days ago

austin362667 commented 5 days ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

In light of the complexities associated with maintaining and debugging Ballista, the community would like to propose exploring the adoption of Ray SQL within the DataFusion Python project.

Ray SQL, with only 1.7k lines of Rust, is significantly simpler compared to Ballista’s 27k lines. Despite its smaller codebase, Ray SQL has demonstrated the ability to run all TPC-H queries, showcasing its robustness and simplicity. This reduction in complexity could ease maintenance, lower the learning curve for contributors, and provide a functional distributed SQL solution that is much simpler than Ballista because Ray provides so much – there is no need for us to build scheduler and executor processes – we can simply execute Python tasks in the Ray cluster.

If there is enough interest and support, we could start working on bringing the Ray SQL code into the DataFusion Python project.

Describe the solution you'd like

Following options are suggested by @andygrove , all feedbacks are welcome!

Bringing the Ray SQL prototype into the DataFusion Python project

Describe alternatives you've considered

Building a new version inspired by the Ray SQL code

Additional context

We might need to go through the Apache IP clearance process for importing the external Ray SQL codebase in datafusion-contrib which is not part of Apache.

andygrove commented 5 days ago

Ray SQL project: https://github.com/datafusion-contrib/ray-sql

andygrove commented 5 days ago

@franklsf95 fyi

jordandakota commented 5 days ago

Things I'm thinking of that should be showcased or discussed:

edmondop commented 5 days ago

Is the complexity of Ballista coming from cluster management or query planning? If Ray SQL can achieve similar or better performance than Ballista with only 1.7k loc by delegating cluster management to Ray and only keeping the distributed query planning responsibility, that's pretty amazing !

andygrove commented 5 days ago

Is the complexity of Ballista coming from cluster management or query planning?

Ballista and Ray SQL use the same query planning code. The complexity comes from cluster management/communication.

andygrove commented 5 days ago
  • Setup and testing of ballista on tpc-h

Since upgrading to a more recent DataFusion version, Ballista is no longer capable of running the TPC-H suite. It works fine in the 0.12.0 release.

ozankabak commented 5 days ago

I am curious about the right boundary here. If we go through with this, should the code become a part of datafusion-python, or should the former stay as a single node thing focusing on excellent bindings to core, and we should have a "datafusion-ray" that uses datafusion-python as a library?

austin362667 commented 5 days ago

Thanks @ozankabak I'm curious on the boundary, too. One of the goals in Ray SQL is "Drive requirements for DataFusion's Python bindings". So my guess is that if we wanna use datafusion-python as a library for separate datafusion-ray, we might need to ensure APIs of datafusion-ray is a subset of datafusion-python APIs. Otherwise it'll not be able to easily expose complete bindings to python side via PyO3.

ozankabak commented 5 days ago

Indeed. Do you have an idea of how the subset would look like?

andygrove commented 4 days ago

To expand on this, there are several options for adopting the Ray SQL code as part of the DataFusion project.

  1. Create new repo datafusion-ray-sql (possibly renamed to datafusion-ray or something else)
  2. Replace the current Ballista code with Ray SQL (Ballista is already known as the distributed version of DataFusion)
  3. Add as an optional module of the DataFusion Python bindings
vakarisbk commented 4 days ago

100% yes. If Ray SQL can handle all TPC-H queries with just 1.7k lines of code, it’s sounds like a no-brainer. This actually makes a production-ready distributed DataFusion sound achievable and it should attract more contributors. Can't think of a single reason not to go this route.

alamb commented 4 days ago

One question I have is who would maintain this new code? Ballista I think suffers from a slow decline due to lack of active maintenance and community. We should try to avoid the same thing happening to Ray

jordandakota commented 4 days ago

Docs/Readme.md probably needs a quick update for anyone joining the catchup. Distributed shuffle using rays object store was implemented in #33.

Get Outlook for Androidhttps://aka.ms/AAb9ysg


From: Andrew Lamb @.> Sent: Monday, September 16, 2024 6:32:22 PM To: apache/datafusion-python @.> Cc: Jordan Fox @.>; Comment @.> Subject: Re: [apache/datafusion-python] Proposal to Introduce Ray SQL into DataFusion Python (Issue #872)

Caution: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender and know the content is safe.

One question I have is who would maintain this new code? Ballista I think suffers from a slow decline due to lack of active maintenance and community. We should try to avoid the same thing happening to Ray

— Reply to this email directly, view it on GitHubhttps://github.com/apache/datafusion-python/issues/872#issuecomment-2354266142, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AHXVCFKTRUINKJ527GAANVDZW52BNAVCNFSM6AAAAABOIBWHDCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNJUGI3DMMJUGI. You are receiving this because you commented.Message ID: @.***>

andygrove commented 3 days ago

Here is the PR in ray-sql that added support for distributed shuffle using the Ray object store:

https://github.com/datafusion-contrib/ray-sql/pull/33

andygrove commented 3 days ago

One question I have is who would maintain this new code? Ballista I think suffers from a slow decline due to lack of active maintenance and community. We should try to avoid the same thing happening to Ray

That's a great point. If the project were to become part of the Apache DataFusion project then I would certainly put time into maintaining it and helping build community around the project. I am not able to contribute in its current location.

I have recently been attempting to maintain Ballista by upgrading to more recent versions of DataFusion, but the project is large and complex and the original contributors of much of this code are no longer available to help, so it is challenging.

I believe that DataFusion + Ray is an opportunity to start fresh on a solution for distributed DataFusion as a much lighter weight alternative to Ballista and the project is small enough (~40 commits) that it will be easier for new contributors to follow along.

These are the initial tasks that I would plan on working on (with the community, hopefully) if we were to move forward with this proposal.

Another possibility is that interested contributors could start maintaining the project in its current location, but I am not sure who would be able to approve the PRs.

alamb commented 3 days ago

Given that I am +1 on this proposal 🚀

alamb commented 3 days ago

Another possibility is that interested contributors could start maintaining the project in its current location, but I am not sure who would be able to approve the PRs.

I think I have admin rights to the https://github.com/datafusion-contrib org and so can add / remove people from the project as necessary. Just let me know

franklsf95 commented 2 days ago

Really excited to see this happening! I contributed some code to Ray SQL last year (most notably, making distributed shuffle work using Ray's distributed object store), and can help answer any question regarding Ray (I work with people who build Ray, Ray Data, etc.)

On a high level, by building on top of Ray, you get a distributed execution substrate for free. Ray handles managing the cluster, scheduling tasks, managing distributed memory, fault tolerance, to name a few. This would mean basically to use DataFusion as a single-node query execution engine and build all the distributed stuff in Python on top of Ray. If this is in line with the goal of this project (or the DataFusion project), then I think it would be a good way to go.

andygrove commented 2 days ago

I've been thinking some more about where this code should live.

I am now leaning towards putting the code into a new standalone datafusion-ray repository.

My reasoning for this:

timsaucer commented 2 days ago

I have been thinking a lot about this and how it would fit into or use datafusion-python. This is the third case I've come across of needing some kind of way to extend datafusion-python. The other two are delta-rs and a personal project where I could really benefit from adding a custom ExecutionPlan. I keep going back and forth between two ideas

Regarding the question of where it should sit, I would recommend a separate repo datafusion-ray which will force the issue on making datafusion-python more extensible.

austin362667 commented 1 day ago

Seems like everyone is on the same page, preferring DataFusion + Ray over Ballista, we just haven't decide how to do that (so many options).

Based on @andygrove's reasonings, sounds a good idea starting a separate datafusion-ray project under Apache DataFusion, aimed at replacing Ballista. This will keep things clean and avoid mixing datafusion-python with datafusion-ray, preventing a messy, hodgepodge integration. Look forward to it!

Following this way, this proposal is not relevant in datafusion-python anymore. What's the next actionable items? Would love to help~

andygrove commented 1 day ago

It does sound like we have some concensus. The next steps are to start the IP clearance process. I will make a start on that this weekend. It basically involves filling out a form and starting a vote. Once the vote passes, I can create the new datafusion-ray repo and we can push the code there and archive the original repo.

https://incubator.apache.org/ip-clearance/

andygrove commented 1 day ago

@franklsf95 Would you be able to file an ICLA (unless you already have one)? The instructions are at https://www.apache.org/licenses/contributor-agreements.html

@austin362667 Before we can start the IP clearance process, we need to update the existing source files to use the ASF header. Is that something you would be able to take on?

austin362667 commented 1 day ago

@andygrove Thanks for updating the title and yes, I can help. I'll open PR to https://github.com/datafusion-contrib/ray-sql that adds:

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.
austin362667 commented 1 day ago

The PR Add ASF license header

franklsf95 commented 1 day ago

@franklsf95 Would you be able to file an ICLA (unless you already have one)? The instructions are at https://www.apache.org/licenses/contributor-agreements.html

@austin362667 Before we can start the IP clearance process, we need to update the existing source files to use the ASF header. Is that something you would be able to take on?

Just submitted one to secretary@apache.org.

andygrove commented 1 day ago

@austin362667 I created the new repository: https://github.com/apache/datafusion-ray

Could you open a draft PR against this repository to add the code including ASF headers?

Once this is done, the next steps will be:

  1. Hold a vote on the DataFusion mailing list for accepting the donation based on this PR
  2. Once that vote passes, start the IP clearance vote
  3. Once that vote passes, merge the PR

Thanks.

andygrove commented 1 day ago

I ran some benchmarks from @timsaucer's repo where he has upgraded to DataFusion 41.

Performance is looking good.

Screenshot from 2024-09-19 19-55-55

austin362667 commented 1 day ago

Thanks, @andygrove The result looks super promising!

And here is the draft PR https://github.com/apache/datafusion-ray/pull/1 for the Datafusion Ray donation.