Open austin362667 opened 5 days ago
Ray SQL project: https://github.com/datafusion-contrib/ray-sql
@franklsf95 fyi
Things I'm thinking of that should be showcased or discussed:
Is the complexity of Ballista coming from cluster management or query planning? If Ray SQL can achieve similar or better performance than Ballista with only 1.7k loc by delegating cluster management to Ray and only keeping the distributed query planning responsibility, that's pretty amazing !
Is the complexity of Ballista coming from cluster management or query planning?
Ballista and Ray SQL use the same query planning code. The complexity comes from cluster management/communication.
- Setup and testing of ballista on tpc-h
Since upgrading to a more recent DataFusion version, Ballista is no longer capable of running the TPC-H suite. It works fine in the 0.12.0 release.
I am curious about the right boundary here. If we go through with this, should the code become a part of datafusion-python, or should the former stay as a single node thing focusing on excellent bindings to core, and we should have a "datafusion-ray" that uses datafusion-python as a library?
Thanks @ozankabak I'm curious on the boundary, too.
One of the goals in Ray SQL is "Drive requirements for DataFusion's Python bindings". So my guess is that if we wanna use datafusion-python
as a library for separate datafusion-ray
, we might need to ensure APIs of datafusion-ray
is a subset of datafusion-python
APIs. Otherwise it'll not be able to easily expose complete bindings to python side via PyO3.
Indeed. Do you have an idea of how the subset would look like?
To expand on this, there are several options for adopting the Ray SQL code as part of the DataFusion project.
datafusion-ray-sql
(possibly renamed to datafusion-ray
or something else)100% yes. If Ray SQL can handle all TPC-H queries with just 1.7k lines of code, it’s sounds like a no-brainer. This actually makes a production-ready distributed DataFusion sound achievable and it should attract more contributors. Can't think of a single reason not to go this route.
One question I have is who would maintain this new code? Ballista I think suffers from a slow decline due to lack of active maintenance and community. We should try to avoid the same thing happening to Ray
Docs/Readme.md probably needs a quick update for anyone joining the catchup. Distributed shuffle using rays object store was implemented in #33.
Get Outlook for Androidhttps://aka.ms/AAb9ysg
From: Andrew Lamb @.> Sent: Monday, September 16, 2024 6:32:22 PM To: apache/datafusion-python @.> Cc: Jordan Fox @.>; Comment @.> Subject: Re: [apache/datafusion-python] Proposal to Introduce Ray SQL into DataFusion Python (Issue #872)
Caution: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender and know the content is safe.
One question I have is who would maintain this new code? Ballista I think suffers from a slow decline due to lack of active maintenance and community. We should try to avoid the same thing happening to Ray
— Reply to this email directly, view it on GitHubhttps://github.com/apache/datafusion-python/issues/872#issuecomment-2354266142, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AHXVCFKTRUINKJ527GAANVDZW52BNAVCNFSM6AAAAABOIBWHDCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNJUGI3DMMJUGI. You are receiving this because you commented.Message ID: @.***>
Here is the PR in ray-sql that added support for distributed shuffle using the Ray object store:
One question I have is who would maintain this new code? Ballista I think suffers from a slow decline due to lack of active maintenance and community. We should try to avoid the same thing happening to Ray
That's a great point. If the project were to become part of the Apache DataFusion project then I would certainly put time into maintaining it and helping build community around the project. I am not able to contribute in its current location.
I have recently been attempting to maintain Ballista by upgrading to more recent versions of DataFusion, but the project is large and complex and the original contributors of much of this code are no longer available to help, so it is challenging.
I believe that DataFusion + Ray is an opportunity to start fresh on a solution for distributed DataFusion as a much lighter weight alternative to Ballista and the project is small enough (~40 commits) that it will be easier for new contributors to follow along.
These are the initial tasks that I would plan on working on (with the community, hopefully) if we were to move forward with this proposal.
Another possibility is that interested contributors could start maintaining the project in its current location, but I am not sure who would be able to approve the PRs.
Given that I am +1 on this proposal 🚀
Another possibility is that interested contributors could start maintaining the project in its current location, but I am not sure who would be able to approve the PRs.
I think I have admin rights to the https://github.com/datafusion-contrib org and so can add / remove people from the project as necessary. Just let me know
Really excited to see this happening! I contributed some code to Ray SQL last year (most notably, making distributed shuffle work using Ray's distributed object store), and can help answer any question regarding Ray (I work with people who build Ray, Ray Data, etc.)
On a high level, by building on top of Ray, you get a distributed execution substrate for free. Ray handles managing the cluster, scheduling tasks, managing distributed memory, fault tolerance, to name a few. This would mean basically to use DataFusion as a single-node query execution engine and build all the distributed stuff in Python on top of Ray. If this is in line with the goal of this project (or the DataFusion project), then I think it would be a good way to go.
I've been thinking some more about where this code should live.
I am now leaning towards putting the code into a new standalone datafusion-ray
repository.
My reasoning for this:
datafusion-python
project and I don't think that we should put an additional burden on these maintainersdatafusion-python
as a single node thing focusing on excellent bindings to core and then have datafusion-ray
depend on datafusion-python
. This will be an excellent way to ensure that datafusion-python
has sufficient extension points to enable distributed use cases.datafusion-python
repo with unmaintained codeI have been thinking a lot about this and how it would fit into or use datafusion-python
. This is the third case I've come across of needing some kind of way to extend datafusion-python
. The other two are delta-rs and a personal project where I could really benefit from adding a custom ExecutionPlan
. I keep going back and forth between two ideas
datafusion-ray
but other projects like delta-rs
don't want such a requirement.datafusion
so we can use that as a safe intermediary between the projects. I have a couple of minimal examples of doing this in a branch, but as you start getting into anything that touches the SessionContext
the requirements for what is included in the FFI can explode.Regarding the question of where it should sit, I would recommend a separate repo datafusion-ray
which will force the issue on making datafusion-python
more extensible.
Seems like everyone is on the same page, preferring DataFusion
+ Ray
over Ballista
, we just haven't decide how to do that (so many options).
Based on @andygrove's reasonings, sounds a good idea starting a separate datafusion-ray
project under Apache DataFusion, aimed at replacing Ballista. This will keep things clean and avoid mixing datafusion-python
with datafusion-ray
, preventing a messy, hodgepodge integration. Look forward to it!
Following this way, this proposal is not relevant in datafusion-python anymore. What's the next actionable items? Would love to help~
It does sound like we have some concensus. The next steps are to start the IP clearance process. I will make a start on that this weekend. It basically involves filling out a form and starting a vote. Once the vote passes, I can create the new datafusion-ray
repo and we can push the code there and archive the original repo.
@franklsf95 Would you be able to file an ICLA (unless you already have one)? The instructions are at https://www.apache.org/licenses/contributor-agreements.html
@austin362667 Before we can start the IP clearance process, we need to update the existing source files to use the ASF header. Is that something you would be able to take on?
@andygrove Thanks for updating the title and yes, I can help. I'll open PR to https://github.com/datafusion-contrib/ray-sql that adds:
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
The PR Add ASF license header
@franklsf95 Would you be able to file an ICLA (unless you already have one)? The instructions are at https://www.apache.org/licenses/contributor-agreements.html
@austin362667 Before we can start the IP clearance process, we need to update the existing source files to use the ASF header. Is that something you would be able to take on?
Just submitted one to secretary@apache.org.
@austin362667 I created the new repository: https://github.com/apache/datafusion-ray
Could you open a draft PR against this repository to add the code including ASF headers?
Once this is done, the next steps will be:
Thanks.
I ran some benchmarks from @timsaucer's repo where he has upgraded to DataFusion 41.
Performance is looking good.
Thanks, @andygrove The result looks super promising!
And here is the draft PR https://github.com/apache/datafusion-ray/pull/1 for the Datafusion Ray donation.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In light of the complexities associated with maintaining and debugging Ballista, the community would like to propose exploring the adoption of Ray SQL within the DataFusion Python project.
Ray SQL, with only 1.7k lines of Rust, is significantly simpler compared to Ballista’s 27k lines. Despite its smaller codebase, Ray SQL has demonstrated the ability to run all TPC-H queries, showcasing its robustness and simplicity. This reduction in complexity could ease maintenance, lower the learning curve for contributors, and provide a functional distributed SQL solution that is much simpler than Ballista because Ray provides so much – there is no need for us to build scheduler and executor processes – we can simply execute Python tasks in the Ray cluster.
If there is enough interest and support, we could start working on bringing the Ray SQL code into the DataFusion Python project.
Describe the solution you'd like
Bringing the Ray SQL prototype into the DataFusion Python project
Describe alternatives you've considered
Building a new version inspired by the Ray SQL code
Additional context
We might need to go through the Apache IP clearance process for importing the external Ray SQL codebase in
datafusion-contrib
which is not part of Apache.