Question: Is the Ballista project providing value to the overall DataFusion project?

andygrove commented 3 years ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I've noticed from various discussions that Ballista is adding some (considerable?) overhead to the development of DataFusion and it doesn't look like there is much Ballista-specific development happening. Also, the integration tests do not currently appear to be working (https://github.com/apache/arrow-datafusion/issues/1272).

Does it still make sense to continue maintaining Ballista here? I am unable to contribute to Ballista myself at the moment and I don't think the project has reached sufficient momentum for others to be motivated to continue driving it forward, although I haven't been following closely and may have missed some discussions.

The benefits of having Ballista in the codebase are, in my opinion:

It helps keep physical plans serializable
It helps test some of the extension points
It serves as an advanced example for using DataFusion in a distributed context

What do others think?

bkmgit commented 3 years ago

What has been learned from running in a distributed context?
How does DataFusion with Ballista differ from other distributed databases and what could it offer to people who might adopt it?
There is interest in distributed memory processing using MPI/UCX (GRPC can also be used for this, but has lower performance though somewhat greater flexibility), but this may target Arrow directly rather than through DataFusion.
One issue that may arise in both settings is fault tolerance and replication.

Hoeze commented 3 years ago

As an interested user, I would highly appreciate if the datafusion project would keep distributed query execution as a first-class citizen, in the hope that at some point Ballista will replace my PySpark setup.

Keeping distributed computation in mind forces the project to use scalable solutions that make good use of available resources
I can just add more machines to my cluster when I need to run my query on a larger scale

Those two points are the main reason why I turned away from Pandas, Polars, etc. to PySpark. Even if PySpark is not as fast as e.g. Polars, it does an awesome job on resource management. If I run my scripts on my laptop, it takes ~20x longer but it will still complete the job. Compared to pandas which just breaks with OOM when your intermediate dataframe is larger than memory.

alamb commented 3 years ago

TDLR is I think Ballista hold much promise of value, but it needs a champion to push it forward; Without a champion, it I fear it will turn into a stagnated / vestigial project which is just overhead and leaves a great amount of potential value unlocked.

I am unable to contribute to Ballista myself at the moment and I don't think the project has reached sufficient momentum for others to be motivated to continue driving it forward, although I haven't been following closely and may have missed some discussions.

I think this is what Ballista is really missing -- a champion / cheerleader to drive its development forward. It seems to have mostly stalled since being brought into this repo.

The benefits of having Ballista in the codebase are, in my opinion: It helps keep physical plans serializable

Yes, I agree this is valuable, (and is actually the largest source of overhead in my opinion) -- any change to LogicalPlan or Expr requires being able to update the protobuf serialization / deserialization code.

I think @rdettai is out for the next few days / weeks but some things he said recently made me think he was planning to focus more on Ballista (perhaps our champion emerges...)

At some point IOx (my company's project) will care about distributed query execution, and at that point we may bring more energy to bear on Ballista, but realistically that is not likely to happen for 6+ more months.

houqp commented 3 years ago

I agree with @alamb that we could really benefit from having someone to actively drive ballista.

Even though I mentioned in my dev list email that ballista introduces extra overhead, I think it acted as a good forcing function for us to design datafusion changes with serailization in mind as you mentioned. For example, It triggered a lot of good discussions around object store serialization in @rdettai 's recent refactoring. If we decided that the overhead is not worth it later, we can just make ballista depend on datafusion by published crate version instead of path so it's less decoupled from datafusion code change. I think this should address all the overhead concerns. I don't think we need to do this right now though.

I think it's best for us to come back to the discuss of moving ballista out of datafusion when it gets significant momentum.

rdettai commented 3 years ago

Thanks @andygrove for starting this discussion. I did feel some overhead induced by Ballista on some of my recent PRs, especially from the serialization part (but not only). I was really happy to do that work, because I find it very interesting to see the internals of a distributed engine, but I could understand that it might be seen as an unpleasant constraint for contributors with limited bandwidth and no interest in the distributed system.

@alamb I am considering trying to merge Ballista with Buzz, but that represents a huge amount of work, because Ballista would require a fair amount of restructuring/modularization for it to work. I would be happy to take the challenge and push Ballista forward, but I would require some sort of sponsorship to work on it in a sustained fashion 😅. I have some leads already, but if you have any idea who might be willing to invest into this I am highly interested.

andygrove commented 3 years ago

Thanks for all the feedback so far. I think @alamb makes a very good point about Ballista needing a champion to drive it forward and I agree with that. Let's see what happens there.

frobnitzem commented 3 years ago

I am coming to this discussion from a different perspective. I recently wrote mpi-list, a py-sparkling-inspired RDD that distributes list slices over MPI. Each process works with its local (sub-list) of elements. Usually, each element is a dataframe.

However, loading up pandas dataframes quickly wastes memory. So, I'm investigating arrow as a memory-friendly replacement. The trouble I am running into is that DataFusion might have too much functionality. My csv files are already split up (many per process), and I already have processes running on an existing cluster via MPI. So I want to execute SQL queries once for each csv file and create a new result dataset distributed the same way as the original.

alamb commented 3 years ago

I would be happy to take the challenge and push Ballista forward, but I would require some sort of sponsorship to work on it in a sustained fashion 😅. I have some leads already, but if you have any idea who might be willing to invest into this I am highly interested.

I don't have anything for you at the moment @rdettai but I'll keep my ears open. @andygrove do you have any ideas (specifically any users / potential users of Ballista that might be willing to contribute time)?

houqp commented 3 years ago

The trouble I am running into is that DataFusion might have too much functionality. My csv files are already split up (many per process), and I already have processes running on an existing cluster via MPI. So I want to execute SQL queries once for each csv file and create a new result dataset distributed the same way as the original.

@frobnitzem I think this is a bit off topic and worth to be discussed in a separate issue. To answer your question, you can just use datafusion as a simple library to query a single csv file in process using sql. You don't have to use ballista.

realno commented 2 years ago

I recently started doing some work around DataFusion and seeing good potential with Ballista to become a candidate of our next-gen analytics engine. At this point I am trying to put in some missing pieces and get it working with some benchmark cases we are using for initial validation. If all goes well the plan is to start a team to help drive this forward. I am really interested in working with anyone who share similar interest to kick start the discussion. Please let me know if there are interests and we can use this thread the form an initial group. @andygrove it would be really helpful if we can pick your brain from time to time, and if you have time maybe share some of the things you were thinking to add.

My personal opinion is that Ballista makes sense to be part of DataFusion. It may potentially affect (and depend on) many things such as partition strategy, join strategy, planner, etc. I recognize there might be some overhead in development, maybe this is an opportunity for us to define some APIs to make it easier. @alamb @houqp Do you guys think this is reasonable?

@rdettai I am hoping you are still interested in this topic. Do you think the challenges you had can be solved by introducing an API layer between DataFusion and Ballista and we make some effort to keep it somewhat stable?

I am working on a high level proposal right now and hopefully can share in the next few weeks. Please feel free to let me know if you have any ideas, I am currently working on this besides my full time job so may need a little time.

houqp commented 2 years ago

This is great news @realno ! Thank you for driving this. I am interested in getting ballista to work with delta tables managed in S3 and potentially use it to replace some of the spark SQL jobs we have at work as a starting point. However, I am also just working on datafusion and ballista in my spare time, so the output is limited.

As for reducing development overhead, ballista currently only depends on public datafusion APIs, so if you have good ideas on how to restrict the api coupling between ballista and datafusion, then it's certainly a good thing. I haven't seen anymore express strong opinion on the overhead yet, so I don't think we are at a stage where we need to invest heavily on this. It is something we can keep an eye on while iterating on ballista.

realno commented 2 years ago

This is great news @realno ! Thank you for driving this. I am interested in getting ballista to work with delta tables managed in S3 and potentially use it to replace some of the spark SQL jobs we have at work as a starting point. However, I am also just working on datafusion and ballista in my spare time, so the output is limited.

As for reducing development overhead, ballista currently only depends on public datafusion APIs, so if you have good ideas on how to restrict the api coupling between ballista and datafusion, then it's certainly a good thing. I haven't seen anymore express strong opinion on the overhead yet, so I don't think we are at a stage where we need to invest heavily on this. It is something we can keep an eye on while iterating on ballista.

Thanks @houqp for the insight. This is good to hear there are already public API boundaries in place. And my hope is very aligned with yours that to replace some Spark Jobs at some point :) I will reach out once I have some thing ready for review/discussion, feel free to let me know if anything I can help with too.

liukun4515 commented 2 years ago

I recently started doing some work around DataFusion and seeing good potential with Ballista to become a candidate of our next-gen analytics engine. At this point I am trying to put in some missing pieces and get it working with some benchmark cases we are using for initial validation. If all goes well the plan is to start a team to help drive this forward. I am really interested in working with anyone who share similar interest to kick start the discussion. Please let me know if there are interests and we can use this thread the form an initial group. @andygrove it would be really helpful if we can pick your brain from time to time, and if you have time maybe share some of the things you were thinking to add.

My personal opinion is that Ballista makes sense to be part of DataFusion. It may potentially affect (and depend on) many things such as partition strategy, join strategy, planner, etc. I recognize there might be some overhead in development, maybe this is an opportunity for us to define some APIs to make it easier. @alamb @houqp Do you guys think this is reasonable?

@rdettai I am hoping you are still interested in this topic. Do you think the challenges you had can be solved by introducing an API layer between DataFusion and Ballista and we make some effort to keep it somewhat stable?

I am working on a high level proposal right now and hopefully can share in the next few weeks. Please feel free to let me know if you have any ideas, I am currently working on this besides my full time job so may need a little time.

Looking forward to seeing your proposal. If you need help or discussion, please feel free to let me know.

matthewmturner commented 2 years ago

@realno @liukun4515 do you have any plans for ballista that you think are worth adding to the Q1 roadmap?

liukun4515 commented 2 years ago

@realno @liukun4515 do you have any plans for ballista that you think are worth adding to the Q1 roadmap?

@matthewmturner

Sorry for my limited time, Maybe I can be involved in Q2 or later. In the Q1, I have other things to do, but I will try my best to track this.

alamb commented 2 years ago

My personal opinion is that Ballista makes sense to be part of DataFusion. It may potentially affect (and depend on) many things such as partition strategy, join strategy, planner, etc. I recognize there might be some overhead in development, maybe this is an opportunity for us to define some APIs to make it easier. @alamb @houqp Do you guys think this is reasonable?

I think this is very reasonable. I view datafusion as a single node, sharable query engine that isn't really designed to be used by itself -- rather it can be used to create systems such as ballista (and other projects)

Having ballista in the same repo as a way to validate datafusion API changes seems valuable to me 👍 and I am happy to have some minor extra overhead on DataFusion API changes. What I really want to see is Ballista actively used / driven forward by the community, which it sounds like @realno is preparing to propose.

cc @yahoNanJing and @gaojun2048 who have opened PRs with contributions for Ballista recently

matthewmturner commented 2 years ago

@alamb Actually I'm quite curious on the point of datafusion not being used standalone.

On my side, my plan was to use datafusion (likely via the Python bindings) until my data size warranted using ballista. I thought it was a nice selling point that I could use the same underlying engine for either single node or distributed compute.

Is having first class support for using datafusion standalone at odds with it being a component of a larger system?

Also aren't the datafusion Python bindings meant to enable using it standalone?

realno commented 2 years ago

@realno @liukun4515 do you have any plans for ballista that you think are worth adding to the Q1 roadmap?

@matthewmturner It is a bit ambitious to answer this question at this moment. My plan is to have a high level proposal first, and I am pretty sure it will be going through a couple of iterations based on community feedback. I think then would be a good time to talk about roadmap and timeline.

matthewmturner commented 2 years ago

@realno ok! IMHO, I think that even mentioning that a plan is being developed is worth mentioning just so we can show that there are plans for ballista development even if they are just in the planning stage. The alternative makes it look like Ballista isn't getting any focus when that isn't the case.

So if okay with you I can put something like "Putting together design docs, plan, and priorities for Ballista"

alamb commented 2 years ago

@alamb Actually I'm quite curious on the point of datafusion not being used standalone.

On my side, my plan was to use datafusion (likely via the Python bindings) until my data size warranted using ballista. I thought it was a nice selling point that I could use the same underlying engine for either single node or distributed compute.

Ah sorry for my confusion -- in my mental model, the python bindings are one example of a system used with datafusion (rather than "datafusion itself" -- which means the contents of the datafusion crate). I realize this terminology is likely not standard and I apologize for any confusion it caused.

My point was I expected the datafusion crate to be used to build many other systems people used directly, rather than directly itself. Which perhaps is obvious

matthewmturner commented 2 years ago

@alamb Got it - thanks for clarifying!

thinkharderdev commented 2 years ago

Late to the party here but my team is very excited about the potential of Ballista and are interested in helping push the project forward.

alamb commented 2 years ago

Given the recent development in ballista, I would say the answer to this question is not an unequivocal "YES" and this ticket doesn't need to remain open

apache / datafusion

Question: Is the Ballista project providing value to the overall DataFusion project? #1273