apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.13k stars 1.16k forks source link

Looser coupling with tokio #12784

Open brancz opened 1 week ago

brancz commented 1 week ago

Is your feature request related to a problem or challenge?

We would like to use datafusion in a thread-per-core/shared-nothing architecture, for that and other reasons we'd like to not use tokio as the async runtime.

Describe the solution you'd like

Remove uses of concrete tokio things where not truly necessary, and put truly required uses behind a feature so that users that don't want to use (or even pull in) tokio can still accomplish that.

Based on a shallow analysis, the core and physical plan packages would be the most challenging ones, the others all already don't, or use tokio only as a dev dependency, or the usage is minimal and could be easily removed.

Describe alternatives you've considered

I don't see a way other than forking datafusion as an alternative to this.

Additional context

I think it's fine to still treat tokio as the first class citizen, and not provide alternatives within datafusion for cases where it's unavoidable. I imagine anyone who wants this level of access wants extremely high control over I/O abstractions anyway.

jayzhan211 commented 1 week ago

I think tokio by default spawn single thread per core 🤔 https://docs.rs/tokio/latest/tokio/#:~:text=The%20core%20threads%20are%20where,to%20override%20the%20default%20value.

brancz commented 1 week ago

Thread per core doesn't necessarily imply shared nothing, but as I said, there are other reasons for us not to use tokio, for example, another reason is that it's not well simulatable in the context of deterministic simulation testing (madsim exists, but is a bit of a hack).

tustvold commented 1 week ago

For context moving DataFusion away from tokio has been attempted before https://github.com/apache/datafusion/issues/2504, however, this initiative was abandoned in https://github.com/apache/datafusion/pull/6169 partly due to lack of time but also the sheer complexity and unclear return of such an undertaking.

If the intent is to simply allow using runtimes exposing a similar async-executor pattern as tokio (e.g. async-std), I think that is likely achievable. Where it gets tricky is exposing runtimes with fundamentally different interfaces to tokio, such as !Send requirements (e.g. glommio), lack of async support (e.g. rayon), etc... Given discussions on Discord, the latter case I understand is what you're asking for.

the core and physical plan packages would be the most challenging ones I don't see a way other than forking datafusion as an alternative to this.

To be completely honest, forking / re-implementing the physical plan and execution components is likely the only practical way to achieve such a fundamental change at this stage. Trying to rewire the physical plan in-place is likely to end up in knots, especially if the desire is to make fundamental changes to the concurrency model to a thread-per-core / morsel-driven model.

I personally think the concurrency story in DF could be significantly improved https://github.com/apache/datafusion/issues/12393 and we see a lot of users running into issues related to the current story, both as contributors getting tripped up by the complexities of Rust's async, and users getting into issues mixing IO and CPU on the same runtimes.

However, I cannot emphasise enough that this would be a major undertaking that would require significant buy in from the DF community and maintainers, and so it would really be up to them if and how they would wish to take such an initiative forward.

alamb commented 1 week ago

Yeah I think it would come down to

  1. what is the actual usecase (and how many others have the same one)
  2. How much work would it actually entail
  3. Can we muster the effort from the potential users to make the required changes