apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.87k stars 1.11k forks

persistence for `ExecutionContextState`? #755

Open jimexist opened 3 years ago

jimexist commented 3 years ago

Is your feature request related to a problem or challenge?

I wonder if there's any way for `ExecutionContextState` to be persisted, so that it survives across restarts of the binary?

Describe the solution you'd like

SQLite would be a good choice

alamb commented 3 years ago

Maybe using serde might be a good choice so that users could choose what particular persistence mechanism they wanted.
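
To illustrate the idea of letting users pick the persistence mechanism, here is a minimal std-only sketch. Everything in it is hypothetical: `CatalogSnapshot`, `SnapshotStore`, `MemoryStore` and `FileStore` are not DataFusion types, and the hand-written `encode`/`decode` pair only stands in for what a serde derive plus a format crate would provide.

```rust
use std::collections::BTreeMap;
use std::fs;
use std::io;
use std::path::PathBuf;

/// Hypothetical snapshot of the serializable part of a session:
/// table name -> data location. This is NOT a DataFusion type.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct CatalogSnapshot {
    pub tables: BTreeMap<String, String>,
}

impl CatalogSnapshot {
    /// Encode as one `name<TAB>location` line per table.
    /// Assumption: names and locations contain no tabs or newlines.
    /// With serde, a derive and a format crate would replace this.
    pub fn encode(&self) -> String {
        self.tables
            .iter()
            .map(|(name, location)| format!("{name}\t{location}\n"))
            .collect()
    }

    /// Decode the format written by `encode`; lines without a tab are skipped.
    pub fn decode(text: &str) -> Self {
        let tables = text
            .lines()
            .filter_map(|line| line.split_once('\t'))
            .map(|(name, location)| (name.to_string(), location.to_string()))
            .collect();
        CatalogSnapshot { tables }
    }
}

/// The storage backend is a trait, so the caller chooses the mechanism
/// (file, SQLite, object store, ...) outside of the core library.
pub trait SnapshotStore {
    fn save(&mut self, snapshot: &CatalogSnapshot) -> io::Result<()>;
    fn load(&self) -> io::Result<CatalogSnapshot>;
}

/// In-memory backend, mainly useful for tests.
#[derive(Default)]
pub struct MemoryStore {
    text: String,
}

impl SnapshotStore for MemoryStore {
    fn save(&mut self, snapshot: &CatalogSnapshot) -> io::Result<()> {
        self.text = snapshot.encode();
        Ok(())
    }

    fn load(&self) -> io::Result<CatalogSnapshot> {
        Ok(CatalogSnapshot::decode(&self.text))
    }
}

/// File backend that survives a restart of the binary.
pub struct FileStore {
    pub path: PathBuf,
}

impl SnapshotStore for FileStore {
    fn save(&mut self, snapshot: &CatalogSnapshot) -> io::Result<()> {
        fs::write(&self.path, snapshot.encode())
    }

    fn load(&self) -> io::Result<CatalogSnapshot> {
        Ok(CatalogSnapshot::decode(&fs::read_to_string(&self.path)?))
    }
}
```

The point of the split is that the snapshot type knows nothing about where it is stored, so a SQLite-backed store could be added by a user without touching the engine.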

Dandandan commented 3 years ago

Serde sounds like a good option - I would not add SQLite to DataFusion.

What is the exact use case though? I think table / metadata is commonly kept in a data catalog / metastore and configuration is given on startup of the session. Anything other than that? AFAIK Spark doesn't give an option like this?

EricJoy2048 commented 2 years ago

> Serde sounds like a good option - I would not add SQLite to DataFusion.
>
> What is the exact use case though? I think table / metadata is commonly kept in a data catalog / metastore and configuration is given on startup of the session. Anything other than that? AFAIK Spark doesn't give an option like this?

Sometimes we want to create a table with SQL and still be able to use that table after the session is restarted.

alamb commented 2 years ago

It sounds like a use case would be to save all the table providers -- since they can be user-provided (in other Rust code) I am not sure serializing them in the core of DataFusion makes much sense.

Adding some sort of table / session persistence to datafusion-cli (and other users of the core DataFusion) would make sense to me
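
One way a CLI could do this, sketched under assumptions: append every DDL statement to a log file and replay the log at startup. The function names `is_ddl`, `record` and `replay` are hypothetical, nothing here is part of datafusion-cli, and the keyword check is a stand-in for using the real SQL parser.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Returns true for statements worth persisting.
/// Simplification: a keyword prefix check instead of parsing the SQL.
pub fn is_ddl(sql: &str) -> bool {
    let upper = sql.trim_start().to_ascii_uppercase();
    upper.starts_with("CREATE ") || upper.starts_with("DROP ")
}

/// Append a DDL statement to the log, one statement per line.
/// Non-DDL statements are ignored.
/// Simplification: runs of whitespace (including newlines) are collapsed
/// to a single space, which would also alter whitespace inside string
/// literals.
pub fn record(log: &Path, sql: &str) -> io::Result<()> {
    if !is_ddl(sql) {
        return Ok(());
    }
    let one_line = sql.split_whitespace().collect::<Vec<_>>().join(" ");
    let mut file = OpenOptions::new().create(true).append(true).open(log)?;
    writeln!(file, "{one_line}")
}

/// Read the logged statements back in their original order.
/// A missing log file means a fresh session, not an error.
pub fn replay(log: &Path) -> io::Result<Vec<String>> {
    match fs::read_to_string(log) {
        Ok(text) => Ok(text.lines().map(str::to_string).collect()),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(Vec::new()),
        Err(e) => Err(e),
    }
}
```

At startup the CLI would feed each string returned by `replay` back into its normal SQL execution path, so user-provided table providers never need to be serialized by the core library.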

EricJoy2048 commented 2 years ago

> It sounds like a use case would be to save all the table providers -- since they can be user-provided (in other Rust code) I am not sure serializing them in the core of DataFusion makes much sense.
>
> Adding some sort of table / session persistence to datafusion-cli (and other users of the core DataFusion) would make sense to me

`ExecutionContext` only supports creating the catalog from the default. I want to manage catalog and schema information externally so that it can be shared by different `ExecutionContext`s, which is not possible today. If the contents of `ExecutionContextState` could be initialized through a `new(state: Arc<Mutex<ExecutionContextState>>)` method, we could manage this information in the Ballista scheduler and send it to the Ballista executors, so that every DataFusion `ExecutionContext` created in an executor has the same `ExecutionContextState` content.
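
A minimal sketch of the constructor shape being asked for, using stand-in types: `SharedState` and `Context` are hypothetical placeholders for `ExecutionContextState` and `ExecutionContext`, and the single `tables` map is an assumption about what the state would hold.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Stand-in for `ExecutionContextState` (hypothetical fields).
#[derive(Default)]
pub struct SharedState {
    /// table name -> data location
    pub tables: HashMap<String, String>,
}

/// Stand-in for `ExecutionContext`.
pub struct Context {
    state: Arc<Mutex<SharedState>>,
}

impl Context {
    /// Build a context on top of externally managed state instead of
    /// creating a fresh default catalog.
    pub fn new(state: Arc<Mutex<SharedState>>) -> Self {
        Context { state }
    }

    /// Register a table in the shared state.
    pub fn register_table(&self, name: &str, location: &str) {
        let mut state = self.state.lock().unwrap();
        state.tables.insert(name.to_string(), location.to_string());
    }

    /// Look up a table that any context sharing the state has registered.
    pub fn table_location(&self, name: &str) -> Option<String> {
        self.state.lock().unwrap().tables.get(name).cloned()
    }
}
```

Two contexts created from clones of the same `Arc` see each other's tables. Note that an `Arc<Mutex<...>>` only shares state inside one process; getting the same content from a scheduler to executors on other machines would still need some serialized form of the state.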

houqp commented 2 years ago

I also think serde would be a good fit for what we are trying to serialize here.