apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Easier way to convert between `ParquetExec` and `ParquetExecBuilder` #12737

Closed alamb closed 1 week ago

alamb commented 2 weeks ago

Is your feature request related to a problem or challenge?

In InfluxDB we have several optimizer passes that rearrange `ParquetExec` nodes (e.g. regroup the files of a `ParquetExec`, or break a single `ParquetExec` up into multiple new ones).

Doing this at the moment is somewhat cumbersome, for example: https://github.com/influxdata/influxdb3_core/blob/1eaa4ed5ea147bc24db98d9686e457c124dfd5b7/iox_query/src/physical_optimizer/predicate_pushdown.rs#L55-L77

Describe the solution you'd like

I would like an easier way to go from `ParquetExec` to a `ParquetExecBuilder`, which can be modified and then turned back into a `ParquetExec`.

Describe alternatives you've considered

I suggest:

let parquet_exec: ParquetExec = ...;
// convert parquet_exec into a builder
let mut builder = ParquetExecBuilder::from(parquet_exec); // maybe also support parquet_exec.into_builder()
// ... modify builder
let parquet_exec = builder.build(); // turn back into a ParquetExec

Bonus points if we can make it work for an Arc<ParquetExec> too:

let parquet_exec: Arc<ParquetExec> = ...;
// convert parquet_exec into a builder
let mut builder = ParquetExecBuilder::from(parquet_exec); 
...
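
To make the shape of the proposal concrete, here is a minimal, self-contained sketch of the two conversions. The `ParquetExec` and `ParquetExecBuilder` types below are hypothetical, heavily simplified stand-ins (two fields only), not DataFusion's actual definitions, and `split_per_file_group` is an illustrative helper name rather than an existing function. The `Arc` conversion assumes the plan node is `Clone`, so it can fall back to cloning when the `Arc` is shared.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for DataFusion's `ParquetExec` (assumption: only
/// file groups and an optional predicate, both modeled as strings).
#[derive(Debug, Clone, PartialEq)]
pub struct ParquetExec {
    file_groups: Vec<Vec<String>>,
    predicate: Option<String>,
}

impl ParquetExec {
    pub fn file_groups(&self) -> &[Vec<String>] {
        &self.file_groups
    }

    pub fn predicate(&self) -> Option<&str> {
        self.predicate.as_deref()
    }

    /// Consume the plan node and return a builder holding the same settings.
    pub fn into_builder(self) -> ParquetExecBuilder {
        ParquetExecBuilder {
            file_groups: self.file_groups,
            predicate: self.predicate,
        }
    }
}

/// Hypothetical stand-in for `ParquetExecBuilder`.
#[derive(Debug, Clone, PartialEq)]
pub struct ParquetExecBuilder {
    file_groups: Vec<Vec<String>>,
    predicate: Option<String>,
}

impl ParquetExecBuilder {
    pub fn new(file_groups: Vec<Vec<String>>) -> Self {
        Self { file_groups, predicate: None }
    }

    pub fn with_file_groups(mut self, file_groups: Vec<Vec<String>>) -> Self {
        self.file_groups = file_groups;
        self
    }

    pub fn with_predicate(mut self, predicate: impl Into<String>) -> Self {
        self.predicate = Some(predicate.into());
        self
    }

    pub fn build(self) -> ParquetExec {
        ParquetExec {
            file_groups: self.file_groups,
            predicate: self.predicate,
        }
    }
}

impl From<ParquetExec> for ParquetExecBuilder {
    fn from(exec: ParquetExec) -> Self {
        exec.into_builder()
    }
}

impl From<Arc<ParquetExec>> for ParquetExecBuilder {
    fn from(exec: Arc<ParquetExec>) -> Self {
        // Take ownership if this is the only reference; otherwise clone.
        Arc::try_unwrap(exec)
            .unwrap_or_else(|shared| (*shared).clone())
            .into_builder()
    }
}

/// Illustrative optimizer-style rewrite: one new `ParquetExec` per file group,
/// each keeping the original predicate.
pub fn split_per_file_group(exec: Arc<ParquetExec>) -> Vec<ParquetExec> {
    let builder = ParquetExecBuilder::from(exec);
    let groups = builder.file_groups.clone();
    groups
        .into_iter()
        .map(|group| builder.clone().with_file_groups(vec![group]).build())
        .collect()
}
```

With conversions like these, a pass that splits or regroups files only has to touch the fields it changes; everything else is carried over by the builder.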

Additional context

This is the root use case behind @NGA-TRAN's proposal in https://github.com/apache/datafusion/pull/12726

alamb commented 2 weeks ago

I plan to work on this in the next day or two.