apache / datafusion-ballista

Apache DataFusion Ballista Distributed Query Engine
https://datafusion.apache.org/ballista
Apache License 2.0

Trim down `BallistaConfig` #1104

Open milenkovicm opened 1 week ago

milenkovicm commented 1 week ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I would like to propose trimming down `BallistaConfig`: make `SessionConfig` the main way to configure Ballista, and reduce `BallistaConfig` to one part of it that holds only the Ballista-specific settings.

Describe the solution you'd like

Merge the following configuration options. Each `ballista.*` key is listed directly above the `datafusion.*` key it would map to:

+-------------------------------------------------------------------------+---------------------------+
| name                                                                    | value                     |
+-------------------------------------------------------------------------+---------------------------+
| ballista.batch.size                                                     | 8192                      |
| datafusion.execution.batch_size                                         | 8192                      |
| ballista.collect_statistics                                             | false                     |
| datafusion.execution.collect_statistics                                 | false                     |
| ballista.optimizer.hash_join_single_partition_threshold                 | 1048576                   |
| datafusion.optimizer.hash_join_single_partition_threshold               | 1048576                   |
| ballista.parquet.pruning                                                | true                      |
| datafusion.execution.parquet.pruning                                    | true                      |
| ballista.repartition.aggregations                                       | true                      |
| datafusion.optimizer.repartition_aggregations                           | true                      |
| ballista.repartition.joins                                              | true                      |
| datafusion.optimizer.repartition_joins                                  | true                      |
| ballista.repartition.windows                                            | true                      |
| datafusion.optimizer.repartition_windows                                | true                      |
| ballista.with_information_schema                                        | false                     |
| datafusion.catalog.information_schema                                   | true                      |
| ballista.shuffle.partitions                                             | 16                        |
| datafusion.execution.target_partitions                                  | 8                         |
| ballista.standalone.parallelism                                         | 8                         |
| datafusion.execution.target_partitions                                  | 8                         |
+-------------------------------------------------------------------------+---------------------------+
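
The pairing in the table above can be written down as a simple lookup. The following is a minimal sketch and not Ballista code: `KEY_MAPPING`, `datafusion_key_for`, and `migrate_settings` are hypothetical names introduced here for illustration, and settings are modelled as a plain string map rather than a real `SessionConfig`.

```rust
use std::collections::HashMap;

/// (legacy Ballista key, DataFusion key) pairs, copied from the table above.
/// Note that two legacy keys point at `datafusion.execution.target_partitions`.
const KEY_MAPPING: &[(&str, &str)] = &[
    ("ballista.batch.size", "datafusion.execution.batch_size"),
    ("ballista.collect_statistics", "datafusion.execution.collect_statistics"),
    (
        "ballista.optimizer.hash_join_single_partition_threshold",
        "datafusion.optimizer.hash_join_single_partition_threshold",
    ),
    ("ballista.parquet.pruning", "datafusion.execution.parquet.pruning"),
    ("ballista.repartition.aggregations", "datafusion.optimizer.repartition_aggregations"),
    ("ballista.repartition.joins", "datafusion.optimizer.repartition_joins"),
    ("ballista.repartition.windows", "datafusion.optimizer.repartition_windows"),
    ("ballista.with_information_schema", "datafusion.catalog.information_schema"),
    ("ballista.shuffle.partitions", "datafusion.execution.target_partitions"),
    ("ballista.standalone.parallelism", "datafusion.execution.target_partitions"),
];

/// Hypothetical helper: returns the DataFusion key a legacy Ballista key
/// would be merged into, or `None` if the key has no DataFusion counterpart.
pub fn datafusion_key_for(ballista_key: &str) -> Option<&'static str> {
    KEY_MAPPING
        .iter()
        .find(|(legacy, _)| *legacy == ballista_key)
        .map(|(_, target)| *target)
}

/// Hypothetical helper: rewrites legacy keys to their DataFusion names and
/// leaves every other key (for example `ballista.job.name`) untouched.
/// If two legacy keys that share a target are both present, which value
/// survives is unspecified in this sketch.
pub fn migrate_settings(settings: &HashMap<String, String>) -> HashMap<String, String> {
    settings
        .iter()
        .map(|(key, value)| {
            let new_key = match datafusion_key_for(key) {
                Some(target) => target.to_string(),
                None => key.clone(),
            };
            (new_key, value.clone())
        })
        .collect()
}
```

Because `ballista.shuffle.partitions` and `ballista.standalone.parallelism` both land on `datafusion.execution.target_partitions`, a real migration would need an explicit rule for which one takes precedence.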

If we check `/ballista/scheduler/src/state/session_manager.rs`:

pub fn create_datafusion_context(
    ballista_config: &BallistaConfig,
    session_builder: SessionBuilder,
) -> Arc<SessionContext> {
    let config =
        SessionConfig::from_string_hash_map(&ballista_config.settings().clone()).unwrap();
    let config = config
        .with_target_partitions(ballista_config.default_shuffle_partitions())
        .with_batch_size(ballista_config.default_batch_size())
        .with_repartition_joins(ballista_config.repartition_joins())
        .with_repartition_aggregations(ballista_config.repartition_aggregations())
        .with_repartition_windows(ballista_config.repartition_windows())
        .with_collect_statistics(ballista_config.collect_statistics())
        .with_parquet_pruning(ballista_config.parquet_pruning())
        .set_usize(
            "datafusion.optimizer.hash_join_single_partition_threshold",
            ballista_config.hash_join_single_partition_threshold(),
        )
        .set_bool("datafusion.optimizer.enable_round_robin_repartition", false);
    let session_state = session_builder(config);
    Arc::new(SessionContext::new_with_state(session_state))
}

Ballista-specific configuration, which should be preserved:


+-------------------------------------------------------------------------+---------------------------+
| name                                                                    | value                     |
+-------------------------------------------------------------------------+---------------------------+
| ballista.grpc_client_max_message_size                                   | 16777216                  |
| ballista.job.name                                                       |                           |
+-------------------------------------------------------------------------+---------------------------+

We can see that the options to be merged map one-to-one onto DataFusion configuration options.

Note: `datafusion.optimizer.enable_round_robin_repartition` has to be `false`.
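
One way to keep that constraint once users configure everything through DataFusion keys is to overwrite the value before the session is built, much like the `.set_bool(..., false)` call in `create_datafusion_context`. This is a sketch under assumptions: `enforce_round_robin_disabled` is a hypothetical helper, and the settings are again modelled as a plain string map.

```rust
use std::collections::HashMap;

/// Key named in the note above.
const ROUND_ROBIN_KEY: &str = "datafusion.optimizer.enable_round_robin_repartition";

/// Hypothetical helper: forces round-robin repartitioning off, regardless
/// of what the caller supplied, and leaves all other settings unchanged.
pub fn enforce_round_robin_disabled(
    mut settings: HashMap<String, String>,
) -> HashMap<String, String> {
    settings.insert(ROUND_ROBIN_KEY.to_string(), "false".to_string());
    settings
}
```

Silently overriding a user-supplied value is a design choice; rejecting the configuration with an error would be an equally reasonable alternative.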

Describe alternatives you've considered

Keep `BallistaConfig`, but at the moment I see no benefit in keeping it, nor any scenario that `SessionConfig` can't support.

Additional context

As it stands, with the introduction of `SessionContextExt`, `BallistaConfig` has been removed from all public interfaces. With this change, `BallistaConfig` will probably be removed from most internal interfaces as well.

milenkovicm commented 1 week ago

I will take this once #1103 and #1099 are merged.