apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.29k stars 1.19k forks source link

Easier Dataframe API for `map` #11546

Closed jayzhan211 closed 3 months ago

jayzhan211 commented 3 months ago

Dataframe API for map expects us to pass args with make_array

i.e.

map(vec![make_array(vec![lit("a"), lit("b")]), make_array(vec![lit("1"), lit("2")])])

I think we could have easier one with without make_array

map(vec![lit("a"), lit("b")], vec![lit("1"), lit("2")]])

To achieve this we may need to change the arguments of MapFunc from two array to Vec<Expr>, which the first half are keys, another half are values.

_Originally posted by @jayzhan211 in https://github.com/apache/datafusion/pull/11452#discussion_r1684115054_

Dataframe API is somthing used for building Expr

Most of them are written in macro if they have similar pattern, others are individual function, like count_distinct

pub fn count_distinct(expr: Expr) -> Expr {
    Expr::AggregateFunction(datafusion_expr::expr::AggregateFunction::new_udf(
        count_udaf(),
        vec![expr],
        true,
        None,
        None,
        None,
    ))
}

The idea of map is similar to

pub fn map(keys:Vec<Expr>,values:Vec<Expr>) -> Expr {
    let args: Vec<Expr> = concat keys and values
    Expr::ScalarFunction(datafusion_expr::expr::ScalarFunction::new_udf(
        map_udf(),
        vec![args],

    ))
}
goldmedal commented 3 months ago

take

goldmedal commented 3 months ago

@jayzhan211 I have drafted a PR #11560 for it. The design differs from your proposal, but it makes sense to me. Could you take a look at it?

The core concept is providing an expression function and wrapping make_array for the arguments. I think we don't need to change the argument format of MapFunc. However, we need to move map.rs to functions-array because it needs make_array.

Maybe we can rename functions-array to functions-collection? I think there are many implementations that could be shared if we want to support more collection types and usage.

jayzhan211 commented 3 months ago

@jayzhan211 I have drafted a PR #11560 for it. The design differs from your proposal, but it makes sense to me. Could you take a look at it?

The core concept is providing an expression function and wrapping make_array for the arguments. I think we don't need to change the argument format of MapFunc. However, we need to move map.rs to functions-array because it needs make_array.

Maybe we can rename functions-array to functions-collection? I think there are many implementations that could be shared if we want to support more collection types and usage.

My concern is that it might be slower because of additional make_array() computation.

goldmedal commented 3 months ago

My concern is that it might be slower because of additional make_array() computation.

I see. I think I can do some benchmarks for them.

Because I have some concerns mentioned in https://github.com/apache/datafusion/pull/11452#discussion_r1682783513 for changing the MapFunc, I prefer to keep the original design.

jayzhan211 commented 3 months ago

I have some concerns for it. If we make MapFunc to accept one array, it would be used like SELECT map([1,2,3,'a','b','c']) After planning, the input array would be ['1','2','3','a','b','c'] because of the type coercion for array elements. I think the behavior is wrong. If we change the signature of MapFunc, we might need to have another implementation to solve it.

I think in this case we should adjust the coercion rule with coerce_types for MapFunc

jayzhan211 commented 3 months ago

Ideally MapFunc should have the arguments that have minimum transformation and computation cost for creating MapArray, so we can get the most efficient implementation.

  1. k1, v1, k2, v2... is not ideal because we need to arrange it to k1, k2, ... v1, v2 ..., that is why initial MakeMap is beaten by MapFunc.
  2. current implementation has the additional cost of make_array transformation, I guess we could do better without it.
goldmedal commented 3 months ago

My concern is that it might be slower because of additional make_array() computation.

I followed #11526 to create another implementation for map_func, called map_one_func temporarily. I did some benchmarks and found that there are no obvious differences between the performance of the map and map_one functions. I also pushed the updated commits in #11560. You could check it. The benchmark results: (map is the original design and map_one is the new design)

Gnuplot not found, using plotters backend
map_1000                time:   [9.5816 ms 9.6505 ms 9.7221 ms]
                        change: [-3.3395% -1.7212% -0.4746%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

map_one_1000            time:   [9.7250 ms 9.7995 ms 9.8764 ms]
                        change: [-3.4189% -0.6339% +1.3546%] (p = 0.68 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

I ran the benchmark many times, and each time I got similar results. Referring to the result, I think we can just use the original design here. What do you think?

By the way, I found that the compile time to run the benchmark in the core is very long. It takes about 9 minutes. I'm not sure if that's normal. 😢

 jax: ~/git/datafusion/datafusion/core (feature/11546-map-df-api) $ cargo bench --bench map_query_sql
   Compiling datafusion-functions v40.0.0 (/Users/jax/git/datafusion/datafusion/functions)
   Compiling datafusion-functions-array v40.0.0 (/Users/jax/git/datafusion/datafusion/functions-array)
   Compiling datafusion v40.0.0 (/Users/jax/git/datafusion/datafusion/core)
    Finished `bench` profile [optimized] target(s) in 8m 53s
jayzhan211 commented 3 months ago
pub fn map(keys: Vec<Expr>, values: Vec<Expr>) -> Expr {
    let keys = make_array(keys);
    let values = make_array(values);
    Expr::ScalarFunction(ScalarFunction::new_udf(
        map_udf(),
        vec![keys, values],
    ))
}

pub fn map_from_array(keys: Vec<Expr>, values: Vec<Expr>) -> Expr {
    let keys = make_array(keys);
    let values = make_array(values);
    Expr::ScalarFunction(ScalarFunction::new_udf(
        map_one_udf(),
        vec![keys, values],
    ))
}

It seems they both compute make_array, but I think we can avoid make_array at all.

pub fn map_from_array(keys: Vec<Expr>, values: Vec<Expr>) -> Expr {
    let mut args = keys;
    args.extend(values);
    Expr::ScalarFunction(ScalarFunction::new_udf(
        map_one_udf(),
        args,
    ))
}
jayzhan211 commented 3 months ago

By the way, I found that the compile time to run the benchmark in the core is very long. It takes about 9 minutes. I'm not sure if that's normal.

I'm not sure whether it is expected, I guess because core crate has almost all the dependencies therefore many crates to re-complie. I think we could file an issue to track this

jayzhan211 commented 3 months ago

I'm not sure whether it is expected, I guess because core crate has almost all the dependencies therefore many crates to re-complie.

It it not the case for running cargo build in ~/arrow-datafusion 😕

goldmedal commented 3 months ago
    let mut args = keys;
    args.extend(values);

Oops... Sorry about that. I forgot to remove this. I will provide another benchmark result. Many thanks.

goldmedal commented 3 months ago

Here is the benchmark result after removing make_array ( I also pushed a new commit to the draft PR):

map_1000                time:   [10.105 ms 10.168 ms 10.233 ms]
                        change: [+0.0989% +1.5780% +2.8979%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

map_one_1000            time:   [44.081 ms 45.278 ms 46.808 ms]
                        change: [+1.8229% +4.9320% +8.3942%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

I think the result is really bad but I tried to understand why make_array is efficient. I noticed it uses make_scalar_function to handle the ColumnarValue. I guess it could be more efficient than ScalarValue::into_array. https://github.com/apache/datafusion/blob/5da7ab300215c44ca5dc16771091890de22af99b/datafusion/functions-array/src/make_array.rs#L102-L104

I will try to use this way to modify the two version and give another benchmark.

goldmedal commented 3 months ago

Ok, I think it's getting worse.

Gnuplot not found, using plotters backend
map_1000                time:   [9.1289 ms 9.1897 ms 9.2537 ms]
                        change: [-0.7915% +0.1565% +1.1967%] (p = 0.75 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

Benchmarking map_one_1000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.4s, or reduce sample count to 70.
map_one_1000            time:   [62.787 ms 63.239 ms 63.732 ms]
                        change: [-2.6609% -1.0620% +0.3899%] (p = 0.17 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)

I also tried to remove

    let mut args = keys;
    args.extend(values);

Just pass an args vector to map_from_array, but it's still slower. I pushed this version to a different branch: https://github.com/goldmedal/datafusion/blob/feature/11546-map-df-api-v4/datafusion/functions/src/core/map.rs If you're interested, you can check it out.

Actually, I found that make_scalar_function uses ColumnarValue::values_to_arrays, so I need to use make_array_inner to aggregate the primitive arrays.

In conclusion, the original design (using make_array) is the fastest.

jayzhan211 commented 3 months ago

The slower one build up StringArray with StringArray::from_iter_values, while the faster one convert ScalarValue to StringArrays and then concat them. I have no idea why the former is 4x slower 😕 .

jayzhan211 commented 3 months ago

In theory, I didn't expect this but I don't understand why. We can move on with make_array approach first.

jayzhan211 commented 3 months ago

Maybe we can rename functions-array to functions-collection? I think there are many implementations that could be shared if we want to support more collection types and usage.

functions-nested? For array, struct, map

goldmedal commented 3 months ago

functions-nested? For array, struct, map

It looks good, but I think we can have another PR for it. It's also related to changing the name of array_expressions feature.