Closed jayzhan211 closed 3 months ago
take
@jayzhan211 I have drafted a PR #11560 for it. The design differs from your proposal, but it makes sense to me. Could you take a look at it?
The core concept is providing an expression function and wrapping make_array
for the arguments. I think we don't need to change the argument format of MapFunc
. However, we need to move map.rs
to functions-array
because it needs make_array
.
Maybe we can rename functions-array
to functions-collection
? I think there are many implementations that could be shared if we want to support more collection types and usage.
@jayzhan211 I have drafted a PR #11560 for it. The design differs from your proposal, but it makes sense to me. Could you take a look at it?
The core concept is providing an expression function and wrapping
make_array
for the arguments. I think we don't need to change the argument format ofMapFunc
. However, we need to movemap.rs
tofunctions-array
because it needsmake_array
.Maybe we can rename
functions-array
tofunctions-collection
? I think there are many implementations that could be shared if we want to support more collection types and usage.
My concern is that it might be slower because of additional make_array()
computation.
My concern is that it might be slower because of additional
make_array()
computation.
I see. I think I can do some benchmarks for them.
Because I have some concerns mentioned in https://github.com/apache/datafusion/pull/11452#discussion_r1682783513 for changing the MapFunc
, I prefer to keep the original design.
I have some concerns for it. If we make MapFunc to accept one array, it would be used like SELECT map([1,2,3,'a','b','c']) After planning, the input array would be ['1','2','3','a','b','c'] because of the type coercion for array elements. I think the behavior is wrong. If we change the signature of MapFunc, we might need to have another implementation to solve it.
I think in this case we should adjust the coercion rule with coerce_types
for MapFunc
Ideally MapFunc should have the arguments that have minimum transformation and computation cost for creating MapArray, so we can get the most efficient implementation.
make_array
transformation, I guess we could do better without it.My concern is that it might be slower because of additional
make_array()
computation.
I followed #11526 to create another implementation for map_func
, called map_one_func
temporarily. I did some benchmarks and found that there are no obvious differences between the performance of the map
and map_one
functions. I also pushed the updated commits in #11560. You could check it.
The benchmark results: (map is the original design and map_one is the new design)
Gnuplot not found, using plotters backend
map_1000 time: [9.5816 ms 9.6505 ms 9.7221 ms]
change: [-3.3395% -1.7212% -0.4746%] (p = 0.02 < 0.05)
Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
map_one_1000 time: [9.7250 ms 9.7995 ms 9.8764 ms]
change: [-3.4189% -0.6339% +1.3546%] (p = 0.68 > 0.05)
No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
I ran the benchmark many times, and each time I got similar results. Referring to the result, I think we can just use the original design here. What do you think?
By the way, I found that the compile time to run the benchmark in the core is very long. It takes about 9 minutes. I'm not sure if that's normal. 😢
jax: ~/git/datafusion/datafusion/core (feature/11546-map-df-api) $ cargo bench --bench map_query_sql
Compiling datafusion-functions v40.0.0 (/Users/jax/git/datafusion/datafusion/functions)
Compiling datafusion-functions-array v40.0.0 (/Users/jax/git/datafusion/datafusion/functions-array)
Compiling datafusion v40.0.0 (/Users/jax/git/datafusion/datafusion/core)
Finished `bench` profile [optimized] target(s) in 8m 53s
pub fn map(keys: Vec<Expr>, values: Vec<Expr>) -> Expr {
let keys = make_array(keys);
let values = make_array(values);
Expr::ScalarFunction(ScalarFunction::new_udf(
map_udf(),
vec![keys, values],
))
}
pub fn map_from_array(keys: Vec<Expr>, values: Vec<Expr>) -> Expr {
let keys = make_array(keys);
let values = make_array(values);
Expr::ScalarFunction(ScalarFunction::new_udf(
map_one_udf(),
vec![keys, values],
))
}
It seems they both compute make_array
, but I think we can avoid make_array
at all.
pub fn map_from_array(keys: Vec<Expr>, values: Vec<Expr>) -> Expr {
let mut args = keys;
args.extend(values);
Expr::ScalarFunction(ScalarFunction::new_udf(
map_one_udf(),
args,
))
}
By the way, I found that the compile time to run the benchmark in the core is very long. It takes about 9 minutes. I'm not sure if that's normal.
I'm not sure whether it is expected, I guess because core
crate has almost all the dependencies therefore many crates to re-complie. I think we could file an issue to track this
I'm not sure whether it is expected, I guess because core crate has almost all the dependencies therefore many crates to re-complie.
It it not the case for running cargo build
in ~/arrow-datafusion
😕
let mut args = keys; args.extend(values);
Oops... Sorry about that. I forgot to remove this. I will provide another benchmark result. Many thanks.
Here is the benchmark result after removing make_array
( I also pushed a new commit to the draft PR):
map_1000 time: [10.105 ms 10.168 ms 10.233 ms]
change: [+0.0989% +1.5780% +2.8979%] (p = 0.02 < 0.05)
Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
map_one_1000 time: [44.081 ms 45.278 ms 46.808 ms]
change: [+1.8229% +4.9320% +8.3942%] (p = 0.00 < 0.05)
Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
3 (3.00%) high mild
6 (6.00%) high severe
I think the result is really bad but I tried to understand why make_array
is efficient. I noticed it uses make_scalar_function
to handle the ColumnarValue
. I guess it could be more efficient than ScalarValue::into_array
.
https://github.com/apache/datafusion/blob/5da7ab300215c44ca5dc16771091890de22af99b/datafusion/functions-array/src/make_array.rs#L102-L104
I will try to use this way to modify the two version and give another benchmark.
Ok, I think it's getting worse.
Gnuplot not found, using plotters backend
map_1000 time: [9.1289 ms 9.1897 ms 9.2537 ms]
change: [-0.7915% +0.1565% +1.1967%] (p = 0.75 > 0.05)
No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
3 (3.00%) high mild
1 (1.00%) high severe
Benchmarking map_one_1000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.4s, or reduce sample count to 70.
map_one_1000 time: [62.787 ms 63.239 ms 63.732 ms]
change: [-2.6609% -1.0620% +0.3899%] (p = 0.17 > 0.05)
No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
I also tried to remove
let mut args = keys;
args.extend(values);
Just pass an args vector to map_from_array
, but it's still slower. I pushed this version to a different branch: https://github.com/goldmedal/datafusion/blob/feature/11546-map-df-api-v4/datafusion/functions/src/core/map.rs
If you're interested, you can check it out.
Actually, I found that make_scalar_function
uses ColumnarValue::values_to_arrays
, so I need to use make_array_inner
to aggregate the primitive arrays.
In conclusion, the original design (using make_array
) is the fastest.
The slower one build up StringArray
with StringArray::from_iter_values
, while the faster one convert ScalarValue to StringArrays and then concat them. I have no idea why the former is 4x slower 😕 .
In theory, I didn't expect this but I don't understand why. We can move on with make_array
approach first.
Maybe we can rename functions-array to functions-collection? I think there are many implementations that could be shared if we want to support more collection types and usage.
functions-nested? For array, struct, map
functions-nested? For array, struct, map
It looks good, but I think we can have another PR for it. It's also related to changing the name of array_expressions
feature.
Dataframe API for
map
expects us to pass args withmake_array
i.e.
I think we could have easier one with without
make_array
To achieve this we may need to change the arguments of
MapFunc
from two array toVec<Expr>
, which the first half arekeys
, another half arevalues
._Originally posted by @jayzhan211 in https://github.com/apache/datafusion/pull/11452#discussion_r1684115054_
Dataframe API is somthing used for building
Expr
Most of them are written in macro if they have similar pattern, others are individual function, like
count_distinct
The idea of
map
is similar to