Closed goldmedal closed 2 months ago
take
https://duckdb.org/docs/sql/data_types/map.html
we can follow the syntax from DuckDB, map_from_entries
, and convert the syntax MAP {k: v, ...}
to function map_from_entries
later
https://duckdb.org/docs/sql/data_types/map.html
we can follow the syntax from DuckDB,
map_from_entries
, and convert the syntaxMAP {k: v, ...}
to functionmap_from_entries
later
I like this idea very mch
This is also consistent with how struct
/ named_struct
work
Here is named_struct
function: https://github.com/apache/datafusion/blob/main/datafusion/functions/src/core/named_struct.rs
Here is the syntax support https://github.com/apache/datafusion/blob/b46d5b7fc65a647d51339f3e6524879aee9810fa/datafusion/functions/src/core/planner.rs#L27-L40
Thanks @alamb @jayzhan211
https://duckdb.org/docs/sql/data_types/map.html
we can follow the syntax from DuckDB,
map_from_entries
, and convert the syntaxMAP {k: v, ...}
to functionmap_from_entries
later
It's a good idea. I also considered how to implement SQL syntax to support this scalar function, but I had no idea how to do it. It seems UserDefinedSQLPlanner
is what I need. It's helpful for me.
I will try to implement the syntax for MAP { x : 1 }
. I guess I can follow how the RawDictionaryExpr
works.
Besides the SQL syntax, I'm considering what the scalar function would look like. Should we consider the data layout of arrow::MapArray
? If needed, I think its parameters could be different from DuckDB's map_from_entries
because the data layout of arrow::MapArray
allows many slots for a map value, like:
[{a:1,b:2},{c:3},{d:4},{}]
The function would be used like this:
map_from_entries_array([a, 1, b, 2], [c, 3], [d,4], [])
If we don't need to consider it, we could only allow one element for MapArray
. It would be similar to how named_struct()
is used:
map_from_entries(a, 1, b, 2, c, 3)
The layout would be
[{a:1,b:2,c:3}]
What do you think?
It's a good idea. I also considered how to implement SQL syntax to support this scalar function, but I had no idea how to do it. It seems
UserDefinedSQLPlanner
is what I need. It's helpful for me.I will try to implement the syntax for
MAP { x : 1 }
. I guess I can follow how theRawDictionaryExpr
works.
👍 ❤️
Besides the SQL syntax, I'm considering what the scalar function would look like. Should we consider the data layout of
arrow::MapArray
? If needed, I think its parameters could be different from DuckDB'smap_from_entries
because the data layout ofarrow::MapArray
allows many slots for a map value, like:[{a:1,b:2},{c:3},{d:4},{}] `
The function would be used like this:
map_from_entries_array([a, 1, b, 2], [c, 3], [d,4], [])
If we don't need to consider it, we could only allow one element for
MapArray
. It would be similar to hownamed_struct()
is used:
I think following the model for named_struct
makes the most sense here
From my reading of https://docs.rs/arrow/latest/arrow/array/struct.MapArray.html the idea is that each element of a MapArray
is itself a map with some arbitrary number of key/value pairs
I think we could use https://docs.rs/arrow/latest/arrow/array/builder/struct.MapBuilder.html to create the actual array/value
Maybe we could use a function like make_map
to follow the naming convention of make_array
: https://datafusion.apache.org/user-guide/sql/scalar_functions.html#make-array
-- create {'foo': 1, 'bar': 2}
select make_map('foo', 1, 'bar', 2);
I recommend doing this in 2 PRs:
make_map
)make_map
I am not sure how complete the MapArray support in arrow-rs is, so it wouldn't surprise me if we hit various unimplemented errors when we tried to do basic operations like compare maps or sort them. I don't think that is a reason not to proceed here, but I wanted to point it out
I think following the model for
named_struct
makes the most sense hereFrom my reading of https://docs.rs/arrow/latest/arrow/array/struct.MapArray.html the idea is that each element of a
MapArray
is itself a map with some arbitrary number of key/value pairsI think we could use https://docs.rs/arrow/latest/arrow/array/builder/struct.MapBuilder.html to create the actual array/value
Syntax Proposal
Maybe we could use a function like
make_map
to follow the naming convention ofmake_array
: https://datafusion.apache.org/user-guide/sql/scalar_functions.html#make-array-- create {'foo': 1, 'bar': 2} select make_map('foo', 1, 'bar', 2);
I have finished a draft function that followed my original design yesterday. I follow how arrow-rs to create a map from strings.
It could be used like DuckDB syntax that accepts two lists to construct a map. (In my design, it could accept many key-value array pairs)
SELECT MAP(['key1', 'key2', 'key3'], [10, 20, 30]);
It may not fit the syntax you propose. I will follow the named_struct
model to implement another one.
The reason I didn't use MapBuilder
is that I can't find a smart way to get the corresponding ArrayBuilder for the arguments ( MapBuilder requires two initial builders for the key and value ). If we have some existing solutions for this issue, I think using MapBuilder
is better.
Implementation suggestion
I recommend doing this in 2 PRs:
- Implement test the function (e.g.
make_map
)- Implement the rewrite from the map literal to
make_map
I agree with you. I will implement the function in this issue and have a follow-up one for the map literal.
Notes / caveats
I am not sure how complete the MapArray support in arrow-rs is, so it wouldn't surprise me if we hit various unimplemented errors when we tried to do basic operations like compare maps or sort them. I don't think that is a reason not to proceed here, but I wanted to point it out
Thanks for reminding this problem. I think if I occurred this kind of problem, I might implement the required functions on the DataFusion side first.
The reason I didn't use MapBuilder is that I can't find a smart way to get the corresponding ArrayBuilder for the arguments ( MapBuilder requires two initial builders for the key and value ). If we have some existing solutions for this issue, I think using MapBuilder is better.
The datatype of key and value should be known before invoke
, so we can get the corresponding Builder based on the data type
The datatype of key and value should be known before
invoke
, so we can get the corresponding Builder based on the data type
Indeed. I'll try to use MapBuilder. Thanks :)
The datatype of key and value should be known before
invoke
, so we can get the corresponding Builder based on the data typeIndeed. I'll try to use MapBuilder. Thanks :)
btw, there is probably a downside for using MapBuilder, since we need to create different builder for different types (therefore many macros) which easily causes code bloat. We have similar issue while building array function before https://github.com/apache/datafusion/issues/7988. I'm not sure what is the best approach yet (probably current draft function is the best), worth to explore it.
btw, there is probably a downside for using MapBuilder, since we need to create different builder for different types (therefore many macros) which easily causes code bloat. We have similar issue while building array function before #7988. I'm not sure what is the best approach yet (probably current draft function is the best), worth to explore it.
Besides the code bloat, I'm also concerned about the performance. Because MapBuilder
doesn't provide batch append (I think it's not its main purpose), we can only append data entry by entry. I might expect it to have O(n) complexity.
I did some benchmarks to compare the MapBuilder
version and the MapArray::from()
version.
https://github.com/goldmedal/datafusion/blob/95198c17eddd0a708e61e66076a383d5671646b7/datafusion/functions/src/core/map.rs#L132
I haven't tried any optimizations for them. The results are as follows:
// The first run is for warm-up
Generating 1 random key-value pair
Time elapsed in make_map() is: 449.737µs
Time elapsed in make_map_batch() is: 129.096µs
Generating 10 random key-value pairs
Time elapsed in make_map() is: 24.932µs
Time elapsed in make_map_batch() is: 18.007µs
Generating 50 random key-value pairs
Time elapsed in make_map() is: 34.587µs
Time elapsed in make_map_batch() is: 17.037µs
Generating 100 random key-value pairs
Time elapsed in make_map() is: 66.02µs
Time elapsed in make_map_batch() is: 19.027µs
Generating 500 random key-value pairs
Time elapsed in make_map() is: 586.476µs
Time elapsed in make_map_batch() is: 63.832µs
Generating 1000 random key-value pairs
Time elapsed in make_map() is: 722.771µs
Time elapsed in make_map_batch() is: 47.455µs
The scenario for this function may not be for large maps, but I think the MapArray::from()
version always has better performance and won't cause code bloat. I prefer this version.
What do you think? @alamb @jayzhan211
btw, there is probably a downside for using MapBuilder, since we need to create different builder for different types (therefore many macros) which easily causes code bloat. We have similar issue while building array function before #7988. I'm not sure what is the best approach yet (probably current draft function is the best), worth to explore it.
Besides the code bloat, I'm also concerned about the performance. Because
MapBuilder
doesn't provide batch append (I think it's not its main purpose), we can only append data entry by entry. I might expect it to have O(n) complexity.I did some benchmarks to compare the
MapBuilder
version and theMapArray::from()
version. https://github.com/goldmedal/datafusion/blob/95198c17eddd0a708e61e66076a383d5671646b7/datafusion/functions/src/core/map.rs#L132I haven't tried any optimizations for them. The results are as follows:
// The first run is for warm-up Generating 1 random key-value pair Time elapsed in make_map() is: 449.737µs Time elapsed in make_map_batch() is: 129.096µs Generating 10 random key-value pairs Time elapsed in make_map() is: 24.932µs Time elapsed in make_map_batch() is: 18.007µs Generating 50 random key-value pairs Time elapsed in make_map() is: 34.587µs Time elapsed in make_map_batch() is: 17.037µs Generating 100 random key-value pairs Time elapsed in make_map() is: 66.02µs Time elapsed in make_map_batch() is: 19.027µs Generating 500 random key-value pairs Time elapsed in make_map() is: 586.476µs Time elapsed in make_map_batch() is: 63.832µs Generating 1000 random key-value pairs Time elapsed in make_map() is: 722.771µs Time elapsed in make_map_batch() is: 47.455µs
The scenario for this function may not be for large maps, but I think the
MapArray::from()
version always has better performance and won't cause code bloat. I prefer this version.What do you think? @alamb @jayzhan211
make_map_batch
version is way faster, we should go with that one.
Upd: I tried to remove clone in MapBuilder version, it still seems slower than manually constructed one
make_map_batch
version is way faster, we should go with that one.Upd: I tried to remove clone in MapBuilder version, it still seems slower than manually constructed one Here's the revised version of the text with improved grammar:
Thanks for your effort. I think make_map_batch
is faster because it accepts two known-type arrays. We only need to rearrange the layout for the MapArray
, which requires some Arc
cloning. It's a good way to implement the DuckDB syntax:
SELECT MAP(['a', 'b', 'c'], [1, 2, 3]);
However, I found it hard to provide a function like make_map(k1, v1, k2, v2 ...)
that is both efficient and simple. I tried two solutions for it:
Aggregate the keys and values, then pass them to make_map_batch
:
fn make_map_from(args: &[ColumnarValue]) -> Result<ColumnarValue> {
let mut key_buffer = VecDeque::new();
let mut value_buffer = VecDeque::new();
args.chunks_exact(2)
.enumerate()
.for_each(|(_, chunk)| {
let key = chunk[0].clone();
let value = chunk[1].clone();
key_buffer.push_back(key);
value_buffer.push_back(value);
});
let key: Vec<_> = key_buffer.into();
let value: Vec<_> = value_buffer.into();
let key = ColumnarValue::values_to_arrays(&key)?;
let value = ColumnarValue::values_to_arrays(&value)?;
make_map_batch(&[key[0].clone(), value[0].clone()])
}
The codebase is simple, but the performance is terrible. The bottleneck is ColumnarValue::values_to_arrays
. This function is convenient but slow. Actually, this version is slower than the MapBuilder
one. :(
MapBuilder:
It's the faster solution (compared to the above one) for make_map(k1, v1, k2, v2 ...)
, but the codebase would become complex and large. In my testing code, I hardcoded the types of keys and values to get the value and create the builder. In a real scenario, we would need to match all types and builders to use MapBuilder. I'm not sure if it's a good approach.
fn make_map(args: &[ColumnarValue]) -> Result<ColumnarValue> {
let string_builder = StringBuilder::new();
let int_builder = Int32Builder::with_capacity(args.len() / 2);
let mut builder =
MapBuilder::new(None, string_builder, int_builder);
for (_, chunk) in args.chunks_exact(2).enumerate() {
let key = chunk[0].clone();
let value = chunk[1].clone();
match key {
ColumnarValue::Scalar(ScalarValue::Utf8(Some(key_scalar))) => {
builder.keys().append_value(key_scalar)
}
_ => {}
}
match value {
ColumnarValue::Scalar(ScalarValue::Int32(Some(value_scalar))) => {
builder.values().append_value(value_scalar)
}
_ => {}
}
}
Ok(ColumnarValue::Array(Arc::new(builder.finish())))
}
In conclusion, I will provide two functions:
map(['a', 'b', 'c'], [1, 2, 3])
make_map(k1, v1, k2, v2 ...)
For the user-friendly one, I'll choose the first solution to keep the codebase simpler. I found that named_struct
also uses ColumnarValue::values_to_arrays
to do something similar.
https://github.com/apache/datafusion/blob/08c5345e932f1c5c948751e0d06b1fd99e174efa/datafusion/functions/src/core/named_struct.rs#L60
We could suggest that users who intend to get better performance or create a large map use the map()
function.
Some appendixs for the benchmark result:
// first run for warm-up
Generating 1 random key-value pairs
Time elapsed in make_map() is: 391.618µs
Time elapsed in make_map_from() is: 224.568µs
Time elapsed in make_map_batch() is: 18.173µs
Generating 2 random key-value pairs
Time elapsed in make_map() is: 19.848µs
Time elapsed in make_map_from() is: 27.726µs
Time elapsed in make_map_batch() is: 32.628µs
Generating 4 random key-value pairs
Time elapsed in make_map() is: 18.495µs
Time elapsed in make_map_from() is: 37.102µs
Time elapsed in make_map_batch() is: 16.399µs
Generating 8 random key-value pairs
Time elapsed in make_map() is: 20.785µs
Time elapsed in make_map_from() is: 48.596µs
Time elapsed in make_map_batch() is: 42.98µs
Generating 10 random key-value pairs
Time elapsed in make_map() is: 26.3µs
Time elapsed in make_map_from() is: 84.601µs
Time elapsed in make_map_batch() is: 15.55µs
// it's not so common for a map use case
Generating 50 random key-value pairs
Time elapsed in make_map() is: 35.751µs
Time elapsed in make_map_from() is: 332.336µs
Time elapsed in make_map_batch() is: 52.425µs
Generating 100 random key-value pairs
Time elapsed in make_map() is: 178.775µs
Time elapsed in make_map_from() is: 508.017µs
Time elapsed in make_map_batch() is: 184.452µs
Generating 1000 random key-value pairs
Time elapsed in make_map() is: 386.535µs
Time elapsed in make_map_from() is: 3.125679ms
Time elapsed in make_map_batch() is: 45.549µs
what is your code for benchmarking?
Given the code, it seems you clone ColumnValue and push to VecDeque, we can avoid the clone and use Vec instead of VecDeque, since we don't need push_front
.
fn make_map_from(args: &[ColumnarValue]) -> Result<ColumnarValue> {
let mut key_buffer = VecDeque::new();
let mut value_buffer = VecDeque::new();
args.chunks_exact(2)
.enumerate()
.for_each(|(_, chunk)| {
let key = chunk[0].clone();
let value = chunk[1].clone();
key_buffer.push_back(key);
value_buffer.push_back(value);
});
let key: Vec<_> = key_buffer.into();
let value: Vec<_> = value_buffer.into();
let key = ColumnarValue::values_to_arrays(&key)?;
let value = ColumnarValue::values_to_arrays(&value)?;
make_map_batch(&[key[0].clone(), value[0].clone()])
}
You only take the first element, but shouldn't we process all the kv?
make_map_batch(&[key[0].clone(), value[0].clone()])
what is your code for benchmarking?
Here's my testing code. https://github.com/goldmedal/datafusion/blob/feature/11268-scalar-funciton-map-v2/datafusion/functions/src/core/map.rs
Given the code, it seems you clone ColumnValue and push to VecDeque, we can avoid the clone and use Vec instead of VecDeque, since we don't need
push_front
.
Sounds great. I will try it.
You only take the first element, but shouldn't we process all the kv?
make_map_batch(&[key[0].clone(), value[0].clone()])
oops. It's an unfinished work. I forgot it. Yes, you're right. We should take all. The key and value vec would be
[datafusion/functions/src/core/map.rs:97:5] key.clone() = [
StringArray
[
"key_265622076",
],
StringArray
[
"key_-178310949",
],
StringArray
[
"key_-2138625601",
],
StringArray
[
"key_-814401492",
],
]
I think I need to contact them to be one key array and one value array.
You can also do the benchmarking in datafusion/functions/benches/map.rs
.
Add this to cargo.toml, and run with cargo bench --bench map
[[bench]]
harness = false
name = "map"
Is your feature request related to a problem or challenge?
After #11224, we have
ScalarValue::Map
forMapArray
. It's better to have a scalar function for it.Describe the solution you'd like
It would be similar to the
make_array
andstruct
functions. I guess it will accept two arrays to construct a map value. Something likeDescribe alternatives you've considered
No response
Additional context
No response