apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.04k stars 1.14k forks source link

Implement a scalar function for creating ScalarValue::Map #11268

Closed goldmedal closed 2 months ago

goldmedal commented 3 months ago

Is your feature request related to a problem or challenge?

After #11224, we have ScalarValue::Map for MapArray. It's better to have a scalar function for it.

Describe the solution you'd like

It would be similar to the make_array and struct functions. I guess it will accept two arrays to construct a map value. Something like

map(key_array, value_array, key_array2, value_array2 ... )

Describe alternatives you've considered

No response

Additional context

No response

goldmedal commented 3 months ago

take

jayzhan211 commented 3 months ago

https://duckdb.org/docs/sql/data_types/map.html

we can follow the syntax from DuckDB, map_from_entries, and convert the syntax MAP {k: v, ...} to function map_from_entries later

alamb commented 3 months ago

https://duckdb.org/docs/sql/data_types/map.html

we can follow the syntax from DuckDB, map_from_entries, and convert the syntax MAP {k: v, ...} to function map_from_entries later

I like this idea very mch

This is also consistent with how struct / named_struct work

Here is named_struct function: https://github.com/apache/datafusion/blob/main/datafusion/functions/src/core/named_struct.rs

Here is the syntax support https://github.com/apache/datafusion/blob/b46d5b7fc65a647d51339f3e6524879aee9810fa/datafusion/functions/src/core/planner.rs#L27-L40

goldmedal commented 3 months ago

Thanks @alamb @jayzhan211

https://duckdb.org/docs/sql/data_types/map.html

we can follow the syntax from DuckDB, map_from_entries, and convert the syntax MAP {k: v, ...} to function map_from_entries later

It's a good idea. I also considered how to implement SQL syntax to support this scalar function, but I had no idea how to do it. It seems UserDefinedSQLPlanner is what I need. It's helpful for me.

I will try to implement the syntax for MAP { x : 1 }. I guess I can follow how the RawDictionaryExpr works.

Besides the SQL syntax, I'm considering what the scalar function would look like. Should we consider the data layout of arrow::MapArray? If needed, I think its parameters could be different from DuckDB's map_from_entries because the data layout of arrow::MapArray allows many slots for a map value, like:

[{a:1,b:2},{c:3},{d:4},{}] 

The function would be used like this:

map_from_entries_array([a, 1, b, 2], [c, 3], [d,4], [])

If we don't need to consider it, we could only allow one element for MapArray. It would be similar to how named_struct() is used:

map_from_entries(a, 1, b, 2, c, 3)

The layout would be

[{a:1,b:2,c:3}]

What do you think?

alamb commented 3 months ago

It's a good idea. I also considered how to implement SQL syntax to support this scalar function, but I had no idea how to do it. It seems UserDefinedSQLPlanner is what I need. It's helpful for me.

I will try to implement the syntax for MAP { x : 1 }. I guess I can follow how the RawDictionaryExpr works.

👍 ❤️

Besides the SQL syntax, I'm considering what the scalar function would look like. Should we consider the data layout of arrow::MapArray? If needed, I think its parameters could be different from DuckDB's map_from_entries because the data layout of arrow::MapArray allows many slots for a map value, like:

[{a:1,b:2},{c:3},{d:4},{}] `

The function would be used like this:

map_from_entries_array([a, 1, b, 2], [c, 3], [d,4], [])

If we don't need to consider it, we could only allow one element for MapArray. It would be similar to how named_struct() is used:

I think following the model for named_struct makes the most sense here

From my reading of https://docs.rs/arrow/latest/arrow/array/struct.MapArray.html the idea is that each element of a MapArray is itself a map with some arbitrary number of key/value pairs

I think we could use https://docs.rs/arrow/latest/arrow/array/builder/struct.MapBuilder.html to create the actual array/value

Syntax Proposal

Maybe we could use a function like make_map to follow the naming convention of make_array: https://datafusion.apache.org/user-guide/sql/scalar_functions.html#make-array

-- create {'foo': 1, 'bar': 2}
select make_map('foo', 1, 'bar', 2);

Implementation suggestion

I recommend doing this in 2 PRs:

  1. Implement test the function (e.g. make_map)
  2. Implement the rewrite from the map literal to make_map

Notes / caveats

I am not sure how complete the MapArray support in arrow-rs is, so it wouldn't surprise me if we hit various unimplemented errors when we tried to do basic operations like compare maps or sort them. I don't think that is a reason not to proceed here, but I wanted to point it out

goldmedal commented 3 months ago

I think following the model for named_struct makes the most sense here

From my reading of https://docs.rs/arrow/latest/arrow/array/struct.MapArray.html the idea is that each element of a MapArray is itself a map with some arbitrary number of key/value pairs

I think we could use https://docs.rs/arrow/latest/arrow/array/builder/struct.MapBuilder.html to create the actual array/value

Syntax Proposal

Maybe we could use a function like make_map to follow the naming convention of make_array: https://datafusion.apache.org/user-guide/sql/scalar_functions.html#make-array

-- create {'foo': 1, 'bar': 2}
select make_map('foo', 1, 'bar', 2);

I have finished a draft function that followed my original design yesterday. I follow how arrow-rs to create a map from strings.

It could be used like DuckDB syntax that accepts two lists to construct a map. (In my design, it could accept many key-value array pairs)

SELECT MAP(['key1', 'key2', 'key3'], [10, 20, 30]);

It may not fit the syntax you propose. I will follow the named_struct model to implement another one.

The reason I didn't use MapBuilder is that I can't find a smart way to get the corresponding ArrayBuilder for the arguments ( MapBuilder requires two initial builders for the key and value ). If we have some existing solutions for this issue, I think using MapBuilder is better.

Implementation suggestion

I recommend doing this in 2 PRs:

  1. Implement test the function (e.g. make_map)
  2. Implement the rewrite from the map literal to make_map

I agree with you. I will implement the function in this issue and have a follow-up one for the map literal.

Notes / caveats

I am not sure how complete the MapArray support in arrow-rs is, so it wouldn't surprise me if we hit various unimplemented errors when we tried to do basic operations like compare maps or sort them. I don't think that is a reason not to proceed here, but I wanted to point it out

Thanks for reminding this problem. I think if I occurred this kind of problem, I might implement the required functions on the DataFusion side first.

jayzhan211 commented 3 months ago

The reason I didn't use MapBuilder is that I can't find a smart way to get the corresponding ArrayBuilder for the arguments ( MapBuilder requires two initial builders for the key and value ). If we have some existing solutions for this issue, I think using MapBuilder is better.

The datatype of key and value should be known before invoke, so we can get the corresponding Builder based on the data type

goldmedal commented 3 months ago

The datatype of key and value should be known before invoke, so we can get the corresponding Builder based on the data type

Indeed. I'll try to use MapBuilder. Thanks :)

jayzhan211 commented 3 months ago

The datatype of key and value should be known before invoke, so we can get the corresponding Builder based on the data type

Indeed. I'll try to use MapBuilder. Thanks :)

btw, there is probably a downside for using MapBuilder, since we need to create different builder for different types (therefore many macros) which easily causes code bloat. We have similar issue while building array function before https://github.com/apache/datafusion/issues/7988. I'm not sure what is the best approach yet (probably current draft function is the best), worth to explore it.

goldmedal commented 3 months ago

btw, there is probably a downside for using MapBuilder, since we need to create different builder for different types (therefore many macros) which easily causes code bloat. We have similar issue while building array function before #7988. I'm not sure what is the best approach yet (probably current draft function is the best), worth to explore it.

Besides the code bloat, I'm also concerned about the performance. Because MapBuilder doesn't provide batch append (I think it's not its main purpose), we can only append data entry by entry. I might expect it to have O(n) complexity.

I did some benchmarks to compare the MapBuilder version and the MapArray::from() version. https://github.com/goldmedal/datafusion/blob/95198c17eddd0a708e61e66076a383d5671646b7/datafusion/functions/src/core/map.rs#L132

I haven't tried any optimizations for them. The results are as follows:

// The first run is for warm-up
Generating 1 random key-value pair
Time elapsed in make_map() is: 449.737µs
Time elapsed in make_map_batch() is: 129.096µs

Generating 10 random key-value pairs
Time elapsed in make_map() is: 24.932µs
Time elapsed in make_map_batch() is: 18.007µs

Generating 50 random key-value pairs
Time elapsed in make_map() is: 34.587µs
Time elapsed in make_map_batch() is: 17.037µs

Generating 100 random key-value pairs
Time elapsed in make_map() is: 66.02µs
Time elapsed in make_map_batch() is: 19.027µs

Generating 500 random key-value pairs
Time elapsed in make_map() is: 586.476µs
Time elapsed in make_map_batch() is: 63.832µs

Generating 1000 random key-value pairs
Time elapsed in make_map() is: 722.771µs
Time elapsed in make_map_batch() is: 47.455µs

The scenario for this function may not be for large maps, but I think the MapArray::from() version always has better performance and won't cause code bloat. I prefer this version.

What do you think? @alamb @jayzhan211

jayzhan211 commented 3 months ago

btw, there is probably a downside for using MapBuilder, since we need to create different builder for different types (therefore many macros) which easily causes code bloat. We have similar issue while building array function before #7988. I'm not sure what is the best approach yet (probably current draft function is the best), worth to explore it.

Besides the code bloat, I'm also concerned about the performance. Because MapBuilder doesn't provide batch append (I think it's not its main purpose), we can only append data entry by entry. I might expect it to have O(n) complexity.

I did some benchmarks to compare the MapBuilder version and the MapArray::from() version. https://github.com/goldmedal/datafusion/blob/95198c17eddd0a708e61e66076a383d5671646b7/datafusion/functions/src/core/map.rs#L132

I haven't tried any optimizations for them. The results are as follows:

// The first run is for warm-up
Generating 1 random key-value pair
Time elapsed in make_map() is: 449.737µs
Time elapsed in make_map_batch() is: 129.096µs

Generating 10 random key-value pairs
Time elapsed in make_map() is: 24.932µs
Time elapsed in make_map_batch() is: 18.007µs

Generating 50 random key-value pairs
Time elapsed in make_map() is: 34.587µs
Time elapsed in make_map_batch() is: 17.037µs

Generating 100 random key-value pairs
Time elapsed in make_map() is: 66.02µs
Time elapsed in make_map_batch() is: 19.027µs

Generating 500 random key-value pairs
Time elapsed in make_map() is: 586.476µs
Time elapsed in make_map_batch() is: 63.832µs

Generating 1000 random key-value pairs
Time elapsed in make_map() is: 722.771µs
Time elapsed in make_map_batch() is: 47.455µs

The scenario for this function may not be for large maps, but I think the MapArray::from() version always has better performance and won't cause code bloat. I prefer this version.

What do you think? @alamb @jayzhan211

make_map_batch version is way faster, we should go with that one.

Upd: I tried to remove clone in MapBuilder version, it still seems slower than manually constructed one

goldmedal commented 3 months ago

make_map_batch version is way faster, we should go with that one.

Upd: I tried to remove clone in MapBuilder version, it still seems slower than manually constructed one Here's the revised version of the text with improved grammar:

Thanks for your effort. I think make_map_batch is faster because it accepts two known-type arrays. We only need to rearrange the layout for the MapArray, which requires some Arc cloning. It's a good way to implement the DuckDB syntax:

SELECT MAP(['a', 'b', 'c'], [1, 2, 3]);

However, I found it hard to provide a function like make_map(k1, v1, k2, v2 ...) that is both efficient and simple. I tried two solutions for it:

  1. Aggregate the keys and values, then pass them to make_map_batch:

    fn make_map_from(args: &[ColumnarValue]) -> Result<ColumnarValue> {
        let mut key_buffer = VecDeque::new();
        let mut value_buffer = VecDeque::new();
    
        args.chunks_exact(2)
            .enumerate()
            .for_each(|(_, chunk)| {
                let key = chunk[0].clone();
                let value = chunk[1].clone();
                key_buffer.push_back(key);
                value_buffer.push_back(value);
            });
    
        let key: Vec<_> = key_buffer.into();
        let value: Vec<_> = value_buffer.into();
    
        let key = ColumnarValue::values_to_arrays(&key)?;
        let value = ColumnarValue::values_to_arrays(&value)?;
    
        make_map_batch(&[key[0].clone(), value[0].clone()])
    }

    The codebase is simple, but the performance is terrible. The bottleneck is ColumnarValue::values_to_arrays. This function is convenient but slow. Actually, this version is slower than the MapBuilder one. :(

  2. MapBuilder: It's the faster solution (compared to the above one) for make_map(k1, v1, k2, v2 ...), but the codebase would become complex and large. In my testing code, I hardcoded the types of keys and values to get the value and create the builder. In a real scenario, we would need to match all types and builders to use MapBuilder. I'm not sure if it's a good approach.

    fn make_map(args: &[ColumnarValue]) -> Result<ColumnarValue> {
        let string_builder = StringBuilder::new();
        let int_builder = Int32Builder::with_capacity(args.len() / 2);
        let mut builder =
            MapBuilder::new(None, string_builder, int_builder);
    
        for (_, chunk) in args.chunks_exact(2).enumerate() {
            let key = chunk[0].clone();
            let value = chunk[1].clone();
            match key {
                ColumnarValue::Scalar(ScalarValue::Utf8(Some(key_scalar))) => {
                    builder.keys().append_value(key_scalar)
                }
                _ => {}
            }
    
            match value {
                ColumnarValue::Scalar(ScalarValue::Int32(Some(value_scalar))) => {
                    builder.values().append_value(value_scalar)
                }
                _ => {}
            }
        }
    
        Ok(ColumnarValue::Array(Arc::new(builder.finish())))
    }

In conclusion, I will provide two functions:

For the user-friendly one, I'll choose the first solution to keep the codebase simpler. I found that named_struct also uses ColumnarValue::values_to_arrays to do something similar. https://github.com/apache/datafusion/blob/08c5345e932f1c5c948751e0d06b1fd99e174efa/datafusion/functions/src/core/named_struct.rs#L60

We could suggest that users who intend to get better performance or create a large map use the map() function.

Some appendixs for the benchmark result:

// first run for warm-up
Generating 1 random key-value pairs
Time elapsed in make_map() is: 391.618µs
Time elapsed in make_map_from() is: 224.568µs
Time elapsed in make_map_batch() is: 18.173µs

Generating 2 random key-value pairs
Time elapsed in make_map() is: 19.848µs
Time elapsed in make_map_from() is: 27.726µs
Time elapsed in make_map_batch() is: 32.628µs

Generating 4 random key-value pairs
Time elapsed in make_map() is: 18.495µs
Time elapsed in make_map_from() is: 37.102µs
Time elapsed in make_map_batch() is: 16.399µs

Generating 8 random key-value pairs
Time elapsed in make_map() is: 20.785µs
Time elapsed in make_map_from() is: 48.596µs
Time elapsed in make_map_batch() is: 42.98µs

Generating 10 random key-value pairs
Time elapsed in make_map() is: 26.3µs
Time elapsed in make_map_from() is: 84.601µs
Time elapsed in make_map_batch() is: 15.55µs

// it's not so common for a map use case
Generating 50 random key-value pairs
Time elapsed in make_map() is: 35.751µs
Time elapsed in make_map_from() is: 332.336µs
Time elapsed in make_map_batch() is: 52.425µs

Generating 100 random key-value pairs
Time elapsed in make_map() is: 178.775µs
Time elapsed in make_map_from() is: 508.017µs
Time elapsed in make_map_batch() is: 184.452µs

Generating 1000 random key-value pairs
Time elapsed in make_map() is: 386.535µs
Time elapsed in make_map_from() is: 3.125679ms
Time elapsed in make_map_batch() is: 45.549µs
jayzhan211 commented 2 months ago

what is your code for benchmarking?

Given the code, it seems you clone ColumnValue and push to VecDeque, we can avoid the clone and use Vec instead of VecDeque, since we don't need push_front.

fn make_map_from(args: &[ColumnarValue]) -> Result<ColumnarValue> {
    let mut key_buffer = VecDeque::new();
    let mut value_buffer = VecDeque::new();

    args.chunks_exact(2)
        .enumerate()
        .for_each(|(_, chunk)| {
            let key = chunk[0].clone();
            let value = chunk[1].clone();
            key_buffer.push_back(key);
            value_buffer.push_back(value);
        });

    let key: Vec<_> = key_buffer.into();
    let value: Vec<_> = value_buffer.into();

    let key = ColumnarValue::values_to_arrays(&key)?;
    let value = ColumnarValue::values_to_arrays(&value)?;

    make_map_batch(&[key[0].clone(), value[0].clone()])
}

You only take the first element, but shouldn't we process all the kv?

    make_map_batch(&[key[0].clone(), value[0].clone()])
goldmedal commented 2 months ago

what is your code for benchmarking?

Here's my testing code. https://github.com/goldmedal/datafusion/blob/feature/11268-scalar-funciton-map-v2/datafusion/functions/src/core/map.rs

Given the code, it seems you clone ColumnValue and push to VecDeque, we can avoid the clone and use Vec instead of VecDeque, since we don't need push_front.

Sounds great. I will try it.

You only take the first element, but shouldn't we process all the kv?

    make_map_batch(&[key[0].clone(), value[0].clone()])

oops. It's an unfinished work. I forgot it. Yes, you're right. We should take all. The key and value vec would be

[datafusion/functions/src/core/map.rs:97:5] key.clone() = [
    StringArray
    [
      "key_265622076",
    ],
    StringArray
    [
      "key_-178310949",
    ],
    StringArray
    [
      "key_-2138625601",
    ],
    StringArray
    [
      "key_-814401492",
    ],
]

I think I need to contact them to be one key array and one value array.

jayzhan211 commented 2 months ago

You can also do the benchmarking in datafusion/functions/benches/map.rs.

Add this to cargo.toml, and run with cargo bench --bench map

[[bench]]
harness = false
name = "map"